Scene Graph Knowledge Based Text-to-Image Person Re-identification

doi:10.16451/j.cnki.issn1003-6059.202411001


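Abstract: Most existing text-to-image person re-identification methods fine-tune vision-language models such as CLIP (Contrastive Language-Image Pretraining) to adapt them to the person re-identification task and to inherit the strong joint vision-language representations of the pre-trained model. However, these methods usually consider only task adaptation to the downstream re-identification task and overlook the data domain adaptation required by differences in data, and they struggle to capture structured knowledge (understanding object attributes and the relationships between objects). To address these problems, a scene graph knowledge based text-to-image person re-identification method is proposed on the basis of CLIP-ReID, using a two-stage training strategy. In the first stage, the image encoder and text encoder of CLIP are frozen, and prompt learning is used to optimize learnable prompt tokens so that the downstream data domain is aligned with the original training data domain of CLIP, which addresses the domain adaptation problem. In the second stage, semantic negative sampling and a scene graph encoder module are introduced while CLIP is fine-tuned: semantically similar hard samples are first generated from scene graphs and a triplet loss is added as an extra optimization objective, and a scene graph encoder that takes the scene graph as input is then introduced to strengthen the ability of CLIP to acquire structured knowledge in this stage. The effectiveness of the proposed method is verified on three widely used datasets.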
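The semantic negative sampling step can be illustrated with a minimal sketch. The abstract does not say how the scene graph is perturbed, so the code below assumes one simple scheme: replace a single attribute with a different value from a small vocabulary while leaving all relations intact, so the negative caption stays semantically close to the original. The names `SceneGraph`, `COLOR_VOCAB`, `make_hard_negative`, and `to_caption` are hypothetical and are not taken from the paper.

```python
import random
from dataclasses import dataclass

# Hypothetical attribute vocabulary used only for this illustration.
COLOR_VOCAB = ["red", "blue", "black", "white", "green"]


@dataclass(frozen=True)
class SceneGraph:
    # (object, attribute) pairs, e.g. ("jacket", "red")
    attributes: tuple
    # (subject, relation, object) triples, e.g. ("woman", "wearing", "jacket")
    relations: tuple


def make_hard_negative(graph, rng):
    """Return a copy of `graph` with exactly one attribute swapped.

    Assumption: a hard negative differs from the original description
    in a single attribute and keeps every relation unchanged.
    """
    if not graph.attributes:
        raise ValueError("scene graph has no attributes to perturb")
    idx = rng.randrange(len(graph.attributes))
    obj, attr = graph.attributes[idx]
    candidates = [c for c in COLOR_VOCAB if c != attr]
    attrs = list(graph.attributes)
    attrs[idx] = (obj, rng.choice(candidates))
    return SceneGraph(tuple(attrs), graph.relations)


def to_caption(graph):
    """Render the graph as a flat caption, one clause per relation."""
    attr_of = dict(graph.attributes)
    clauses = []
    for subj, rel, obj in graph.relations:
        adj = attr_of.get(obj)
        phrase = f"{adj} {obj}" if adj else obj
        clauses.append(f"{subj} {rel} a {phrase}")
    return " and ".join(clauses)


if __name__ == "__main__":
    rng = random.Random(0)
    graph = SceneGraph(
        attributes=(("jacket", "red"), ("bag", "black")),
        relations=(("woman", "wearing", "jacket"), ("woman", "carrying", "bag")),
    )
    print("positive:", to_caption(graph))
    print("negative:", to_caption(make_hard_negative(graph, rng)))
```

Because only one attribute changes, the negative caption shares most of its tokens with the positive one, which is what makes it hard for a model that ignores attribute-level detail.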

	WANG Jinxi (王晋溪)
	LU Mingming (鲁鸣鸣)

Keywords: scene graph, prompt learning, text-to-image person re-identification (T2IReID), CLIP

Abstract: Most existing text-to-image person re-identification methods fine-tune vision-language models, such as contrastive language-image pretraining (CLIP), to adapt to the person re-identification task and to exploit the strong joint vision-language representation capability of the pre-trained models. However, these methods consider only task adaptation for the downstream re-identification task. They ignore the data domain adaptation required by data differences, and they still have difficulty capturing structured knowledge, such as object attributes and relationships between objects. To solve these problems, a scene graph knowledge based text-to-image person re-identification method is proposed, and a two-stage training strategy is employed. In the first stage, the image encoder and the text encoder of the CLIP model are frozen, and prompt learning is utilized to optimize learnable prompt tokens so that the downstream data domain is adapted to the original training data domain of the CLIP model, which addresses the domain adaptation problem. In the second stage, semantic negative sampling and a scene graph encoder module are introduced while the CLIP model is fine-tuned. First, hard samples with similar semantics are generated from scene graphs, and a triplet loss is introduced as an additional optimization objective. Second, a scene graph encoder takes the scene graph as input, enhancing the ability of CLIP to acquire structured knowledge. The effectiveness of the proposed method is verified on three widely used datasets.
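As a rough illustration of the triplet objective mentioned in the abstract, the sketch below computes a margin-based triplet loss over embedding vectors. The abstract does not give the exact formulation, so the use of cosine similarity, the margin value of 0.2, and the function names are assumptions made for this example only.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on cosine similarity.

    Assumption: the anchor is an image embedding, the positive is the
    embedding of its matching description, and the negative is the
    embedding of a semantically similar hard sample. The loss is zero
    once the positive is more similar than the negative by `margin`.
    """
    sim_pos = cosine_similarity(anchor, positive)
    sim_neg = cosine_similarity(anchor, negative)
    return max(0.0, margin - sim_pos + sim_neg)


if __name__ == "__main__":
    image = [1.0, 0.0]
    matching_text = [0.9, 0.1]
    hard_negative_text = [0.8, 0.3]
    print(triplet_loss(image, matching_text, hard_negative_text))
```

A hard negative whose embedding lies close to the anchor keeps the loss positive, so the gradient continues to push the two apart; an easy negative contributes nothing once the margin is satisfied.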

Key words: Scene Graph; Prompt Learning; Text-to-Image Person Re-identification (T2IReID); Contrastive Language-Image Pretraining (CLIP)

Received: 2024-07-15

CLC number: TP 391.41

Funding: Supported by the National Natural Science Foundation of China (No. U20A20182)

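Corresponding author: LU Mingming, Ph.D., associate professor. Research interests: pattern recognition, deep learning, and computer vision. E-mail: mingminglu@csu.edu.cn.

About the author: WANG Jinxi, master's student. Research interests: deep learning and computer vision. E-mail: 224711027@csu.edu.cn.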

Cite this article:

王晋溪, 鲁鸣鸣. 基于场景图知识的文本到图像行人重识别[J]. 模式识别与人工智能, 2024, 37(11): 947-959.
WANG Jinxi, LU Mingming. Scene Graph Knowledge Based Text-to-Image Person Re-identification. Pattern Recognition and Artificial Intelligence, 2024, 37(11): 947-959.

Link to this article:

http://manu46.magtech.com.cn/Jweb_prai/CN/10.16451/j.cnki.issn1003-6059.202411001 or http://manu46.magtech.com.cn/Jweb_prai/CN/Y2024/V37/I11/947