Pattern Recognition and Artificial Intelligence
Pattern Recognition and Artificial Intelligence, 2024, Vol. 37, Issue (11): 947-959    DOI: 10.16451/j.cnki.issn1003-6059.202411001
Column: Vision-Oriented Object Recognition and Tracking
Scene Graph Knowledge Based Text-to-Image Person Re-identification
WANG Jinxi1, LU Mingming1
1. School of Computer Science and Engineering, Central South University, Changsha 410083

Abstract: Most existing text-to-image person re-identification methods fine-tune vision-language models, such as contrastive language-image pretraining (CLIP), to adapt to the re-identification task and to inherit the strong joint visual-language representations of the pre-trained model. However, these methods only consider task adaptation for the downstream re-identification task, ignoring the data-domain adaptation required by differences between datasets, and they struggle to effectively capture structured knowledge, such as object attributes and the relationships between objects. To address these problems, a scene graph knowledge based text-to-image person re-identification method is proposed, building on CLIP-ReID with a two-stage training strategy. In the first stage, the image encoder and the text encoder of CLIP are frozen, and prompt learning is utilized to optimize learnable prompt tokens so that the downstream data domain is adapted to the original training data domain of CLIP, solving the data-domain adaptation problem. In the second stage, CLIP is fine-tuned while semantic negative sampling and scene graph encoder modules are introduced. First, hard samples with similar semantics are generated via the scene graph, and a triplet loss is introduced as an additional optimization objective. Second, a scene graph encoder is introduced that takes the scene graph as input, enhancing the ability of CLIP to acquire structured knowledge in the second stage. The effectiveness of the proposed method is verified on three widely used datasets.
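The first stage updates only the learnable prompt tokens while both CLIP encoders stay frozen. The toy sketch below illustrates that idea with a frozen random linear map standing in for the text encoder; the dimensions, learning rate, iteration count, and numerical gradient are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Stand-ins for frozen components: a random linear "text encoder" and a
# fixed image feature for one identity. Only `prompt` is trainable.
rng = np.random.default_rng(0)
D, M = 8, 4                               # feature dim, number of prompt tokens
W_frozen = rng.normal(size=(M * D, D))    # frozen text-encoder stand-in
img_feat = rng.normal(size=D)             # frozen image feature
img_feat /= np.linalg.norm(img_feat)

prompt = rng.normal(scale=0.5, size=(M, D))   # learnable prompt tokens

def encode_text(p):
    """Encode prompt tokens with the frozen map; L2-normalize as CLIP does."""
    t = p.reshape(-1) @ W_frozen
    return t / np.linalg.norm(t)

def cosine_distance(p):
    """1 - cosine similarity between the text and image features."""
    return 1.0 - encode_text(p) @ img_feat

# Numerical gradient descent on the prompt only; W_frozen never changes.
lr, eps = 0.05, 1e-4
loss_before = cosine_distance(prompt)
for _ in range(300):
    grad = np.zeros_like(prompt)
    for i in range(M):
        for j in range(D):
            step = np.zeros_like(prompt)
            step[i, j] = eps
            grad[i, j] = (cosine_distance(prompt + step)
                          - cosine_distance(prompt - step)) / (2 * eps)
    prompt -= lr * grad
loss_after = cosine_distance(prompt)
```

Only the prompt moves during optimization; the encoder stand-in and the image feature never change, mirroring how the first stage adapts the data domain without touching CLIP's weights.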
Key words: Scene Graph; Prompt Learning; Text-to-Image Person Re-identification (T2IReID); Contrastive Language-Image Pretraining (CLIP)
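In the second stage, the scene graph generates semantically close hard negatives, which enter a triplet loss alongside each matched text-image pair. Below is a minimal cosine-distance triplet loss; the margin of 0.3 and the choice of cosine distance are illustrative assumptions, not the paper's reported settings.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Normalize rows to unit length, as CLIP-style embeddings usually are."""
    x = np.asarray(x, dtype=float)
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Mean triplet loss with cosine distance.

    anchor:   text embeddings, shape (B, D)
    positive: embeddings of the matched images, shape (B, D)
    negative: embeddings of scene-graph-generated hard negatives, shape (B, D)
    """
    a, p, n = l2_normalize(anchor), l2_normalize(positive), l2_normalize(negative)
    d_pos = 1.0 - np.sum(a * p, axis=-1)  # distance to the true match
    d_neg = 1.0 - np.sum(a * n, axis=-1)  # distance to the hard negative
    return float(np.maximum(d_pos - d_neg + margin, 0.0).mean())

# A well-separated triplet incurs no loss...
easy = triplet_loss([[1.0, 0.0]], [[1.0, 0.0]], [[0.0, 1.0]])  # → 0.0
# ...while a negative as close as the positive is penalized by the full margin.
hard = triplet_loss([[1.0, 0.0]], [[1.0, 0.0]], [[1.0, 0.0]])  # → 0.3
```

Minimizing this objective pushes each caption embedding at least `margin` closer to its matched image than to a hard negative whose scene graph differs only slightly.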
Received: 2024-07-15
CLC Number: TP 391.41
Fund: Supported by National Natural Science Foundation of China (No. U20A20182)
Corresponding author: LU Mingming, Ph.D., associate professor. Research interests: pattern recognition, deep learning, and computer vision. E-mail: mingminglu@csu.edu.cn.
About the author: WANG Jinxi, master's student. Research interests: deep learning and computer vision. E-mail: 224711027@csu.edu.cn.
Cite this article:
WANG Jinxi, LU Mingming. Scene Graph Knowledge Based Text-to-Image Person Re-identification. Pattern Recognition and Artificial Intelligence, 2024, 37(11): 947-959.
Link to this article:
http://manu46.magtech.com.cn/Jweb_prai/CN/10.16451/j.cnki.issn1003-6059.202411001 or http://manu46.magtech.com.cn/Jweb_prai/CN/Y2024/V37/I11/947
Copyright © Editorial Office of Pattern Recognition and Artificial Intelligence
Address: 350 Shushanhu Road, Hefei, Anhui Province  Tel: 0551-65591176  Fax: 0551-65591176  Email: bjb@iim.ac.cn