Abstract: Most existing text-to-image person re-identification methods fine-tune a vision-language model, such as contrastive language-image pretraining (CLIP), to adapt it to the re-identification task while inheriting the strong joint vision-language representations of the pre-trained model. These methods address only task adaptation for downstream re-identification. They ignore the data adaptation required by differences between the downstream data and the pre-training data, and they still struggle to capture structured knowledge, such as object attributes and relationships between objects. To solve these problems, a scene graph knowledge based text-to-image person re-identification method is proposed, built on a two-stage training strategy. In the first stage, the image encoder and the text encoder of the CLIP model are frozen, and prompt learning optimizes learnable prompt tokens so that the downstream data domain adapts to the original training data domain of CLIP, which addresses the domain adaptation problem. In the second stage, the CLIP model is fine-tuned together with two additional modules: semantic negative sampling and a scene graph encoder. First, hard samples with similar semantics are generated from the scene graph, and a triplet loss is introduced as an additional optimization objective. Second, the scene graph encoder takes the scene graph as input, enhancing the ability of CLIP to acquire structured knowledge. The effectiveness of the proposed method is verified on three widely used datasets.
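The second-stage idea of generating semantically similar hard samples from a scene graph and penalizing them with a triplet loss can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the scene graph is reduced to hypothetical (attribute, object) pairs, the hard negative is formed by exchanging two attributes, similarity is cosine similarity, and the margin of 0.2 is an arbitrary value.

```python
import math


def swap_attributes(pairs):
    """Return a hard-negative copy of `pairs` with two differing attributes exchanged.

    `pairs` is a list of (attribute, object) tuples, a simplified stand-in
    for scene-graph nodes (assumed format). Returns None when no two
    attributes differ, because no distinct negative can be formed.
    """
    for i in range(len(pairs)):
        for j in range(i + 1, len(pairs)):
            if pairs[i][0] != pairs[j][0]:
                negative = list(pairs)
                negative[i] = (pairs[j][0], pairs[i][1])
                negative[j] = (pairs[i][0], pairs[j][1])
                return negative
    return None


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on cosine similarity.

    The loss is zero once the anchor is more similar to the positive than
    to the negative by at least `margin` (the margin value is an assumption).
    """
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))


if __name__ == "__main__":
    caption_graph = [("red", "jacket"), ("black", "backpack")]
    # The negative keeps the same words but describes a different person.
    print(swap_attributes(caption_graph))
```

In this sketch the anchor would be an image embedding, the positive the embedding of the original caption, and the negative the embedding of the attribute-swapped caption. The negative reuses the same vocabulary as the positive, so the model can only separate them by attending to which attribute belongs to which object.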
WANG Jinxi, LU Mingming. Scene Graph Knowledge Based Text-to-Image Person Re-identification[J]. Pattern Recognition and Artificial Intelligence, 2024, 37(11): 947-959.