1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650504;
2. Key Laboratory of Artificial Intelligence in Yunnan Province, Kunming University of Science and Technology, Kunming 650504
Current person search methods are predominantly limited to image-based queries, and their retrieval accuracy is significantly restricted by low-quality query images or incomplete pedestrian features. Furthermore, mainstream methods rely on region proposal networks and non-maximum suppression to generate predefined candidate boxes, which makes it difficult to perform end-to-end person search directly from a query to a panoramic gallery. To address these problems, a multimodal query-guided end-to-end person search method is proposed. Textual descriptions of pedestrians are introduced as an auxiliary modality to overcome the limitation of relying solely on visual information, and pedestrian detection and re-identification are jointly optimized within an end-to-end architecture. First, to enhance the semantic completeness of pedestrian representations, the semantic differences between the query image and the text description are explored so that more comprehensive pedestrian information is learned. Then, a cross-modal attention mechanism enhances the pedestrian features in the gallery images that correspond to the query information, improving the discriminability of pedestrian features. Finally, a Transformer-based detection module is adopted; it discards the traditional pipeline of region proposal networks and non-maximum suppression and directly outputs the final person search results. Experiments on challenging datasets demonstrate the superior performance of the proposed method.
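The query-guided enhancement step described above can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so this assumes plain single-head scaled dot-product attention without learned projections, with gallery tokens attending to the concatenated image and text query tokens and a residual connection. The function name `cross_modal_attention` and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_modal_attention(gallery_tokens, query_tokens):
    """Enhance gallery tokens with query information (illustrative assumption).

    Each gallery token attends to all query tokens (image and text) and adds
    the attention-weighted sum of those tokens back to itself as a residual.
    Learned projection matrices are omitted to keep the sketch short.
    """
    dim = len(gallery_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    enhanced, all_weights = [], []
    for g in gallery_tokens:
        # Similarity of this gallery token to every query token.
        weights = softmax([dot(g, q) * scale for q in query_tokens])
        # Attention-weighted mixture of the query tokens.
        context = [
            sum(w * q[i] for w, q in zip(weights, query_tokens))
            for i in range(dim)
        ]
        enhanced.append([gi + ci for gi, ci in zip(g, context)])
        all_weights.append(weights)
    return enhanced, all_weights


# Toy data: one token from the query image and one from the text description.
image_query_tokens = [[1.0, 0.0, 0.0, 0.0]]
text_query_tokens = [[0.0, 1.0, 0.0, 0.0]]
query_tokens = image_query_tokens + text_query_tokens

# Two gallery tokens: the first resembles the image query, the second neither.
gallery_tokens = [[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 3.0]]

enhanced, weights = cross_modal_attention(gallery_tokens, query_tokens)
```

In this toy setting the first gallery token puts more weight on the image query token it resembles, while the second token, which matches neither query token, spreads its attention evenly. A trained model would use learned query, key, and value projections and multiple heads.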