Dynamic Semantic Clustering Relation Modeling Method for Object Tracking
NIE Guohao1, WANG Xingmei1,2, XU Yuezhu1, YANG Wentao3
1. College of Computer Science and Technology, Harbin Engineering University, Harbin 150001;
2. National Engineering Laboratory for Modeling and Emulation in E-Government, Harbin Engineering University, Harbin 150001;
3. College of Design and Engineering, National University of Singapore, Singapore 117575
When Transformer-based object tracking methods use a global attention mechanism to model the spatial relations between the search area and the template, target deformation can degrade feature discriminability and cause the target to be confused with the background. To address this problem, a dynamic semantic clustering relation modeling method for object tracking is proposed. First, a semantic relation modeling module is constructed, in which local attention in the feature space concentrates on semantically similar feature vectors and thereby suppresses erroneous interactions between the target and the distracting background. Second, a dynamic semantic clustering module is designed that uses graph neural networks to capture local correlations. The module adaptively generates semantic category partitions, enabling dynamic attention mechanisms to enhance the discriminative information between the target and the background. Finally, a semantic background elimination strategy is designed to suppress interference from background features during relation modeling, thereby improving tracking efficiency. Experimental results on six benchmark datasets demonstrate the superiority of the proposed method.
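The idea of restricting attention to semantically similar feature vectors can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `cluster_masked_attention` is hypothetical, the cluster labels are assumed to be given (the paper derives them with a graph-neural-network-based clustering module), and learned query/key/value projections are omitted for brevity.

```python
import numpy as np


def cluster_masked_attention(tokens, labels):
    """Self-attention in which each token attends only to tokens in its own cluster.

    tokens: (n, d) array of feature vectors.
    labels: (n,) array of integer cluster ids (assumed to be given here).
    Returns the attended features (n, d) and the attention weights (n, n).
    """
    n, d = tokens.shape
    # Scaled dot-product similarities between all token pairs.
    scores = tokens @ tokens.T / np.sqrt(d)
    # True where two tokens share a cluster label.
    same_cluster = labels[:, None] == labels[None, :]
    # Block cross-cluster interactions before the softmax.
    scores = np.where(same_cluster, scores, -np.inf)
    # Row-wise softmax; every row has a finite entry because a token
    # always shares a cluster with itself.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ tokens, weights


# Toy usage: two "target-like" tokens (cluster 0) and two "background-like" tokens (cluster 1).
tokens = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
attended, weights = cluster_masked_attention(tokens, labels)
```

In this toy setting the attention weights between the two clusters are exactly zero, so target-like tokens are never mixed with background-like ones. Under the same assumptions, a background elimination step could be sketched as dropping the tokens of a cluster judged to be background before attention is computed.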