Pattern Recognition and Artificial Intelligence
 
 
2025 Vol. 38, Issue 4, Published 2025-04-25

Special Topics of Academic Papers at the 27th Annual Meeting of the China Association for Science and Technology
293 Dynamic Semantic Clustering Relation Modeling Method for Object Tracking
NIE Guohao, WANG Xingmei, XU Yuezhu, YANG Wentao

When Transformer-based object tracking methods employ a global attention mechanism to model the spatial relations between the search area and the template, target deformation can degrade feature discriminability and cause confusion between the target and the background. To solve this problem, a dynamic semantic clustering relation modeling method for object tracking is proposed. First, a semantic relation modeling module is constructed. Local attention mechanisms in the feature space concentrate on semantically similar feature vectors, effectively suppressing erroneous interactions between the target and the distracting background. Second, a dynamic semantic clustering module is designed, with graph neural networks employed to capture local correlations. The module adaptively generates semantic category partitions, enabling dynamic attention mechanisms to enhance the discriminative information between the target and the background. Finally, a semantic background elimination strategy is designed to suppress interference from background features during relation modeling, thereby improving tracking efficiency. Experimental results on six benchmark datasets demonstrate the superiority of the proposed method.
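As a rough illustration of the core idea, the sketch below restricts attention to tokens that fall in the same semantic cluster, so that interactions between the target and visually dissimilar background tokens are suppressed. The hard cluster assignment, the center initialization, and the omission of query/key projections are simplified assumptions for illustration, not the authors' exact formulation.

```python
# Minimal PyTorch sketch: attention masked to semantically similar tokens.
import torch
import torch.nn.functional as F

def clustered_attention(x, num_clusters=4):
    """x: (B, N, C) token features; attention is restricted to tokens
    sharing the same hard cluster assignment."""
    B, N, C = x.shape
    # Hard-assign each token to its nearest center; centers are taken
    # from the first tokens here purely for illustration.
    centers = x[:, :num_clusters, :]                               # (B, K, C)
    sim = F.normalize(x, dim=-1) @ F.normalize(centers, dim=-1).transpose(1, 2)
    assign = sim.argmax(dim=-1)                                    # (B, N)
    mask = assign.unsqueeze(2) == assign.unsqueeze(1)              # (B, N, N)
    attn = (x @ x.transpose(1, 2)) / C ** 0.5                      # raw scores
    attn = attn.masked_fill(~mask, float('-inf'))                  # block cross-cluster links
    return F.softmax(attn, dim=-1) @ x

tokens = torch.randn(2, 64, 32)
print(clustered_attention(tokens).shape)  # torch.Size([2, 64, 32])
```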

2025 Vol. 38 (4): 293-309
310 Three-Dimensional Rotation Equivariant Self-Supervised Learning Vector Network Combined with Diffusion Model
SHEN Kedi, ZHAO Jieyu, XIE Min
Many networks for processing 3D data lack rotation equivariance and therefore struggle to process 3D objects under unknown rotations and to estimate their pose changes. To solve this problem, a three-dimensional rotation equivariant self-supervised learning vector network combined with a diffusion model is proposed. The network learns the rotation information of 3D objects, handles the pose change estimation task, and optimizes the overall pose information using local pose information denoised by the diffusion model. For the equivariant vector network, scalar data are promoted to vector representations using vector neurons. Self-supervised learning is implemented without labeled data, enabling the network to learn the vector information of 3D targets and achieve rotation reconstruction and pose change estimation of 3D data. Meanwhile, to address local deviation in the pose estimation results, a diffusion model is constructed to optimize the overall pose change estimation. The model learns local pose information during the noising and denoising process and effectively removes noise in the local pose. Experiments demonstrate that the designed network estimates pose changes of data in 3D space when the test data are randomly rotated, outperforming other networks. Moreover, the proposed model achieves superior performance in the reassembly task compared with current state-of-the-art methods, and it optimizes the overall pose information through local pose information.
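The vector-neuron construction the abstract builds on can be illustrated with a minimal rotation-equivariant linear layer: the weights mix feature channels only, so a rotation applied to the 3D axis commutes with the layer. This is a generic sketch of the vector neuron idea, not the paper's full network.

```python
# Minimal vector-neuron linear layer with an SO(3)-equivariance check.
import torch
import torch.nn as nn

class VNLinear(nn.Module):
    """Maps (B, C_in, 3, N) vector features to (B, C_out, 3, N).
    No bias: a bias term would break rotation equivariance."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.map = nn.Linear(in_ch, out_ch, bias=False)
    def forward(self, x):
        # Mix channels only; the 3D axis is left untouched.
        return self.map(x.transpose(1, -1)).transpose(1, -1)

# Equivariance check with a random orthogonal matrix from QR decomposition.
x = torch.randn(1, 8, 3, 16)
R, _ = torch.linalg.qr(torch.randn(3, 3))
layer = VNLinear(8, 4)
lhs = layer(torch.einsum('ij,bcjn->bcin', R, x))   # rotate, then map
rhs = torch.einsum('ij,bcjn->bcin', R, layer(x))   # map, then rotate
print(torch.allclose(lhs, rhs, atol=1e-5))         # True
```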
2025 Vol. 38 (4): 310-324
325 Two-Domain Feature Association Networks for Image Classification
YUAN Heng, YU Dongqi, GAO Yuan
The performance of image classification networks is constrained by reliance on spatial domain features and neglect of frequency domain features. To address this issue, two-domain feature association networks for image classification (TANet) are proposed. First, a frequency domain feature extraction (FDFE) module is designed. The fast Fourier transform is employed to capture frequency domain detail information and global features in the image, reduce key feature loss, enhance the representation of image detail information, and improve the feature extraction ability of the network. Then, a frequency domain attention mechanism (FDAM) is proposed. Multi-scale spatial domain features are taken into account and combined with the fast Fourier transform to extract frequency domain information. Through FDAM, sensitivity to image details is enhanced and the contribution of key regions is increased. Subsequently, a two-domain feature association mechanism (TFAM) is designed to fuse frequency domain features with spatial domain features. While retaining spatial domain features, the frequency domain features supplement image detail information and global features, thereby enhancing the expressive ability of the features. Finally, FDAM is embedded into the residual branch to learn the two-domain features of the input data more effectively. Thus, attention between local and global information is balanced, the availability of key features is enhanced, and the image classification capability of the network is improved. Experiments on five public datasets show that TANet enhances image classification performance by incorporating frequency domain features, extracting image detail information and global features, reducing key feature loss, enhancing the perception of important regions, and improving feature expression.
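A minimal sketch of FFT-based frequency domain feature extraction in the spirit of FDFE: the feature map is moved to the frequency domain with a 2D FFT, reweighted by a learnable spectral filter (which acts globally in the spatial domain), and transformed back. The filter parameterization below is an assumption for illustration, not the paper's exact module.

```python
# Frequency-domain feature branch: FFT -> learnable spectral filter -> iFFT.
import torch
import torch.nn as nn

class FrequencyBranch(nn.Module):
    def __init__(self, channels, h, w):
        super().__init__()
        # Learnable complex filter over the half-spectrum produced by rfft2,
        # stored as (real, imag) pairs in the last dimension.
        self.filter = nn.Parameter(torch.randn(channels, h, w // 2 + 1, 2) * 0.02)

    def forward(self, x):                        # x: (B, C, H, W)
        spec = torch.fft.rfft2(x, norm='ortho')  # (B, C, H, W//2+1), complex
        spec = spec * torch.view_as_complex(self.filter)
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm='ortho')

x = torch.randn(2, 16, 32, 32)
print(FrequencyBranch(16, 32, 32)(x).shape)  # torch.Size([2, 16, 32, 32])
```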
2025 Vol. 38 (4): 325-340
341 Active Clustering with Tailored Nearest Neighbor Graph
XIE Wenbo, DENG Tao, FU Xun, CHEN Bin, ZOU Tian, WANG Xin

In modern data analysis and machine learning applications, extracting critical information from newly acquired data for efficient grouping (clustering) and annotation remains a central challenge for clustering algorithms. Due to the lack of guiding prior information, traditional unsupervised clustering algorithms struggle to meet the high-quality data requirements of complex tasks such as pre-training large models. Active learning methods can effectively improve clustering accuracy, but their practical application is constrained by high human interaction costs and computational overhead. To address these issues, an algorithm of active clustering with tailored nearest neighbor graph (ACNNG) is proposed. A sparse nearest neighbor graph is constructed to model relationships between data points. Based on this graph, the topological centrality and uncertainty of data points are jointly computed to identify key data points effectively. A small number of pairwise constraints are collected from users to significantly enhance clustering accuracy. Furthermore, an efficient label propagation mechanism cooperating with the nearest neighbor graph structure is employed. By leveraging the sparse graph structure for low-cost label propagation, ACNNG substantially reduces spatiotemporal complexity and improves scalability for large-scale data processing. Experiments on real-world and synthetic datasets demonstrate that ACNNG achieves higher clustering accuracy with fewer pairwise constraints, less runtime, and less memory consumption, showing its potential for practical applications.
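The pipeline can be sketched roughly as follows: build a sparse kNN graph, query the labels of the most central points (standing in for user feedback), and propagate them over the graph. The degree-based centrality score and the propagation rule are simplified assumptions; the paper additionally scores uncertainty and works with pairwise constraints rather than direct labels.

```python
# Rough sketch: kNN graph construction, query selection, label propagation.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import kneighbors_graph
from sklearn.preprocessing import normalize

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
A = kneighbors_graph(X, n_neighbors=10, mode='connectivity')
A = ((A + A.T) > 0).astype(float)              # symmetric sparse adjacency

# Query the labels of the most central points (highest degree here).
degree = np.asarray(A.sum(axis=1)).ravel()
queried = np.argsort(-degree)[:15]

# Row-normalized propagation over the sparse graph, clamping queried rows.
W = normalize(A, norm='l1', axis=1)
probs = np.zeros((len(X), 3))
probs[queried, y_true[queried]] = 1.0
for _ in range(50):
    probs = W @ probs
    probs[queried] = 0.0
    probs[queried, y_true[queried]] = 1.0
print('accuracy:', (probs.argmax(axis=1) == y_true).mean())
```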

2025 Vol. 38 (4): 341-358
359 Text-to-Image Generation via Dual Optimization Stable Diffusion Model
HUANG Jinjie, LIU Bin
The stable diffusion (SD) model cannot ensure full alignment between generated images and input textual prompts when handling prompts containing multiple objects, and completely retraining the SD model requires enormous computational resources. To solve this problem, a training-free method, text-to-image generation via dual optimization stable diffusion model (DualOpt-SD), is proposed. First, a layout-to-image generation (L2I) model is integrated with a text-to-image generation (T2I) model through a generation framework based on a pre-trained SD model. Next, a dual optimization (DualOpt) strategy is designed to optimize the noise output by the model during inference. DualOpt consists of two parts: one dynamically adjusts the prior knowledge learned by L2I and T2I based on attention scores, and the other focuses on the requirements of different denoising stages, applying varying attention to L2I and T2I. Experiments demonstrate that when the text prompt contains multiple objects, DualOpt-SD improves compositional accuracy while preserving the strong interpretative capabilities of the SD model. Furthermore, DualOpt-SD achieves higher overall image generation performance and produces images with high realism and reasonable object placement.
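The stage-aware part of DualOpt can be pictured as blending the noise estimates of the two branches with a weight that varies across denoising steps, so early steps lean on layout guidance and late steps on text guidance. The linear schedule and function interface below are hypothetical, chosen only to make the idea concrete.

```python
# Illustrative timestep-dependent blending of two noise predictions.
import torch

def blend_noise(eps_l2i, eps_t2i, t, T):
    """Early steps (large t) trust the layout branch; late steps (small t)
    trust the text branch. Hypothetical linear schedule in [0, 1]."""
    w = t / T
    return w * eps_l2i + (1.0 - w) * eps_t2i

T = 50
eps_layout = torch.randn(1, 4, 64, 64)   # stand-in for the L2I branch output
eps_text = torch.randn(1, 4, 64, 64)     # stand-in for the T2I branch output
for t in range(T, 0, -10):
    eps = blend_noise(eps_layout, eps_text, t, T)
    print(t, eps.shape)
```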
2025 Vol. 38 (4): 359-373
Research and Applications
374 Image Captioning Based on Cross-Modal Prior Injection
JIANG Zetao, ZHANG Luhao, PAN Yiwei, LI Mengtong, YANG Jianchen
Combining semantic information from text and image modalities is one of the key points of image captioning. However, existing image captioning methods often treat text information merely as a constraint in the decoding stage or simply concatenate and fuse text features with image features. This causes insufficient cross-modal interaction between text and image and creates a modality gap, so the semantic information contained in the text cannot be fully utilized in the encoding stage. To address this issue, a method for image captioning based on cross-modal prior injection (CMPI) is proposed. First, textual prior knowledge is extracted through contrastive language-image pre-training (CLIP). Then, the textual prior knowledge interacts with a modal medium in a first interaction, yielding cross-modal features that contain both textual and image semantic information. Finally, a second modal interaction is performed between the cross-modal features and the grid features of the image. With the cross-modal features as a medium, the prior knowledge of the text is injected into the image features. In this way, the semantic information of the text is incorporated without damaging the structure of the image features, and the modality gap is alleviated. Experimental results on the Karpathy splits of the MSCOCO dataset show that CMPI achieves a CIDEr score of 128.0 in the first training stage and 140.5 in the second training stage, demonstrating a clear advantage.
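The two-stage interaction can be sketched as a pair of cross-attention steps: learnable "medium" tokens first attend to the CLIP text prior, then the image grid features attend to the text-enriched medium, with a residual connection preserving the original image feature structure. Dimensions and module layout below are assumptions for illustration, not the paper's exact architecture.

```python
# Schematic two-stage cross-modal prior injection via cross-attention.
import torch
import torch.nn as nn

class PriorInjection(nn.Module):
    def __init__(self, dim=512, num_medium=16, heads=8):
        super().__init__()
        self.medium = nn.Parameter(torch.randn(1, num_medium, dim) * 0.02)
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_prior, grid):
        # Stage 1: medium queries absorb the textual prior knowledge.
        m, _ = self.text_attn(self.medium.expand(grid.size(0), -1, -1),
                              text_prior, text_prior)
        # Stage 2: image grid features query the text-enriched medium;
        # the residual keeps the original image feature structure intact.
        out, _ = self.image_attn(grid, m, m)
        return grid + out

text = torch.randn(2, 77, 512)   # e.g. CLIP text token features
grid = torch.randn(2, 49, 512)   # e.g. 7x7 image grid features
print(PriorInjection()(text, grid).shape)  # torch.Size([2, 49, 512])
```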
2025 Vol. 38 (4): 374-384
 

Supervised by
China Association for Science and Technology
Sponsored by
Chinese Association of Automation
National Research Center for Intelligent Computing System
Institute of Intelligent Machines, Chinese Academy of Sciences
Published by
Science Press
 
Copyright © 2010 Editorial Office of Pattern Recognition and Artificial Intelligence
Address: No. 350 Shushanhu Road, Hefei, Anhui Province, P.R. China  Tel: 0551-65591176  Fax: 0551-65591176  Email: bjb@iim.ac.cn