Transformer Network with Multimodal Attention Perception and Adjacent-Scale Modeling

doi:10.16451/j.cnki.issn1003-6059.202604002


Abstract: RGB-D salient object detection aims to identify the most visually attractive objects in paired color and depth images; the key challenge is fusing multimodal and multiscale features effectively. When fusing RGB and depth features, existing methods still fall short in representing complementary modal information, preserving edge details, and exploiting cross-scale associations. To address this, a Transformer network with multimodal attention perception and adjacent-scale modeling (MATNet) is proposed. Multilevel RGB and depth features are extracted by dual-branch pyramid pooling Transformer encoders. A multimodal attention fusion module is introduced at each stage, where channel attention and spatial attention jointly enhance the representation of complementary modal information and the semantic consistency of key regions. An adjacent-scale modeling module then progressively aggregates adjacent-scale features in a top-down manner, fusing high-level semantic information with low-level edge and texture information to improve the structural integrity and boundary representation of salient objects. Finally, an end-to-end detection framework is built by combining multi-scale prediction with a supervision mechanism. Experiments on five public datasets demonstrate that MATNet is effective and stable in improving detection accuracy and edge preservation.
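The top-down aggregation of adjacent scales described in the abstract can be illustrated with a minimal sketch. This is not MATNet's module: the abstract gives no implementation details, so the nearest-neighbor upsampling, the element-wise addition, the single-channel maps, and the function names `upsample2x`, `fuse_adjacent`, and `top_down_aggregate` are all assumptions chosen for illustration.

```python
# Illustrative sketch only: nearest-neighbor upsampling and element-wise
# addition are assumptions, not the operations used by MATNet.

def upsample2x(fmap):
    """Nearest-neighbor 2x upsampling of a 2D list of floats."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def fuse_adjacent(fine, coarse):
    """Merge a coarse map into the next finer one (assumed: simple addition)."""
    up = upsample2x(coarse)
    return [[f + c for f, c in zip(frow, crow)] for frow, crow in zip(fine, up)]


def top_down_aggregate(pyramid):
    """pyramid is ordered fine -> coarse; returns fused maps in the same order."""
    fused = [pyramid[-1]]                  # start from the coarsest scale
    for fine in reversed(pyramid[:-1]):    # walk toward finer scales
        fused.append(fuse_adjacent(fine, fused[-1]))
    return fused[::-1]


pyramid = [
    [[1.0] * 4 for _ in range(4)],  # finest scale, 4x4
    [[2.0] * 2 for _ in range(2)],  # middle scale, 2x2
    [[3.0]],                        # coarsest scale, 1x1
]
fused = top_down_aggregate(pyramid)
```

Each fused map combines its own scale with everything coarser than it, which is the sense in which high-level information is propagated down to the finest level; a learned module would replace the plain addition.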

Key words: RGB-D Salient Object Detection, Multimodal Attention Fusion, Adjacent-Scale Modeling, Transformer, Multiscale Feature Fusion

Received: 19 January 2026

CLC number (ZTFLH): TP391

Fund: Key Program of Joint Funds of National Natural Science Foundation of China (No. U2568225), National Natural Science Foundation of China (No. 52372418, U2368203), Innovation Capability Support Program of Shaanxi (No. 2025RS-CXTD-006)

Corresponding Author: HE Min, Ph.D., professor. His research interests include intelligent technologies in civil engineering.

About Authors: SONG Xiaogang, Ph.D., professor. His research interests include computer vision and autonomous unmanned navigation systems. ZHANG Haoze, master's student. His research interests include salient object detection and artificial intelligence security. ZHANG Xiaolong, master's student. His research interests include salient object detection. ZHAO Qin, Ph.D., professor. Her research interests include artificial intelligence and big data. HEI Xinhong, Ph.D., professor. His research interests include computer vision and artificial intelligence.


Cite this article:

SONG Xiaogang, ZHANG Haoze, ZHANG Xiaolong, et al. Transformer Network with Multimodal Attention Perception and Adjacent-Scale Modeling[J]. Pattern Recognition and Artificial Intelligence, 2026, 39(4): 311-329.

URL:

http://manu46.magtech.com.cn/Jweb_prai/EN/10.16451/j.cnki.issn1003-6059.202604002 OR http://manu46.magtech.com.cn/Jweb_prai/EN/Y2026/V39/I4/311
