Fine-to-Coarse Grained Causality Co-Driven Approach for Explanatory Visual Question Answering
SHI Yecheng 1,2, MIAO Jiali 1,2, YU Kui 1,2
1. School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601; 2. Key Laboratory of Knowledge Engineering with Big Data of Ministry of Education of China, Hefei University of Technology, Hefei 230009
Abstract: Explanatory visual question answering (EVQA) generates user-friendly multimodal explanations of the reasoning process while answering visual questions, thereby enhancing the credibility of model inference. However, because existing EVQA models do not effectively model the relations among objects in visual regions, the explanations they generate suffer from inconsistency between visual regions and semantics. To address this issue, a fine-to-coarse grained causality co-driven (FCGC-CoD) approach for explanatory visual question answering is proposed. First, the causal relationships among visual region features are modeled, and influential and supportive objects are identified to enhance the multimodal representation capability of the vision-and-language pretrained model. Then, a joint variational causal inference network is designed to strengthen the coarse-grained reasoning process with fine-grained multimodal causal representations, thereby generating multimodal explanations and answers. Experimental results demonstrate that FCGC-CoD improves the visual reasoning consistency of explanations while answering questions accurately.
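To make the two-stage pipeline concrete, the following is a minimal PyTorch sketch of the approach as summarized above. It is an illustrative assumption, not the authors' implementation: the module names (CausalRegionScorer, JointVariationalCausalInference, FCGCCoD), the sigmoid-based split into influential and supportive objects, the mean-pooled fusion, and all dimensions are hypothetical placeholders standing in for the fine-grained causal modeling and the joint variational causal inference network.

import torch
import torch.nn as nn

class CausalRegionScorer(nn.Module):
    """Fine-grained stage (assumed form): score each visual region so that
    influential and supportive objects can be separated."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, regions: torch.Tensor):
        # regions: (batch, num_regions, dim) features from a region detector
        w = torch.sigmoid(self.score(regions)).squeeze(-1)   # (batch, num_regions)
        influential = regions * w.unsqueeze(-1)              # high-scoring objects
        supportive = regions * (1.0 - w).unsqueeze(-1)       # remaining context
        return influential, supportive, w

class JointVariationalCausalInference(nn.Module):
    """Coarse-grained stage (assumed form): encode fused multimodal features
    into a Gaussian latent variable and decode answer logits from a sample."""
    def __init__(self, dim: int, latent: int, num_answers: int):
        super().__init__()
        self.mu = nn.Linear(dim, latent)
        self.logvar = nn.Linear(dim, latent)
        self.answer_head = nn.Linear(latent, num_answers)

    def forward(self, fused: torch.Tensor):
        mu, logvar = self.mu(fused), self.logvar(fused)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL to N(0, I)
        return self.answer_head(z), kl

class FCGCCoD(nn.Module):
    """Hypothetical end-to-end wrapper: fine-grained causal region scoring
    feeds the coarse-grained variational reasoning module."""
    def __init__(self, dim: int = 768, latent: int = 256, num_answers: int = 3129):
        super().__init__()
        self.regions = CausalRegionScorer(dim)
        self.vci = JointVariationalCausalInference(dim, latent, num_answers)

    def forward(self, region_feats: torch.Tensor, question_feats: torch.Tensor):
        influential, supportive, w = self.regions(region_feats)
        # Fuse fine-grained causal representations with the (pooled) question features
        fused = influential.mean(dim=1) + 0.5 * supportive.mean(dim=1) + question_feats
        logits, kl = self.vci(fused)
        return logits, kl, w  # w can ground a textual explanation in visual regions

# Toy usage: random tensors stand in for a vision-and-language encoder's output.
model = FCGCCoD()
logits, kl, region_weights = model(torch.randn(2, 36, 768), torch.randn(2, 768))
print(logits.shape, kl.item(), region_weights.shape)

At training time, the KL term would be added to the answer and explanation losses in the usual variational lower-bound fashion; the per-region weights w illustrate one way the generated explanation could be grounded in specific visual regions.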