Fine-to-Coarse Grained Causality Co-Driven Approach for Explanatory Visual Question Answering
SHI Yecheng 1,2, MIAO Jiali 1,2, YU Kui 1,2
1. School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601; 2. Key Laboratory of Knowledge Engineering with Big Data of Ministry of Education of China, Hefei University of Technology, Hefei 230009
Abstract: Explanatory visual question answering (EVQA) generates user-friendly multimodal explanations of the reasoning process while answering visual questions, thereby enhancing the credibility of model inference. However, because existing EVQA models do not effectively model the relations among objects in visual regions, the explanations they generate suffer from inconsistency between visual regions and semantics. To address this issue, a fine-to-coarse grained causality co-driven (FCGC-CoD) approach for explanatory visual question answering is proposed. First, the causal relationships among visual region features are modeled, and influential and supportive objects are identified to enhance the multimodal representation capability of the vision-and-language pretrained model. Then, a joint variational causal inference network is designed to strengthen the coarse-grained reasoning process with fine-grained multimodal causal representations, thereby generating multimodal explanations and answers. Experimental results demonstrate that FCGC-CoD improves the visual reasoning consistency of explanations while answering questions accurately.
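To make the two-stage pipeline concrete, the following is a minimal PyTorch sketch of the approach as summarized above. It is an illustrative assumption, not the authors' implementation: the module names (CausalRegionScorer, JointVariationalCausalInference, FCGCCoD), the sigmoid-based split into influential and supportive objects, the mean-pooled fusion, and all dimensions are hypothetical placeholders standing in for the fine-grained causal modeling and the joint variational causal inference network.

import torch
import torch.nn as nn

class CausalRegionScorer(nn.Module):
    """Fine-grained stage (assumed form): score each visual region so that
    influential and supportive objects can be separated."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, regions: torch.Tensor):
        # regions: (batch, num_regions, dim) features from a region detector
        w = torch.sigmoid(self.score(regions)).squeeze(-1)   # (batch, num_regions)
        influential = regions * w.unsqueeze(-1)              # high-scoring objects
        supportive = regions * (1.0 - w).unsqueeze(-1)       # remaining context
        return influential, supportive, w

class JointVariationalCausalInference(nn.Module):
    """Coarse-grained stage (assumed form): encode fused multimodal features
    into a Gaussian latent variable and decode answer logits from a sample."""
    def __init__(self, dim: int, latent: int, num_answers: int):
        super().__init__()
        self.mu = nn.Linear(dim, latent)
        self.logvar = nn.Linear(dim, latent)
        self.answer_head = nn.Linear(latent, num_answers)

    def forward(self, fused: torch.Tensor):
        mu, logvar = self.mu(fused), self.logvar(fused)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL to N(0, I)
        return self.answer_head(z), kl

class FCGCCoD(nn.Module):
    """Hypothetical end-to-end wrapper: fine-grained causal region scoring
    feeds the coarse-grained variational reasoning module."""
    def __init__(self, dim: int = 768, latent: int = 256, num_answers: int = 3129):
        super().__init__()
        self.regions = CausalRegionScorer(dim)
        self.vci = JointVariationalCausalInference(dim, latent, num_answers)

    def forward(self, region_feats: torch.Tensor, question_feats: torch.Tensor):
        influential, supportive, w = self.regions(region_feats)
        # Fuse fine-grained causal representations with the (pooled) question features
        fused = influential.mean(dim=1) + 0.5 * supportive.mean(dim=1) + question_feats
        logits, kl = self.vci(fused)
        return logits, kl, w  # w can ground a textual explanation in visual regions

# Toy usage: random tensors stand in for a vision-and-language encoder's output.
model = FCGCCoD()
logits, kl, region_weights = model(torch.randn(2, 36, 768), torch.randn(2, 768))
print(logits.shape, kl.item(), region_weights.shape)

At training time, the KL term would be added to the answer and explanation losses in the usual variational lower-bound fashion; the per-region weights w illustrate one way the generated explanation could be grounded in specific visual regions.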