Image-text multimodal sentiment analysis aims to predict sentiment polarity by jointly modeling the visual and textual modalities. The key to this task lies in obtaining high-quality multimodal representations of both modalities and fusing them efficiently. To this end, a cross-modal multi-level fusion sentiment analysis method based on a visual language model (MFVL) is proposed. First, building on a pre-trained visual language model whose parameters are frozen, low-rank adaptation is adopted to fine-tune the large language model, producing high-quality multimodal representations and modality-bridge representations. Second, a cross-modal multi-head co-attention fusion module is designed to perform interactive, weighted fusion of the visual and textual modality representations. Finally, a mixture-of-experts module is designed to deeply fuse the visual, textual and modality-bridge representations for sentiment prediction. Experimental results indicate that MFVL achieves state-of-the-art performance on the public evaluation datasets MVSA-Single and HFM.
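The abstract describes the two fusion stages only at a high level; the PyTorch sketch below illustrates one plausible reading of them. The class names (CoAttentionFusion, MoEFusion), the hidden size (768), the expert count (4) and the class count (3) are illustrative assumptions rather than the paper's exact configuration, and the LoRA fine-tuning of the frozen language model backbone is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoAttentionFusion(nn.Module):
    """Cross-modal multi-head co-attention: each modality attends to the other."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text, image):
        # Text queries attend over image keys/values and vice versa,
        # yielding interactively weighted representations of each modality.
        t, _ = self.txt2img(query=text, key=image, value=image)
        v, _ = self.img2txt(query=image, key=text, value=text)
        return t, v


class MoEFusion(nn.Module):
    """Softmax-gated mixture of experts over the concatenated representations."""

    def __init__(self, dim: int = 768, n_experts: int = 4, n_classes: int = 3):
        super().__init__()
        self.gate = nn.Linear(3 * dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(3 * dim, dim), nn.GELU(), nn.Linear(dim, n_classes))
            for _ in range(n_experts)
        )

    def forward(self, text_vec, image_vec, bridge_vec):
        x = torch.cat([text_vec, image_vec, bridge_vec], dim=-1)  # (B, 3*dim)
        w = F.softmax(self.gate(x), dim=-1)                       # (B, E) gating weights
        y = torch.stack([e(x) for e in self.experts], dim=1)      # (B, E, C) expert logits
        return (w.unsqueeze(-1) * y).sum(dim=1)                   # (B, C) fused logits


# Toy forward pass with random features standing in for the frozen encoders' outputs.
text = torch.randn(2, 32, 768)    # token-level text features
image = torch.randn(2, 49, 768)   # patch-level image features
bridge = torch.randn(2, 768)      # pooled modality-bridge vector
t, v = CoAttentionFusion()(text, image)
logits = MoEFusion()(t.mean(dim=1), v.mean(dim=1), bridge)  # (2, 3)
```

In this reading, the co-attention stage operates on token- and patch-level features, which are then pooled before the mixture-of-experts head combines them with the modality-bridge vector produced by the visual language model; the actual pooling and gating choices in MFVL may differ.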