A Survey on Multimodal Sentiment Analysis

ZHANG Yazhou¹, RONG Lu², SONG Dawei³, ZHANG Peng⁴

1. College of Software Engineering, Zhengzhou University of Light Industry, Zhengzhou 450002; 2. Personnel Department, Zhengzhou University of Light Industry, Zhengzhou 450002; 3. School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081; 4. College of Intelligence and Computing, Tianjin University, Tianjin 300350
Abstract: Multimodal sentiment analysis is one of the core research topics in natural language processing. This survey first introduces the research background of multimodal sentiment analysis and proposes two sub-topics: narrative multimodal sentiment analysis and interactive multimodal sentiment analysis. The development of the field and the research progress in China and abroad are then summarized along these two sub-topics. Finally, the open scientific problems of interaction modeling in this field are summarized, and future research directions are discussed.

Received: 31 December 2019
Fund: Supported by National Key Research and Development Program of China (No. 2018YFC0831704), National Natural Science Foundation of China (No. U1636203, 61772363, U1736103), Major Project of Zhejiang Laboratory (No. 2019DH0ZX01) and European Union's Horizon 2020 Research and Innovation Program under the Marie Skłodowska-Curie Grant Agreement (No. 721321)
About the authors: ZHANG Yazhou, Ph.D., lecturer. His research interests include multimodal sentiment analysis, natural language processing and quantum cognition. RONG Lu, master, research assistant. Her research interests include natural language understanding, machine translation and computational linguistics. SONG Dawei (corresponding author), Ph.D., professor. His research interests include information retrieval, natural language processing and quantum cognition. ZHANG Peng, Ph.D., associate professor. His research interests include information retrieval, natural language processing and quantum cognition.