Abstract: Comics are widely used in social media as metaphors for social phenomena and as a means of expressing emotion. To address the problem of label ambiguity in multi-modal and multi-label emotion detection for comic scenes, a multi-modal and multi-label emotion detection model for comics based on a two-stream network is proposed. The backbone of the method is a two-stream structure: the Transformer model serves as the image backbone network to extract image features, and the RoBERTa pre-trained model serves as the text backbone network to extract text features. Inter-modal information is compared using cosine similarity and combined with a self-attention mechanism to merge image features and text features. An improved cosine similarity, combining a cosine self-attention mechanism with a multi-head self-attention mechanism (COS-MHSA), is employed to extract high-level image features. Finally, the high-level features and the multi-modal features of COS-MHSA are fused. The effectiveness of the proposed method is verified on the EmoRecCom dataset, and the emotion detection results are presented in a visual manner.
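As a concrete illustration of the pipeline described above, the following is a minimal PyTorch sketch, not the authors' implementation: the cosine-attention formulation of COS-MHSA, the mean pooling, the concatenation fusion, the hidden size, and the choice of pre-trained checkpoints (roberta-base and google/vit-base-patch16-224-in21k as stand-ins for the RoBERTa and Transformer backbones) are all assumptions made for illustration, as is the eight-class label set of EmoRecCom.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import RobertaModel, ViTModel


class CosMHSA(nn.Module):
    """Multi-head self-attention with cosine-similarity scores
    (one assumed reading of the COS-MHSA module)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, D)
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Split into heads: (B, H, N, D/H).
        def split(t):
            return t.view(B, N, self.num_heads, -1).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        # L2-normalize q and k so their dot product is a cosine similarity,
        # replacing the usual scaled dot-product attention score.
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)


class TwoStreamEmotionDetector(nn.Module):
    """Two-stream model: RoBERTa for text, ViT for images, COS-MHSA on the
    image stream, then late fusion and a multi-label classifier head."""

    def __init__(self, num_labels: int = 8, dim: int = 768):
        super().__init__()
        self.text_encoder = RobertaModel.from_pretrained("roberta-base")
        self.image_encoder = ViTModel.from_pretrained(
            "google/vit-base-patch16-224-in21k")
        self.cos_mhsa = CosMHSA(dim)
        self.classifier = nn.Linear(dim * 2, num_labels)

    def forward(self, input_ids, attention_mask, pixel_values):
        txt = self.text_encoder(input_ids=input_ids,
                                attention_mask=attention_mask).last_hidden_state
        img = self.image_encoder(pixel_values=pixel_values).last_hidden_state
        img = self.cos_mhsa(img)  # high-level image features
        # Fuse by mean-pooling each stream and concatenating (assumed fusion).
        fused = torch.cat([txt.mean(dim=1), img.mean(dim=1)], dim=-1)
        return self.classifier(fused)  # one logit per emotion label
```

Because the task is multi-label, each emotion gets an independent sigmoid: training would use F.binary_cross_entropy_with_logits(logits, targets) with a multi-hot target vector rather than a softmax over classes.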
LIN Zhentao, ZENG Bi, PAN Zhihao, WEN Song. Multi-modal and Multi-label Emotion Detection for Comics Based on Two-Stream Network. Pattern Recognition and Artificial Intelligence, 2021, 34(11): 1017-1027.