Abstract: Comics are widely used in social media as metaphors for social phenomena and as a means of expressing emotion. To address the problem of label ambiguity in multi-modal and multi-label emotion detection for comic scenes, a multi-modal and multi-label emotion detection model for comics based on a two-stream network is proposed. The backbone of the method is a two-stream structure: the Transformer model serves as the image backbone network to extract image features, and the RoBERTa pre-trained model serves as the text backbone network to extract text features. Inter-modal information is compared using cosine similarity and combined with a self-attention mechanism to merge image features and text features. An improved cosine similarity, combining a cosine self-attention mechanism with a multi-head self-attention mechanism (COS-MHSA), is employed to extract high-level image features. Finally, the high-level features and the multi-modal features of COS-MHSA are fused. The effectiveness of the proposed method is verified on the EmoRecCom dataset, and the emotion detection results are presented in a visual manner.
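As a concrete illustration of the pipeline described above, the following is a minimal PyTorch sketch, not the authors' implementation: the cosine-attention formulation of COS-MHSA, the mean pooling, the concatenation fusion, the hidden size, and the choice of pre-trained checkpoints (roberta-base and google/vit-base-patch16-224-in21k as stand-ins for the RoBERTa and Transformer backbones) are all assumptions made for illustration, as is the eight-class label set of EmoRecCom.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import RobertaModel, ViTModel


class CosMHSA(nn.Module):
    """Multi-head self-attention with cosine-similarity scores
    (one assumed reading of the COS-MHSA module)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, D)
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Split into heads: (B, H, N, D/H).
        def split(t):
            return t.view(B, N, self.num_heads, -1).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        # L2-normalize q and k so their dot product is a cosine similarity,
        # replacing the usual scaled dot-product attention score.
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)


class TwoStreamEmotionDetector(nn.Module):
    """Two-stream model: RoBERTa for text, ViT for images, COS-MHSA on the
    image stream, then late fusion and a multi-label classifier head."""

    def __init__(self, num_labels: int = 8, dim: int = 768):
        super().__init__()
        self.text_encoder = RobertaModel.from_pretrained("roberta-base")
        self.image_encoder = ViTModel.from_pretrained(
            "google/vit-base-patch16-224-in21k")
        self.cos_mhsa = CosMHSA(dim)
        self.classifier = nn.Linear(dim * 2, num_labels)

    def forward(self, input_ids, attention_mask, pixel_values):
        txt = self.text_encoder(input_ids=input_ids,
                                attention_mask=attention_mask).last_hidden_state
        img = self.image_encoder(pixel_values=pixel_values).last_hidden_state
        img = self.cos_mhsa(img)  # high-level image features
        # Fuse by mean-pooling each stream and concatenating (assumed fusion).
        fused = torch.cat([txt.mean(dim=1), img.mean(dim=1)], dim=-1)
        return self.classifier(fused)  # one logit per emotion label
```

Because the task is multi-label, each emotion gets an independent sigmoid: training would use F.binary_cross_entropy_with_logits(logits, targets) with a multi-hot target vector rather than a softmax over classes.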
LIN Zhentao, ZENG Bi, PAN Zhihao, WEN Song. Multi-modal and Multi-label Emotion Detection for Comics Based on Two-Stream Network. Pattern Recognition and Artificial Intelligence, 2021, 34(11): 1017-1027.