A Survey on Knowledge-Driven Multimodal Semantic Understanding

doi:10.16451/j.cnki.issn1003-6059.202312005

Abstract: Multimodal learning methods based on deep learning models achieve strong semantic understanding performance in simple scenarios, such as static and controllable ones, but their generalization in complex scenarios, such as dynamic and open ones, remains poor. Many recent studies introduce human-like knowledge into multimodal semantic understanding methods and obtain promising results. To give a deeper view of current progress in knowledge-driven multimodal semantic understanding, this paper systematically surveys and analyzes the relevant methods and summarizes two main types of multimodal knowledge representation frameworks: relational and aligned. Several representative applications are then discussed in detail, including image-text matching, object detection, semantic segmentation, and vision-and-language navigation. In addition, the advantages and disadvantages of current methods are summarized, and possible future development trends are outlined.

Key words: Machine Learning, Deep Learning, Multimodal Semantic Understanding, Multimodal Knowledge Representation, Multimodal Semantic Analysis, Knowledge-Driven

Received: 2023-10-10

CLC number (ZTFLH): TP 391

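Funding: Supported by the Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (No. 2018AAA0100400) and the National Natural Science Foundation of China (No. 62236010, No. 62276261)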

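Corresponding author: HUANG Yan, Ph.D., associate researcher. Main research interest: computer vision. E-mail: huangyan2012@ia.ac.cn.

About the authors: ZHENG Yihao, Ph.D. candidate. Main research interest: artificial intelligence. E-mail: zhengyh@emails.bjut.edu.cn.
GUO Yijun, master's degree, engineer. Main research interest: computer vision. E-mail: yijun.guo@cripac.ia.ac.cn.
WU Lifang, Ph.D., professor. Main research interest: artificial intelligence. E-mail: lfwu@bjut.edu.cn.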

Cite this article:

ZHENG Yihao, GUO Yijun, WU Lifang, HUANG Yan. A Survey on Knowledge-Driven Multimodal Semantic Understanding[J]. Pattern Recognition and Artificial Intelligence, 2023, 36(12): 1127-1138.

Link to this article:

http://manu46.magtech.com.cn/Jweb_prai/CN/10.16451/j.cnki.issn1003-6059.202312005 or http://manu46.magtech.com.cn/Jweb_prai/CN/Y2023/V36/I12/1127
