A Survey on Knowledge-Driven Multimodal Semantic Understanding
ZHENG Yihao1, GUO Yijun2, WU Lifang1, HUANG Yan3
1. Faculty of Information Technology, Beijing University of Technology, Beijing 100124; 2. Center for Research on Intelligent Perception and Computing, Institute of Automation, Chinese Academy of Sciences, Beijing 100191; 3. State Key Laboratory for Multi-modal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100191
Abstract: Multimodal learning methods based on deep learning models achieve excellent semantic understanding performance in static, controllable, and simple scenarios. However, their generalization ability in dynamic, open, and other complex scenarios remains unsatisfactory. Recent research has introduced human-like knowledge into multimodal semantic understanding methods, yielding impressive results. To provide a deeper understanding of current research progress in knowledge-driven multimodal semantic understanding, this paper systematically investigates and analyzes the relevant methods and summarizes two main types of multimodal knowledge representation frameworks: relational and aligned. Several representative applications are discussed, including image-text matching, object detection, semantic segmentation, and vision-and-language navigation. Finally, the advantages and disadvantages of current methods and possible future development trends are summarized.