Cross-Media Fine-Grained Representation Learning Based on Multi-modal Graph and Adversarial Hash Attention Network
LIANG Meiyu1, WANG Xiaoxiao1, DU Junping1
1. Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia, School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876, China
Abstract: Cross-media search suffers from feature heterogeneity and a semantic gap between data of different media types, and social network data often exhibits semantic sparsity and diversity. To address these problems, a cross-media fine-grained representation learning model based on a multi-modal graph and an adversarial hash attention network (CMFAH) is proposed to obtain a unified cross-media semantic representation, and it is applied to social network cross-media search. Firstly, an image-word association graph is constructed, and direct and implicit semantic associations between images and text words are mined via a graph random walk strategy to expand the semantic relationships. Secondly, a cross-media fine-grained feature learning network is constructed, in which fine-grained semantic associations between images and texts are learned collaboratively through a cross-media attention mechanism. Finally, a cross-media adversarial hash network is constructed, and an efficient and compact unified cross-media hash semantic representation is obtained by jointly performing cross-media fine-grained semantic association learning and adversarial hash learning. Experimental results show that CMFAH achieves better cross-media search performance on two benchmark cross-media datasets.
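To make the two core ideas of the abstract concrete, the following is a minimal, illustrative PyTorch sketch of (1) a cross-media attention step that lets image region features and text word features attend to each other, and (2) a shared hash head whose tanh outputs relax binary codes and are trained adversarially against a modality discriminator. All module names, dimensions, and the single-layer design are assumptions made for illustration, not the exact CMFAH architecture.

```python
# Minimal sketch (assumed components, not the authors' exact model):
# cross-media attention + relaxed hash codes + modality discriminator.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossMediaAttention(nn.Module):
    """Attend image region features to text word features, and vice versa."""

    def __init__(self, dim: int):
        super().__init__()
        self.scale = dim ** -0.5

    def forward(self, img: torch.Tensor, txt: torch.Tensor):
        # img: (B, R, D) region features; txt: (B, W, D) word features
        attn = torch.softmax(img @ txt.transpose(1, 2) * self.scale, dim=-1)
        img_ctx = attn @ txt                  # text-aware image features
        txt_ctx = attn.transpose(1, 2) @ img  # image-aware text features
        return img_ctx.mean(dim=1), txt_ctx.mean(dim=1)


class HashHead(nn.Module):
    """Map fused features to K-bit relaxed hash codes in (-1, 1)."""

    def __init__(self, dim: int, bits: int):
        super().__init__()
        self.fc = nn.Linear(dim, bits)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.fc(x))         # sign() is applied at indexing time


class ModalityDiscriminator(nn.Module):
    """Adversary that tries to tell image codes from text codes."""

    def __init__(self, bits: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(bits, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, code: torch.Tensor) -> torch.Tensor:
        return self.net(code)                 # one real/fake logit per sample


if __name__ == "__main__":
    B, R, W, D, K = 4, 36, 20, 256, 64       # batch, regions, words, dim, bits
    img, txt = torch.randn(B, R, D), torch.randn(B, W, D)
    att, head, disc = CrossMediaAttention(D), HashHead(D, K), ModalityDiscriminator(K)
    img_ctx, txt_ctx = att(img, txt)
    img_code, txt_code = head(img_ctx), head(txt_ctx)
    # Generator side: pull matched pairs together and confuse the discriminator.
    sim_loss = F.mse_loss(img_code, txt_code)
    adv_loss = F.binary_cross_entropy_with_logits(disc(img_code), torch.ones(B, 1))
    print(sim_loss.item(), adv_loss.item())
```

In a full training loop the discriminator would be updated in alternation with the hash networks, and a quantization penalty plus semantic-label supervision would typically be added on top of the similarity and adversarial terms.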