|
|
Cross-Media Fine-Grained Representation Learning Based on Multi-modal Graph and Adversarial Hash Attention Network |
LIANG Meiyu1, WANG Xiaoxiao1, DU Junping1 |
1. Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia, School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876, China
|
|
Abstract Cross-media search is hindered by the feature heterogeneity and semantic gap between data of different media types, and social network data often exhibits semantic sparsity and diversity. To address these problems, a cross-media fine-grained representation learning model based on a multi-modal graph and an adversarial hash attention network (CMFAH) is proposed to obtain a unified cross-media semantic representation, and it is applied to social network cross-media search. Firstly, an image-word association graph is constructed, and direct and implicit semantic associations between images and text words are mined by a graph random walk strategy to expand the semantic relations. Secondly, a cross-media fine-grained feature learning network is constructed, and the fine-grained semantic associations between images and texts are learned collaboratively through a cross-media attention mechanism. Finally, a cross-media adversarial hash network is constructed, and an efficient and compact unified cross-media hash semantic representation is obtained by joint cross-media fine-grained semantic association learning and adversarial hash learning. Experimental results show that CMFAH achieves better cross-media search performance than existing methods on two benchmark cross-media datasets.
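To make the first stage concrete, below is a minimal Python sketch of graph-random-walk semantic expansion, assuming a simple bipartite image-word co-occurrence graph; all function and variable names are illustrative and not taken from the paper.

```python
import random
from collections import defaultdict

def build_graph(pairs):
    """Build an undirected image-word association graph from
    (image_id, word) co-occurrence pairs."""
    graph = defaultdict(set)
    for image_id, word in pairs:
        graph[image_id].add(word)
        graph[word].add(image_id)
    return graph

def random_walk_expand(graph, start, walk_len=4, num_walks=10, seed=0):
    """Run short random walks from `start` and count visited nodes.
    Direct neighbors are the explicit associations; nodes reached
    only via intermediate hops capture the implicit ones."""
    rng = random.Random(seed)
    visits = defaultdict(int)
    for _ in range(num_walks):
        node = start
        for _ in range(walk_len):
            neighbors = list(graph[node])
            if not neighbors:
                break
            node = rng.choice(neighbors)
            if node != start:
                visits[node] += 1
    return sorted(visits.items(), key=lambda kv: -kv[1])

# Toy usage: "img1" never co-occurs with "beach" directly,
# but a two-hop walk through "sea" can surface it.
pairs = [("img1", "sea"), ("img2", "sea"), ("img2", "beach")]
g = build_graph(pairs)
print(random_walk_expand(g, "img1"))
```

The second and third stages can be sketched jointly: cross-media attention computes fine-grained image-text relevance, a tanh-relaxed hash head produces approximately binary codes, and a modality discriminator supplies the adversarial signal. The following PyTorch sketch follows the common adversarial-hashing formulation; the dimensions, module names, and loss wiring are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossMediaAttention(nn.Module):
    """Attend image regions to text words (and vice versa) so each
    modality's representation is weighted by fine-grained
    cross-media relevance."""
    def __init__(self, dim):
        super().__init__()
        self.scale = dim ** -0.5

    def forward(self, img, txt):
        # img: (B, R, D) region features; txt: (B, W, D) word features
        attn = torch.softmax(img @ txt.transpose(1, 2) * self.scale, dim=-1)
        img_ctx = attn @ txt                   # text-aware image features
        txt_ctx = attn.transpose(1, 2) @ img   # image-aware text features
        return img_ctx.mean(1), txt_ctx.mean(1)

class HashHead(nn.Module):
    """Map a fused feature to a K-bit code; tanh is the standard
    continuous relaxation of sign() used during training."""
    def __init__(self, dim, bits):
        super().__init__()
        self.fc = nn.Linear(dim, bits)

    def forward(self, x):
        return torch.tanh(self.fc(x))

class ModalityDiscriminator(nn.Module):
    """Adversary that tries to tell which modality a code came from;
    training the encoders to fool it aligns the two code
    distributions in the shared Hamming space."""
    def __init__(self, bits):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(bits, bits), nn.ReLU(),
                                 nn.Linear(bits, 1))

    def forward(self, h):
        return self.net(h)

# Toy forward pass with assumed sizes (batch 2, 36 regions, 20 words, D=512).
B, D, K = 2, 512, 64
attn = CrossMediaAttention(D)
head = HashHead(D, K)
disc = ModalityDiscriminator(K)
img_feat, txt_feat = attn(torch.randn(B, 36, D), torch.randn(B, 20, D))
h_img, h_txt = head(img_feat), head(txt_feat)
# Discriminator loss: label image codes 1, text codes 0; the encoders
# would be optimized with the opposite objective.
d_loss = F.binary_cross_entropy_with_logits(disc(h_img), torch.ones(B, 1)) + \
         F.binary_cross_entropy_with_logits(disc(h_txt), torch.zeros(B, 1))
binary_codes = torch.sign(h_img)  # sign() quantizes at retrieval time
print(d_loss.item(), binary_codes.shape)
```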
|
Received: 28 April 2021
|
|
Fund: National Key Research and Development Program of China (No. 2018YFB1402600), National Natural Science Foundation of China (No. 61877006, 62192784), CAAI-Huawei MindSpore Open Fund (No. S2021264)
Corresponding Author:
DU Junping, Ph.D., professor. Her research interests include artificial intelligence, machine learning and pattern recognition.
|
About Authors: LIANG Meiyu, Ph.D., associate professor. Her research interests include artificial intelligence, data mining, multimedia information processing and computer vision. WANG Xiaoxiao, master. Her research interests include cross-media semantic learning and search, and deep learning.
|
|
|
|
|
|