|
|
Cross-Media Fine-Grained Representation Learning Based on Multi-modal Graph and Adversarial Hash Attention Network |
LIANG Meiyu1, WANG Xiaoxiao1, DU Junping1 |
1. Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia, School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876, China
|
|
Abstract Cross-media search is hindered by the feature heterogeneity and semantic gap between data of different media types, and social network data often exhibits semantic sparsity and diversity. To address these problems, a cross-media fine-grained representation learning model based on a multi-modal graph and an adversarial hash attention network (CMFAH) is proposed to obtain a unified cross-media semantic representation, and it is applied to social network cross-media search. Firstly, an image-word association graph is constructed, and direct and implicit semantic associations between images and text words are mined by a graph random walk strategy to expand the semantic relations. Secondly, a cross-media fine-grained feature learning network is constructed, and the fine-grained semantic associations between images and texts are learned collaboratively through a cross-media attention mechanism. Finally, a cross-media adversarial hash network is constructed, and an efficient and compact unified cross-media hash semantic representation is obtained by joint cross-media fine-grained semantic association learning and adversarial hash learning. Experimental results show that CMFAH achieves better cross-media search performance than existing methods on two benchmark cross-media datasets.
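To make the first stage concrete, below is a minimal Python sketch of graph-random-walk semantic expansion, assuming a simple bipartite image-word co-occurrence graph; all function and variable names are illustrative and not taken from the paper.

```python
import random
from collections import defaultdict

def build_graph(pairs):
    """Build an undirected image-word association graph from
    (image_id, word) co-occurrence pairs."""
    graph = defaultdict(set)
    for image_id, word in pairs:
        graph[image_id].add(word)
        graph[word].add(image_id)
    return graph

def random_walk_expand(graph, start, walk_len=4, num_walks=10, seed=0):
    """Run short random walks from `start` and count visited nodes.
    Direct neighbors are the explicit associations; nodes reached
    only via intermediate hops capture the implicit ones."""
    rng = random.Random(seed)
    visits = defaultdict(int)
    for _ in range(num_walks):
        node = start
        for _ in range(walk_len):
            neighbors = list(graph[node])
            if not neighbors:
                break
            node = rng.choice(neighbors)
            if node != start:
                visits[node] += 1
    return sorted(visits.items(), key=lambda kv: -kv[1])

# Toy usage: "img1" never co-occurs with "beach" directly,
# but a two-hop walk through "sea" can surface it.
pairs = [("img1", "sea"), ("img2", "sea"), ("img2", "beach")]
g = build_graph(pairs)
print(random_walk_expand(g, "img1"))
```

The second and third stages can be sketched jointly: cross-media attention computes fine-grained image-text relevance, a tanh-relaxed hash head produces approximately binary codes, and a modality discriminator supplies the adversarial signal. The following PyTorch sketch follows the common adversarial-hashing formulation; the dimensions, module names, and loss wiring are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossMediaAttention(nn.Module):
    """Attend image regions to text words (and vice versa) so each
    modality's representation is weighted by fine-grained
    cross-media relevance."""
    def __init__(self, dim):
        super().__init__()
        self.scale = dim ** -0.5

    def forward(self, img, txt):
        # img: (B, R, D) region features; txt: (B, W, D) word features
        attn = torch.softmax(img @ txt.transpose(1, 2) * self.scale, dim=-1)
        img_ctx = attn @ txt                   # text-aware image features
        txt_ctx = attn.transpose(1, 2) @ img   # image-aware text features
        return img_ctx.mean(1), txt_ctx.mean(1)

class HashHead(nn.Module):
    """Map a fused feature to a K-bit code; tanh is the standard
    continuous relaxation of sign() used during training."""
    def __init__(self, dim, bits):
        super().__init__()
        self.fc = nn.Linear(dim, bits)

    def forward(self, x):
        return torch.tanh(self.fc(x))

class ModalityDiscriminator(nn.Module):
    """Adversary that tries to tell which modality a code came from;
    training the encoders to fool it aligns the two code
    distributions in the shared Hamming space."""
    def __init__(self, bits):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(bits, bits), nn.ReLU(),
                                 nn.Linear(bits, 1))

    def forward(self, h):
        return self.net(h)

# Toy forward pass with assumed sizes (batch 2, 36 regions, 20 words, D=512).
B, D, K = 2, 512, 64
attn = CrossMediaAttention(D)
head = HashHead(D, K)
disc = ModalityDiscriminator(K)
img_feat, txt_feat = attn(torch.randn(B, 36, D), torch.randn(B, 20, D))
h_img, h_txt = head(img_feat), head(txt_feat)
# Discriminator loss: label image codes 1, text codes 0; the encoders
# would be optimized with the opposite objective.
d_loss = F.binary_cross_entropy_with_logits(disc(h_img), torch.ones(B, 1)) + \
         F.binary_cross_entropy_with_logits(disc(h_txt), torch.zeros(B, 1))
binary_codes = torch.sign(h_img)  # sign() quantizes at retrieval time
print(d_loss.item(), binary_codes.shape)
```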
|
Received: 28 April 2021
|
|
Fund: National Key Research and Development Program of China (No. 2018YFB1402600), National Natural Science Foundation of China (No. 61877006, 62192784), CAAI-Huawei MindSpore Open Fund (No. S2021264)
Corresponding Author:
DU Junping, Ph.D., professor. Her research interests include artificial intelligence, machine learning and pattern recognition.
|
About Authors: LIANG Meiyu, Ph.D., associate professor. Her research interests include artificial intelligence, data mining, multimedia information processing and computer vision. WANG Xiaoxiao, master. Her research interests include cross-media semantic learning and search, and deep learning.
|
|
|
|
|
|