Adversarial Cross-Modal Retrieval Based on Association Constraint
GUO Qian1,3, QIAN Yuhua1,2,3, LIANG Xinyan1,3
1. Institute of Big Data Science and Industry, Shanxi University, Taiyuan 030006
2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan 030006
3. School of Computer and Information Technology, Shanxi University, Taiyuan 030006
Abstract: Existing cross-modal retrieval methods obtain retrieval results from a common subspace learned under a single index constraint, such as distance or similarity. Because subspaces learned under different index constraints differ, the retrieval results differ as well. To improve the robustness of the common subspace, an adversarial cross-modal retrieval method based on an association constraint is proposed. The adversarial constraint improves the consistency of features from different modalities by making the discriminator unable to tell which modality a subspace feature comes from. The association constraint strengthens the association between features of different modalities. The triplet loss constraint accounts for the structural information between example pairs that share the same semantics across different modalities and example pairs that have different semantics within the same modality. Experimental results on datasets show that the proposed method is more effective than the compared cross-modal retrieval methods.
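To make two of the constraints named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hinge triplet loss on squared Euclidean distances and a binary cross-entropy loss for the modality discriminator. The function names, the margin value, and the toy features are all hypothetical. The association constraint is left out because the abstract does not define its form.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge triplet loss on squared Euclidean distances, averaged over the batch.

    Assumed form: max(0, d(anchor, positive) - d(anchor, negative) + margin).
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))


def modality_discriminator_loss(p_image, is_image):
    """Binary cross-entropy of a discriminator that outputs P(feature is from an image)."""
    p = np.clip(p_image, 1e-7, 1.0 - 1e-7)  # avoid log(0)
    return float(-np.mean(is_image * np.log(p) + (1 - is_image) * np.log(1 - p)))


# Toy 2-D subspace features (hypothetical values).
img = np.array([[0.0, 0.0], [1.0, 1.0]])       # anchors: image features
txt_same = np.array([[0.0, 1.0], [1.0, 1.0]])  # positives: text, same semantics
img_diff = np.array([[1.0, 0.0], [1.0, 0.0]])  # negatives: image, different semantics

loss_triplet = triplet_loss(img, txt_same, img_diff, margin=1.0)

# A discriminator that outputs 0.5 everywhere cannot separate the modalities;
# its cross-entropy then equals ln 2.
loss_confused = modality_discriminator_loss(
    np.array([0.5, 0.5, 0.5, 0.5]), np.array([1, 1, 0, 0])
)

print(loss_triplet, loss_confused)
```

In this sketch the feature projectors would be trained to push the discriminator toward the confused case, while the triplet term pulls same-semantics pairs from different modalities together and pushes different-semantics pairs from the same modality apart.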