Image Captioning Based on Cross-Modal Prior Injection
JIANG Zetao1, ZHANG Luhao1, PAN Yiwei1, LI Mengtong1, YANG Jianchen1
1. School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004
Abstract Combining semantic information from the text and image modalities is one of the key challenges in image captioning. However, existing image captioning methods often treat textual information merely as a constraint in the decoding stage, or simply concatenate text features with image features. The resulting cross-modal interaction between text and image is insufficient and a modality gap arises, so the semantic information contained in the text cannot be fully exploited in the encoding stage. To address this issue, a method for image captioning based on cross-modal prior injection (CMPI) is proposed. First, textual prior knowledge is extracted through contrastive language-image pre-training (CLIP). Then, the textual prior knowledge undergoes a first modal interaction with a modal medium, yielding cross-modal features that carry both textual and image semantic information. Finally, a second modal interaction is performed between the cross-modal features and the grid features of the image. With the cross-modal features as a medium, the textual prior knowledge is injected into the image features. In this way, the semantic information of the text is incorporated without damaging the structure of the image features, and the modality gap is alleviated. Experimental results on the Karpathy split of the MSCOCO dataset show that CMPI achieves a CIDEr score of 128.0 in the first training stage and 140.5 in the second training stage, demonstrating a clear advantage.
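The two-stage injection pipeline described in the abstract can be illustrated with a compact sketch. The PyTorch code below is an illustrative reconstruction only, not the authors' implementation: the use of multi-head cross-attention for both interactions, the learnable modal-medium tokens, the residual updates, and all module names and dimensions are assumptions.

```python
# Minimal sketch of cross-modal prior injection (CMPI) as described above.
# Hypothetical reconstruction: cross-attention for both interactions and
# learnable "modal medium" tokens are assumptions, not the paper's code.
import torch
import torch.nn as nn


class CrossModalPriorInjection(nn.Module):
    def __init__(self, dim=512, num_medium=16, num_heads=8):
        super().__init__()
        # Learnable modal-medium tokens that bridge text and image.
        self.medium = nn.Parameter(torch.randn(num_medium, dim))
        # First interaction: the medium attends to CLIP textual priors.
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Second interaction: image grid features attend to the
        # cross-modal features, injecting textual semantics.
        self.inject_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, text_prior, grid_feats):
        # text_prior: (B, T, dim) textual prior knowledge from CLIP.
        # grid_feats: (B, HW, dim) image grid features.
        B = text_prior.size(0)
        medium = self.medium.unsqueeze(0).expand(B, -1, -1)
        # Stage 1: cross-modal features carrying text semantics.
        cross, _ = self.text_attn(medium, text_prior, text_prior)
        cross = self.norm1(medium + cross)
        # Stage 2: inject the prior into the grid features; the residual
        # update leaves the spatial structure of the image features intact.
        injected, _ = self.inject_attn(grid_feats, cross, cross)
        return self.norm2(grid_feats + injected)


if __name__ == "__main__":
    model = CrossModalPriorInjection()
    text = torch.randn(2, 20, 512)   # e.g. CLIP text token features
    grid = torch.randn(2, 49, 512)   # e.g. 7x7 image grid features
    out = model(text, grid)
    print(out.shape)                 # torch.Size([2, 49, 512])
```

The residual form of the second stage reflects the claim that text semantics are added without overwriting the image feature structure; the exact fusion operator in the paper may differ.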
Received: 30 December 2024
Fund: National Natural Science Foundation of China (No.62473105, 62172118), Key Project of Natural Science Foundation of Guangxi Province (No.2021GXNSFDA196002), Program of Guangxi Key Laboratory of Image and Graphic Intelligent Processing (No.GIIP2302, GIIP2303, GIIP2304), Innovation Project of GUET Graduate Education (No.2024YCXB09, 2024YCXS039, 2024YCXS035, 2023YCXS046)
Corresponding Author:
JIANG Zetao, Ph.D., professor. His research interests include image processing, computer vision and artificial intelligence.
About authors: ZHANG Luhao, Master student. His research interests include computer vision and image captioning. PAN Yiwei, Master student. His research interests include computer vision and semantic segmentation. LI Mengtong, Master student. His research interests include computer vision and low-illumination image enhancement. YANG Jianchen, Master student. His research interests include computer vision and low-illumination image enhancement.