Cross-Modal Interactive Image Editing Based on Bidirectional Collaboration with Large Language Models
SHI Hui1, JIN Conghui1
1. School of Computer Science and Artificial Intelligence, Liaoning Normal University, Dalian 116029
Abstract Diffusion models exhibit high visual fidelity in image generation tasks. However, they face critical challenges in image editing, such as ambiguity in interpreting user intent, insufficient control over local details, and sluggish interactive response. To address these issues, a cross-modal interactive image editing method based on bidirectional collaboration with large language models (BiC-LLM) is proposed. At its core, a bidirectional collaboration mechanism combines top-down semantic guidance from large language models with bottom-up direct interaction from users, fundamentally enhancing controllability and precision in image editing through semantic enhancement, feature decoupling and a dynamic feedback mechanism. First, a hierarchical semantic-driven module is designed: the user-input text is decoupled and reasoned over by the large language model, and fine-grained semantic vectors are generated to interpret user intent precisely. Second, a dynamic control module for vision-structure decoupling is constructed, combining multi-level visual feature extractors with object-level modeling to achieve independent control over global structure and local appearance. Finally, a real-time interaction mechanism enables users to intervene dynamically in the editing process through mask annotations and parameter adjustments, thereby supporting iterative optimization. Experiments on the LSUN, CelebA-HQ and COCO datasets demonstrate that BiC-LLM significantly outperforms baseline models in textual consistency, structural stability and interactive controllability. Moreover, BiC-LLM effectively supports multi-object semantic editing in complex scenes while preserving the integrity of unedited regions, demonstrating its robustness and effectiveness in image editing tasks.
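The three-stage workflow summarized above (LLM-based intent decomposition, mask-constrained local editing, and iterative user feedback) can be illustrated at a toy level. The sketch below is purely schematic: every name and data structure is an assumption introduced for illustration, not the authors' implementation, and a dictionary stands in for an image so that the mask-preservation property is easy to check.

```python
# Toy sketch of the BiC-LLM editing loop described in the abstract.
# All names are illustrative assumptions, not the authors' code.

def decompose_intent(prompt):
    """Stand-in for the hierarchical semantic-driven module:
    split an instruction of the form 'region: old -> new' into a
    structured intent, as an LLM might when decoupling user text."""
    region, change = prompt.split(":", 1)
    src, dst = (s.strip() for s in change.split("->"))
    return {"region": region.strip(), "from": src, "to": dst}

def edit_image(image, intents, mask=None):
    """Stand-in for the vision-structure decoupling module: apply each
    intent only inside the user-annotated mask, leaving all other
    regions untouched (the unedited-region preservation property)."""
    edited = dict(image)  # toy 'image': {region_name: appearance}
    for it in intents:
        region = it["region"]
        if (mask is None or region in mask) and edited.get(region) == it["from"]:
            edited[region] = it["to"]
    return edited

# One round of the real-time interaction mechanism: the user supplies
# a text instruction plus a mask, inspects the result, and could then
# adjust either and re-run (iterative optimization).
image = {"hair": "black", "sky": "cloudy"}
intents = [decompose_intent("hair: black -> blond")]
result = edit_image(image, intents, mask={"hair"})
```

In a real diffusion-based system the mask would gate which latent regions are re-denoised, but the control flow (parse intent, edit only inside the mask, loop with user feedback) is the same.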
Received: 25 June 2025
Fund: Supported by National Natural Science Foundation of China (No. 61601214, 61976109), Project of Educational Department of Liaoning Province (No. JYTMS20231039), and Educational Science Planning Project of Liaoning Province (No. JG22CB252)
Corresponding Author:
SHI Hui, Ph.D., associate professor. Her research interests include AI security, information security, machine learning, and image processing.
About author: JIN Conghui, Master student. Her research interests include AI security, machine learning, and image processing.