Text-to-Image Generation via Dual Optimization Stable Diffusion Model
HUANG Jinjie1,2, LIU Bin1,2
1. School of Automation, Harbin University of Science and Technology, Harbin 150080; 2. Heilongjiang Provincial Key Laboratory of Complex Intelligent System and Integration, Harbin University of Science and Technology, Harbin 150080
Abstract: The stable diffusion (SD) model cannot ensure full alignment between generated images and input textual prompts when the prompts contain multiple objects, and completely retraining the SD model requires enormous computational resources. To address these problems, a training-free method, text-to-image generation via a dual-optimization stable diffusion model (DualOpt-SD), is proposed. First, a layout-to-image generation (L2I) model is integrated with a text-to-image generation (T2I) model in a generation framework built on a pre-trained SD model. Then, a dual optimization (DualOpt) strategy is designed to optimize the noise predicted by the model during inference. DualOpt consists of two parts: one dynamically adjusts the prior knowledge learned by the L2I and T2I models according to attention scores, and the other accounts for the requirements of different denoising stages by assigning varying attention to L2I and T2I. Experiments demonstrate that when the text prompt contains multiple objects, DualOpt-SD improves compositional accuracy while preserving the strong interpretative capability of the SD model. Furthermore, DualOpt-SD achieves higher overall image generation performance and produces images with high realism and reasonable object placement.
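A minimal sketch of the DualOpt fusion idea summarized above is given below in Python/PyTorch. It only illustrates the two-part weighting of the L2I and T2I noise predictions during inference; the function name dualopt_noise, the softmax weighting over mean attention scores, and the linear stage schedule are assumptions made here for illustration and are not the paper's exact formulation.

import torch

def dualopt_noise(eps_t2i, eps_l2i, attn_t2i, attn_l2i, t, T):
    # eps_t2i / eps_l2i: noise predicted by the T2I and L2I branches at step t.
    # attn_t2i / attn_l2i: cross-attention scores for the prompt objects in each branch.
    # t, T: current denoising step (t = T is the noisiest step) and total number of steps.

    # Part 1 (assumed form): dynamic branch weights derived from attention scores.
    scores = torch.stack([attn_t2i.mean(), attn_l2i.mean()])
    w_t2i, w_l2i = torch.softmax(scores, dim=0)

    # Part 2 (assumed form): stage-dependent emphasis -- layout guidance early in
    # denoising (coarse structure), text guidance later (appearance detail),
    # realized here as a simple linear schedule.
    progress = t / T
    a_l2i = w_l2i * progress
    a_t2i = w_t2i * (1.0 - progress)

    # Fuse the two noise predictions with the normalized combined weights.
    return (a_l2i * eps_l2i + a_t2i * eps_t2i) / (a_l2i + a_t2i)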