LLM-guided Instance-level Image Manipulation with Diffusion U-Net Cross-Attention Maps

Palaev, Andrey; Khan, Adil; Kazmi, Syed M. Ahsan

arXiv:2501.14046 (cs)

[Submitted on 23 Jan 2025]

Abstract: The advancement of text-to-image synthesis has introduced powerful generative models capable of creating realistic images from textual prompts. However, precise control over image attributes remains challenging, especially at the instance level. While existing methods offer some control through fine-tuning or auxiliary information, they often face limitations in flexibility and accuracy. To address these challenges, we propose a pipeline leveraging Large Language Models (LLMs), open-vocabulary detectors, cross-attention maps and intermediate activations of diffusion U-Net for instance-level image manipulation. Our method detects objects mentioned in the prompt and present in the generated image, enabling precise manipulation without extensive training or input masks. By incorporating cross-attention maps, our approach ensures coherence in manipulated images while controlling object positions. Our method enables precise manipulations at the instance level without fine-tuning or auxiliary information such as masks or bounding boxes. Code is available at this https URL

Comments:	Presented at BMVC 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2501.14046 [cs.CV]
	(or arXiv:2501.14046v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.14046

Submission history

From: Andrey Palaev
[v1] Thu, 23 Jan 2025 19:26:14 UTC (16,027 KB)
