CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models

Zhang, Gaoyang; Fu, Bingtao; Fan, Qingnan; Zhang, Qi; Liu, Runxing; Gu, Hong; Zhang, Huaqi; Liu, Xinguo

arXiv:2412.13195 (cs)

[Submitted on 17 Dec 2024]

Abstract: Text-to-image (T2I) diffusion models excel at generating photorealistic images, but commonly struggle to render the spatial relationships described in text prompts accurately. We identify two core issues underlying this failure: 1) the ambiguous nature of spatial-related data in existing datasets, and 2) the inability of current text encoders to accurately interpret the spatial semantics of input descriptions. We address these issues with CoMPaSS, a versatile training framework that enhances the spatial understanding of any T2I diffusion model. CoMPaSS resolves the ambiguity of spatial-related data with the Spatial Constraints-Oriented Pairing (SCOP) data engine, which curates spatially accurate training data through a set of principled spatial constraints. To better exploit the curated high-quality spatial priors, CoMPaSS further introduces a Token ENcoding ORdering (TENOR) module, which effectively compensates for the shortcomings of text encoders. Extensive experiments on four popular open-weight T2I diffusion models, covering both UNet- and MMDiT-based architectures, demonstrate the effectiveness of CoMPaSS: it sets a new state of the art with substantial relative gains across well-known benchmarks on spatial relationship generation, including VISOR (+98%), T2I-CompBench Spatial (+67%), and GenEval Position (+131%). Code will be available at this https URL.

Comments:	18 pages, 11 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2412.13195 [cs.CV]
	(or arXiv:2412.13195v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2412.13195

Submission history

From: Gaoyang Zhang
[v1] Tue, 17 Dec 2024 18:59:50 UTC (10,131 KB)
