Efficient Scaling of Diffusion Transformers for Text-to-Image Generation

Li, Hao; Lal, Shamit; Li, Zhiheng; Xie, Yusheng; Wang, Ying; Zou, Yang; Majumder, Orchid; Manmatha, R.; Tu, Zhuowen; Ermon, Stefano; Soatto, Stefano; Swaminathan, Ashwin

Computer Science > Computer Vision and Pattern Recognition

arXiv:2412.12391 (cs)

[Submitted on 16 Dec 2024]

Abstract: We empirically study the scaling properties of various Diffusion Transformers (DiTs) for text-to-image generation by performing extensive and rigorous ablations, including training scaled DiTs ranging from 0.3B up to 8B parameters on datasets of up to 600M images. We find that U-ViT, a pure self-attention-based DiT model, provides a simpler design and scales more effectively than cross-attention-based DiT variants, and that this design allows straightforward expansion to extra conditions and other modalities. We identify that a 2.3B U-ViT model can outperform the SDXL UNet and other DiT variants in a controlled setting. On the data scaling side, we investigate how increasing dataset size and using enhanced long captions improve text-image alignment performance and learning efficiency.
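
To make the contrast between the two conditioning styles concrete, the toy sketch below may help. It is not the authors' implementation: it assumes that "pure self-attention" means text (and any extra condition) tokens are concatenated with image tokens into one sequence, and that the cross-attention variant lets image tokens query text tokens in a separate step. All function names are hypothetical, and learned projections, residual connections, and normalization are omitted.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return out


def self_attention_conditioning(image_tokens, *condition_token_sets):
    """Assumed self-attention style: one joint sequence of image and condition tokens."""
    sequence = list(image_tokens)
    for tokens in condition_token_sets:
        sequence.extend(tokens)
    mixed = attend(sequence, sequence, sequence)
    return mixed[:len(image_tokens)]  # keep only the image positions


def cross_attention_conditioning(image_tokens, text_tokens):
    """Assumed cross-attention style: image self-attention, then a separate text lookup."""
    hidden = attend(image_tokens, image_tokens, image_tokens)
    return attend(hidden, text_tokens, text_tokens)


def random_tokens(count, dim=4):
    """Hypothetical stand-in for embedded tokens."""
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


random.seed(0)
image = random_tokens(6)
text = random_tokens(3)
extra = random_tokens(2)  # hypothetical additional condition (for example, another modality)

joint = self_attention_conditioning(image, text)
joint_extra = self_attention_conditioning(image, text, extra)  # same function, more tokens
cross = cross_attention_conditioning(image, text)
```

Under these assumptions, adding a condition to the self-attention variant only lengthens the token sequence, whereas the cross-attention variant would need its interface changed (or another attention step added) to accept the extra input.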

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2412.12391 [cs.CV]
	(or arXiv:2412.12391v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2412.12391

Submission history

From: Hao Li
[v1] Mon, 16 Dec 2024 22:59:26 UTC (8,992 KB)
