LiT: Delving into a Simplified Linear Diffusion Transformer for Image Generation

Wang, Jiahao; Kang, Ning; Yao, Lewei; Chen, Mengzhao; Wu, Chengyue; Zhang, Songyang; Xue, Shuchen; Liu, Yong; Wu, Taiqiang; Liu, Xihui; Zhang, Kaipeng; Zhang, Shifeng; Shao, Wenqi; Li, Zhenguo; Luo, Ping

arXiv:2501.12976 (cs)

[Submitted on 22 Jan 2025]

Abstract: Among commonly used sub-quadratic complexity modules, linear attention benefits from simplicity and high parallelism, making it promising for image synthesis tasks. However, the architectural design and learning strategy for linear attention remain underexplored in this field. In this paper, we offer a suite of ready-to-use solutions for efficient linear diffusion Transformers. Our core contributions include: (1) Simplified linear attention using few heads, where we observe a free-lunch effect: performance improves without a latency increase. (2) Weight inheritance from a fully pre-trained diffusion Transformer: initializing the linear Transformer from a pre-trained diffusion Transformer and loading all parameters except those related to linear attention. (3) A hybrid knowledge distillation objective: using a pre-trained diffusion Transformer to guide the training of the student linear Transformer, supervising not only the predicted noise but also the variance of the reverse diffusion process. These guidelines lead to our proposed Linear Diffusion Transformer (LiT), an efficient text-to-image Transformer that can be deployed offline on a laptop. Experiments show that on the class-conditional 256×256 and 512×512 ImageNet benchmarks, LiT achieves highly competitive FID while reducing training steps by 80% and 77%, respectively, compared to DiT. LiT also rivals methods based on Mamba or Gated Linear Attention. In addition, for text-to-image generation, LiT allows for the rapid synthesis of photorealistic images at up to 1K resolution. Project page: this https URL.
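To make the "linear attention" idea in the abstract concrete, the following is a minimal single-head sketch in plain Python. It is an illustration under stated assumptions, not the LiT implementation: the ReLU feature map, the `eps` stabilizer, and all function names are hypothetical choices made here, and the paper's few-head design and any additional components are omitted. The sketch shows only the generic reordering that gives linear attention its sub-quadratic cost: computing φ(K)ᵀV first avoids forming the N×N score matrix.

```python
from typing import List

Matrix = List[List[float]]


def relu_feature(x: Matrix) -> Matrix:
    # Assumed feature map phi(x) = ReLU(x); other kernels are possible.
    return [[max(v, 0.0) for v in row] for row in x]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def linear_attention(q: Matrix, k: Matrix, v: Matrix, eps: float = 1e-6) -> Matrix:
    """Computes phi(Q) (phi(K)^T V), normalized by phi(Q) (phi(K)^T 1).

    For N tokens and feature size d, the cost grows linearly in N
    because the N x N score matrix is never materialized.
    """
    q_f, k_f = relu_feature(q), relu_feature(k)
    kv = matmul(transpose(k_f), v)            # d x d_v summary of keys and values
    k_sum = [sum(col) for col in zip(*k_f)]   # length-d vector phi(K)^T 1
    out = []
    for q_row, num_row in zip(q_f, matmul(q_f, kv)):
        denom = sum(qi * ki for qi, ki in zip(q_row, k_sum)) + eps
        out.append([x / denom for x in num_row])
    return out


def quadratic_reference(q: Matrix, k: Matrix, v: Matrix, eps: float = 1e-6) -> Matrix:
    """Same result computed the quadratic way, via the N x N score matrix."""
    q_f, k_f = relu_feature(q), relu_feature(k)
    scores = matmul(q_f, transpose(k_f))      # N x N
    out = []
    for s_row in scores:
        denom = sum(s_row) + eps
        weights = [s / denom for s in s_row]
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                    for j in range(len(v[0]))])
    return out
```

Both functions produce the same output up to floating-point rounding, because matrix multiplication is associative. Only the order of operations, and therefore the cost as the number of image tokens grows, differs.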

Comments:	21 pages, 12 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2501.12976 [cs.CV]
	(or arXiv:2501.12976v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.12976

Submission history

From: Jiahao Wang
[v1] Wed, 22 Jan 2025 16:02:06 UTC (4,380 KB)
