LEDiT: Your Length-Extrapolatable Diffusion Transformer without Positional Encoding

Zhang, Shen; Tan, Yaning; Liang, Siyuan; Li, Linze; Wu, Ge; Chen, Yuhao; Li, Shuheng; Zhao, Zhenyu; Chen, Caihua; Liang, Jiajun; Tang, Yao

arXiv:2503.04344 (cs)

[Submitted on 6 Mar 2025]

Abstract: Diffusion transformers (DiTs) struggle to generate images at resolutions higher than their training resolution. The primary obstacle is that explicit positional encodings (PEs), such as RoPE, must be extrapolated, which degrades performance when the inference resolution differs from the training resolution. In this paper, we propose the Length-Extrapolatable Diffusion Transformer (LEDiT), a simple yet powerful architecture that overcomes this limitation. LEDiT needs no explicit PEs, thereby avoiding extrapolation. Its key innovations are introducing causal attention to implicitly impart global positional information to tokens and enhancing locality to precisely distinguish adjacent tokens. Experiments on 256x256 and 512x512 ImageNet show that LEDiT can scale the inference resolution to 512x512 and 1024x1024, respectively, while achieving better image quality than current state-of-the-art length extrapolation methods (NTK-aware, YaRN). Moreover, LEDiT achieves strong extrapolation performance with just 100K steps of fine-tuning on a pretrained DiT, demonstrating its potential for integration into existing text-to-image DiTs.
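
The claim that causal attention can carry positional information without any explicit PE can be illustrated with a small experiment. The sketch below is not the LEDiT implementation: it assumes a single attention head with identity query/key/value projections, and the helper names (`attention`, `max_abs_diff`, `equivariance_gap`) are hypothetical. It checks a general property: full (bidirectional) attention without PE is permutation-equivariant, so token order is invisible to it, whereas a causal mask breaks that symmetry and makes each output depend on where the token sits in the sequence. The locality enhancement mentioned in the abstract is not modeled here.

```python
import math
import random


def attention(tokens, causal):
    """Single-head self-attention, identity Q/K/V projections, no positional encoding."""
    n, d = len(tokens), len(tokens[0])
    out = []
    for i in range(n):
        # With a causal mask, token i may only attend to tokens 0..i.
        visible = range(i + 1) if causal else range(n)
        scores = [
            sum(a * b for a, b in zip(tokens[i], tokens[j])) / math.sqrt(d)
            for j in visible
        ]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * tokens[j][k] for w, j in zip(weights, visible)) / total
            for k in range(d)
        ])
    return out


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two token lists."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def equivariance_gap(tokens, perm, causal):
    """Distance between attention(permuted tokens) and permuted attention(tokens)."""
    base = attention(tokens, causal)
    permuted_then_attended = attention([tokens[p] for p in perm], causal)
    attended_then_permuted = [base[p] for p in perm]
    return max_abs_diff(permuted_then_attended, attended_then_permuted)


random.seed(0)
tokens = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(6)]
perm = list(reversed(range(len(tokens))))  # reverse the token order

full_gap = equivariance_gap(tokens, perm, causal=False)
causal_gap = equivariance_gap(tokens, perm, causal=True)

print("full attention gap:  ", full_gap)    # near zero: order is invisible
print("causal attention gap:", causal_gap)  # clearly non-zero: order matters
```

A near-zero gap for full attention means the layer cannot tell token orderings apart without a PE, while the non-zero gap under the causal mask shows that the mask alone makes outputs position-dependent.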

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2503.04344 [cs.CV]
	(or arXiv:2503.04344v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.04344

Submission history

From: Shen Zhang
[v1] Thu, 6 Mar 2025 11:41:36 UTC (17,417 KB)
