Fast Training of Diffusion Models with Masked Transformers

Zheng, Hongkai; Nie, Weili; Vahdat, Arash; Anandkumar, Anima

Computer Science > Computer Vision and Pattern Recognition

arXiv:2306.09305 (cs)

[Submitted on 15 Jun 2023 (v1), last revised 5 Mar 2024 (this version, v2)]

Abstract: We propose an efficient approach to train large diffusion models with masked transformers. While masked transformers have been extensively explored for representation learning, their application to generative learning is less explored in the vision domain. Our work is the first to exploit masked training to reduce the training cost of diffusion models significantly. Specifically, we randomly mask out a high proportion (e.g., 50%) of patches in diffused input images during training. For masked training, we introduce an asymmetric encoder-decoder architecture consisting of a transformer encoder that operates only on unmasked patches and a lightweight transformer decoder on full patches. To promote a long-range understanding of full patches, we add an auxiliary task of reconstructing masked patches to the denoising score matching objective that learns the score of unmasked patches. Experiments on ImageNet-256x256 and ImageNet-512x512 show that our approach achieves competitive and even better generative performance than the state-of-the-art Diffusion Transformer (DiT) model, using only around 30% of its original training time. Thus, our method shows a promising way of efficiently training large transformer-based diffusion models without sacrificing the generative performance.
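The masking scheme described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the plain squared error standing in for the denoising score matching term, and the `recon_weight` value are all assumptions chosen for illustration. The sketch shows three pieces: a random split of patch indices into kept and masked sets, a step that restores the full sequence length before a decoder by filling masked slots with a placeholder token, and a loss that applies the denoising term to unmasked patches and the auxiliary reconstruction term to masked ones.

```python
import random


def random_mask(num_patches, mask_ratio, rng):
    """Randomly partition patch indices into (keep_idx, mask_idx)."""
    num_masked = int(round(num_patches * mask_ratio))
    perm = list(range(num_patches))
    rng.shuffle(perm)
    mask_idx = sorted(perm[:num_masked])
    keep_idx = sorted(perm[num_masked:])
    return keep_idx, mask_idx


def fill_with_mask_token(encoded, keep_idx, num_patches, mask_token):
    """Scatter encoder outputs back to full length.

    Slots of masked patches receive a copy of mask_token, so a decoder
    can operate on the full set of patches.
    """
    full = [list(mask_token) for _ in range(num_patches)]
    for row, idx in zip(encoded, keep_idx):
        full[idx] = list(row)
    return full


def mean_squared_error(pred, target, indices):
    """Mean squared error over the selected patch indices."""
    total, count = 0.0, 0
    for i in indices:
        for p, t in zip(pred[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


def masked_training_loss(pred, denoise_target, recon_target,
                         keep_idx, mask_idx, recon_weight=0.1):
    """Denoising term on unmasked patches plus a weighted
    reconstruction term on masked patches.

    recon_weight is a hypothetical hyperparameter, not a value
    taken from the paper.
    """
    denoise = mean_squared_error(pred, denoise_target, keep_idx)
    recon = mean_squared_error(pred, recon_target, mask_idx)
    return denoise + recon_weight * recon


# Usage: mask 50% of 16 patches; only the 8 kept patches would be encoded.
rng = random.Random(0)
keep_idx, mask_idx = random_mask(16, 0.5, rng)
encoded = [[1.0, 1.0] for _ in keep_idx]  # stand-in for encoder outputs
full = fill_with_mask_token(encoded, keep_idx, 16, [0.0, 0.0])
```

With a 50% mask ratio the encoder in this sketch sees half as many tokens per image, which is where the training-cost reduction claimed in the abstract would come from; the decoder still receives the full-length sequence.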

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2306.09305 [cs.CV]
	(or arXiv:2306.09305v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2306.09305

Submission history

From: Hongkai Zheng
[v1] Thu, 15 Jun 2023 17:38:48 UTC (9,627 KB)
[v2] Tue, 5 Mar 2024 01:10:18 UTC (16,714 KB)
