Unified Auto-Encoding with Masked Diffusion

Hansen-Estruch, Philippe; Vishwanath, Sriram; Zhang, Amy; Tomar, Manan

arXiv:2406.17688 (cs)

[Submitted on 25 Jun 2024]

Abstract: At the core of both successful generative and self-supervised representation learning models, there is a reconstruction objective that incorporates some form of image corruption. Diffusion models implement this approach through a scheduled Gaussian corruption process, while masked auto-encoder models do so by masking patches of the image. Despite their different approaches, the underlying similarity in their methodologies suggests a promising avenue for an auto-encoder capable of both de-noising tasks. We propose a unified self-supervised objective, dubbed Unified Masked Diffusion (UMD), that combines patch-based and noise-based corruption techniques within a single auto-encoding framework. Specifically, UMD modifies the diffusion transformer (DiT) training process by introducing an additional noise-free, high masking representation step in the diffusion noising schedule, and utilizes a mixed masked and noised image for subsequent timesteps. By integrating features useful for diffusion modeling and for predicting masked patch tokens, UMD achieves strong performance in downstream generative and representation learning tasks, including linear probing and class-conditional generation. This is achieved without the need for heavy data augmentations, multiple views, or additional encoders. Furthermore, UMD improves on the computational efficiency of prior diffusion-based methods in total training time. We release our code at this https URL.
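
The corruption scheme described in the abstract (a noise-free, high-masking step added to the noising schedule, with mixed masking and Gaussian noise at later timesteps) might be sketched roughly as follows. This is an illustrative sketch only, not the authors' released implementation: the function name umd_corrupt, the masking ratios 0.75 and 0.25, the cosine-style noise schedule, and the choice to index the noise-free step as t = 0 are all assumptions made for the example and do not come from the abstract.

```python
import math
import random


def umd_corrupt(patches, t, num_steps=1000, high_mask=0.75, low_mask=0.25, rng=None):
    """Corrupt a list of patch vectors for timestep t (hypothetical sketch).

    t == 0 : noise-free step with a high masking ratio (MAE-like).
    t >= 1 : Gaussian noise on the visible patches plus a lower masking ratio.
    Returns (visible_patches, kept_indices, masked_indices).
    """
    if rng is None:
        rng = random.Random()
    n = len(patches)

    # Assumed ratios: heavy masking only at the extra noise-free step.
    ratio = high_mask if t == 0 else low_mask
    num_masked = int(round(ratio * n))

    # Pick a random subset of patch positions to mask out.
    order = list(range(n))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    kept = sorted(order[num_masked:])

    # Assumed cosine-style schedule; alpha_bar = 1 means no noise at all.
    if t == 0:
        alpha_bar = 1.0
    else:
        alpha_bar = math.cos(0.5 * math.pi * t / num_steps) ** 2
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)

    # Standard forward-diffusion mixing, applied to the visible patches only.
    visible = [
        [signal * v + noise * rng.gauss(0.0, 1.0) for v in patches[i]]
        for i in kept
    ]
    return visible, kept, masked


if __name__ == "__main__":
    toy_patches = [[float(i), float(i) + 0.5] for i in range(16)]
    clean_visible, kept, masked = umd_corrupt(toy_patches, t=0, rng=random.Random(0))
    print(len(kept), "visible patches,", len(masked), "masked patches at t=0")
```

In this sketch, a training loop would sample t uniformly from 0 to num_steps, so the same encoder sees both the masked-only input and the mixed masked and noised inputs. How the actual method samples timesteps and sets masking ratios should be checked against the paper and its released code.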

Comments:	19 Pages, 8 Figures, 3 Tables
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
ACM classes:	I.2.10
Cite as:	arXiv:2406.17688 [cs.CV]
	(or arXiv:2406.17688v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2406.17688

Submission history

From: Philippe Hansen-Estruch
[v1] Tue, 25 Jun 2024 16:24:34 UTC (32,394 KB)
