E-MD3C: Taming Masked Diffusion Transformers for Efficient Zero-Shot Object Customization

Pham, Trung X.; Kang, Zhang; Hong, Ji Woo; Zheng, Xuran; Yoo, Chang D.

arXiv:2502.09164 (cs)

[Submitted on 13 Feb 2025]

Abstract: We propose E-MD3C ($\underline{E}$fficient $\underline{M}$asked $\underline{D}$iffusion Transformer with Disentangled $\underline{C}$onditions and $\underline{C}$ompact $\underline{C}$ollector), a highly efficient framework for zero-shot object image customization. Unlike prior works that rely on resource-intensive Unet architectures, our approach employs lightweight masked diffusion transformers operating on latent patches, which significantly improves computational efficiency. The framework integrates three core components: (1) an efficient masked diffusion transformer for processing autoencoder latents, (2) a disentangled condition design that ensures compactness while preserving background alignment and fine details, and (3) a learnable Conditions Collector that consolidates multiple inputs into a compact representation for efficient denoising and learning. E-MD3C outperforms the existing approach on the VITON-HD dataset across metrics such as PSNR, FID, SSIM, and LPIPS, while also showing clear advantages in parameter count, memory efficiency, and inference speed. With only about $\frac{1}{4}$ of the parameters, our Transformer-based 468M model delivers $2.5\times$ faster inference and uses $\frac{2}{3}$ of the GPU memory compared to a 1720M Unet-based latent diffusion model.
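The efficiency figures in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above (468M and 1720M parameters, $2.5\times$ speedup, $\frac{2}{3}$ memory); the baseline latency and memory values are hypothetical placeholders chosen for illustration and do not come from the paper.

```python
# Parameter counts quoted in the abstract, in millions.
emd3c_params_m = 468
unet_params_m = 1720

# Ratio of model sizes; the abstract rounds this to 1/4.
param_ratio = emd3c_params_m / unet_params_m

# Relative factors stated in the abstract.
speedup = 2.5
memory_fraction = 2 / 3

# Hypothetical baseline measurements (NOT from the paper), used only
# to show how the stated factors would scale a Unet-based baseline.
baseline_latency_s = 10.0
baseline_memory_gb = 24.0

est_latency_s = baseline_latency_s / speedup
est_memory_gb = baseline_memory_gb * memory_fraction

print(f"parameter ratio: {param_ratio:.3f}")
print(f"estimated latency: {est_latency_s:.1f} s")
print(f"estimated memory: {est_memory_gb:.1f} GB")
```

The computed ratio is slightly above one quarter, so the "$\frac{1}{4}$ of the parameters" in the abstract is best read as an approximation.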

Comments:	16 pages, 14 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2502.09164 [cs.CV]
	(or arXiv:2502.09164v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2502.09164

Submission history

From: Trung Pham
[v1] Thu, 13 Feb 2025 10:48:11 UTC (7,379 KB)
