Discrete Contrastive Diffusion for Cross-Modal Music and Image Generation

Zhu, Ye; Wu, Yu; Olszewski, Kyle; Ren, Jian; Tulyakov, Sergey; Yan, Yan

arXiv:2206.07771 (cs)

[Submitted on 15 Jun 2022 (v1), last revised 16 Feb 2023 (this version, v2)]

Abstract: Diffusion probabilistic models (DPMs) have become a popular approach to conditional generation, due to their promising results and support for cross-modal synthesis. A key desideratum in conditional synthesis is to achieve high correspondence between the conditioning input and generated output. Most existing methods learn such relationships implicitly, by incorporating the prior into the variational lower bound. In this work, we take a different route -- we explicitly enhance input-output connections by maximizing their mutual information. To this end, we introduce a Conditional Discrete Contrastive Diffusion (CDCD) loss and design two contrastive diffusion mechanisms to effectively incorporate it into the denoising process. By connecting the CDCD loss with the conventional variational objectives, we combine diffusion training and contrastive learning for the first time. We demonstrate the efficacy of our approach in evaluations on diverse multimodal conditional synthesis tasks: dance-to-music generation, text-to-image synthesis, and class-conditioned image synthesis. On each, we enhance the input-output correspondence and achieve higher or competitive general synthesis quality. Furthermore, the proposed approach improves the convergence of diffusion models, reducing the number of required diffusion steps by more than 35% on two benchmarks and significantly increasing the inference speed.
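
The abstract describes strengthening input-output correspondence by maximizing mutual information through a contrastive loss. As a rough illustration of that general idea only, the sketch below implements a generic InfoNCE-style contrastive loss and the standard bound I(X;Y) >= log(N) - loss that ties it to mutual information. It is not the paper's CDCD loss: the function names, the score lists, and the choice of InfoNCE itself are assumptions made here for illustration.

```python
import math


def info_nce_loss(scores, positive_index=0):
    """Generic InfoNCE loss for one conditioning input (illustrative, not CDCD).

    `scores` holds hypothetical similarity scores between the conditioning
    input and N candidate outputs: one matching (positive) output and
    N - 1 mismatched (negative) outputs.
    """
    # Numerically stable log-sum-exp over all candidates.
    max_score = max(scores)
    log_denominator = max_score + math.log(
        sum(math.exp(s - max_score) for s in scores)
    )
    # Negative log-probability of picking the positive among the candidates.
    return log_denominator - scores[positive_index]


def mutual_information_lower_bound(losses, num_candidates):
    """Standard InfoNCE bound: I(X; Y) >= log(N) - mean loss."""
    mean_loss = sum(losses) / len(losses)
    return math.log(num_candidates) - mean_loss


# Uninformative scores: the positive is indistinguishable from the negatives.
uninformative = info_nce_loss([0.0, 0.0, 0.0, 0.0])
# Informative scores: the positive is scored far above the negatives.
informative = info_nce_loss([10.0, 0.0, 0.0, 0.0])

print(mutual_information_lower_bound([uninformative], 4))
print(mutual_information_lower_bound([informative], 4))
```

Under these assumptions, a lower loss gives a tighter (larger) lower bound on mutual information, which is the sense in which minimizing a contrastive loss can be read as maximizing input-output correspondence. The bound cannot exceed log(N), so the number of negatives limits how much mutual information it can certify.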

Comments:	ICLR 2023. Project at this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2206.07771 [cs.CV]
	(or arXiv:2206.07771v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2206.07771

Submission history

From: Ye Zhu
[v1] Wed, 15 Jun 2022 19:13:49 UTC (1,811 KB)
[v2] Thu, 16 Feb 2023 18:00:31 UTC (3,180 KB)
