Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

Cheng, Ho Kei; Ishii, Masato; Hayakawa, Akio; Shibuya, Takashi; Schwing, Alexander; Mitsufuji, Yuki

arXiv:2412.15322 (cs)

[Submitted on 19 Dec 2024]

Abstract: We propose to synthesize high-quality and synchronized audio, given video and optional text conditions, using a novel multimodal joint training framework MMAudio. In contrast to single-modality training conditioned on (limited) video data only, MMAudio is jointly trained with larger-scale, readily available text-audio data to learn to generate semantically aligned high-quality audio samples. Additionally, we improve audio-visual synchrony with a conditional synchronization module that aligns video conditions with audio latents at the frame level. Trained with a flow matching objective, MMAudio achieves new video-to-audio state-of-the-art among public models in terms of audio quality, semantic alignment, and audio-visual synchronization, while having a low inference time (1.23s to generate an 8s clip) and just 157M parameters. MMAudio also achieves surprisingly competitive performance in text-to-audio generation, showing that joint training does not hinder single-modality performance. Code and demo are available at: this https URL

Comments:	Project page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2412.15322 [cs.CV]
	(or arXiv:2412.15322v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2412.15322

Submission history

From: Ho Kei Cheng
[v1] Thu, 19 Dec 2024 18:59:55 UTC (3,000 KB)
