Read, Watch and Scream! Sound Generation from Text and Video

Jeong, Yujin; Kim, Yunji; Chun, Sanghyuk; Lee, Jiyoung

Computer Science > Computer Vision and Pattern Recognition

arXiv:2407.05551 (cs)

[Submitted on 8 Jul 2024]


Abstract: Multimodal generative models have shown impressive advances with the help of powerful diffusion models. Despite the progress, generating sound solely from text poses challenges in ensuring comprehensive scene depiction and temporal alignment. Meanwhile, video-to-sound generation limits the flexibility to prioritize sound synthesis for specific objects within the scene. To tackle these challenges, we propose a novel video-and-text-to-sound generation method, called ReWaS, where video serves as a conditional control for a text-to-audio generation model. Our method estimates the structural information of audio (namely, energy) from the video while receiving key content cues from a user prompt. We employ a well-performing text-to-sound model to consolidate the video control, which is much more efficient for training multimodal diffusion models with massive triplet-paired (audio-video-text) data. In addition, by separating the generative components of audio, it becomes a more flexible system that allows users to freely adjust the energy, surrounding environment, and primary sound source according to their preferences. Experimental results demonstrate that our method shows superiority in terms of quality, controllability, and training efficiency. Our demo is available at this https URL
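
To illustrate what a per-frame audio "energy" signal can look like, here is a minimal sketch that computes frame-wise RMS energy from a waveform. The abstract does not define how ReWaS computes energy, so the RMS definition, the helper name `frame_energy`, and the `frame_length` and `hop_length` values are assumptions made only for this illustration; they are not the authors' method.

```python
import numpy as np


def frame_energy(waveform, frame_length=1024, hop_length=256):
    """Frame-wise RMS energy of a mono waveform.

    Hypothetical helper: RMS is one common energy definition and is
    assumed here; the abstract does not specify the one used by ReWaS.
    """
    waveform = np.asarray(waveform, dtype=np.float64)
    # Pad very short inputs so that at least one full frame exists.
    if waveform.size < frame_length:
        waveform = np.pad(waveform, (0, frame_length - waveform.size))
    n_frames = 1 + (waveform.size - frame_length) // hop_length
    starts = np.arange(n_frames) * hop_length
    frames = np.stack([waveform[s:s + frame_length] for s in starts])
    # Root mean square over the samples of each frame.
    return np.sqrt(np.mean(frames ** 2, axis=1))


# Toy signal: half a second of silence, then half a second of a 440 Hz tone.
sr = 16000
t = np.arange(sr // 2) / sr
signal = np.concatenate([np.zeros(sr // 2), 0.5 * np.sin(2 * np.pi * 440 * t)])
energy = frame_energy(signal)
print(energy.shape, energy[0], energy[-1])
```

On this toy signal the curve is zero over the silent half and rises once the tone starts. A one-dimensional temporal curve of this kind is the sort of structural signal that the abstract describes estimating from video, while the text prompt supplies the content cues.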

Comments:	Project page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2407.05551 [cs.CV]
	(or arXiv:2407.05551v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2407.05551

Submission history

From: Jiyoung Lee [view email]
[v1] Mon, 8 Jul 2024 01:59:17 UTC (8,749 KB)
