AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation

Haji-Ali, Moayed; Menapace, Willi; Siarohin, Aliaksandr; Skorokhodov, Ivan; Canberk, Alper; Lee, Kwot Sin; Ordonez, Vicente; Tulyakov, Sergey

arXiv:2412.15191 (cs)

[Submitted on 19 Dec 2024]

Abstract: We propose AV-Link, a unified framework for Video-to-Audio and Audio-to-Video generation that leverages the activations of frozen video and audio diffusion models for temporally-aligned cross-modal conditioning. The key to our framework is a Fusion Block that enables bidirectional information exchange between our backbone video and audio diffusion models through a temporally-aligned self-attention operation. Unlike prior work that uses feature extractors pretrained for other tasks for the conditioning signal, AV-Link can directly leverage features obtained by the complementary modality in a single framework, i.e., video features to generate audio, or audio features to generate video. We extensively evaluate our design choices and demonstrate the ability of our method to achieve synchronized and high-quality audiovisual content, showcasing its potential for applications in immersive media generation. Project Page: this http URL
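
To make the idea of a temporally-aligned self-attention over two modalities concrete, here is a minimal sketch in plain Python. It is not the authors' Fusion Block: the function names (`time_encoding`, `fused_self_attention`), the sinusoidal timestamp encoding, the identity query/key/value projections, and the `video_fps` / `audio_rate` parameters are all assumptions chosen for illustration, and short lists of numbers stand in for the diffusion-model activations mentioned in the abstract. The only point it illustrates is that video and audio tokens can be placed in one sequence, tagged with a position derived from their timestamp in seconds, and attended over jointly so that information flows in both directions.

```python
import math

def time_encoding(t, dim):
    """Sinusoidal encoding of a timestamp t in seconds (dim must be even).
    Assumption for illustration: the abstract does not specify the encoding."""
    enc = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        enc += [math.sin(t * freq), math.cos(t * freq)]
    return enc

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fused_self_attention(video_tokens, audio_tokens, video_fps, audio_rate):
    """Joint self-attention over video and audio tokens (hypothetical sketch).

    Every token receives a positional code computed from its timestamp, so
    tokens of either modality that occur at the same time share the same code.
    Queries, keys, and values are the tokens themselves (no learned weights).
    """
    dim = len(video_tokens[0])
    n_video = len(video_tokens)
    # Timestamps in seconds for each token, video first, then audio.
    times = [i / video_fps for i in range(n_video)]
    times += [j / audio_rate for j in range(len(audio_tokens))]
    tokens = video_tokens + audio_tokens
    # Add the shared time encoding to every token.
    x = [[v + p for v, p in zip(tok, time_encoding(t, dim))]
         for tok, t in zip(tokens, times)]
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in x:
        # Scaled dot-product scores of this token against the whole joint sequence.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, x)) for d in range(dim)])
    # Split the fused sequence back into its two modalities.
    return out[:n_video], out[n_video:]

# Toy usage: 2 video tokens at 2 tokens/s and 4 audio tokens at 4 tokens/s,
# both covering the same one-second window.
video = [[0.1 * i] * 4 for i in range(2)]
audio = [[0.2 * j] * 4 for j in range(4)]
video_out, audio_out = fused_self_attention(video, audio, video_fps=2.0, audio_rate=4.0)
```

In this toy setup the video token at 0.5 s and the audio token at index 2 (also 0.5 s) receive identical positional codes, which is one simple way to express temporal alignment across modalities with different token rates.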

Comments:	Project Page: this http URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2412.15191 [cs.CV]
	(or arXiv:2412.15191v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2412.15191

Submission history

From: Moayed Haji-Ali [view email]
[v1] Thu, 19 Dec 2024 18:57:21 UTC (20,143 KB)
