AV-Flow: Transforming Text to Audio-Visual Human-like Interactions

Chatziagapi, Aggelina; Morency, Louis-Philippe; Gong, Hongyu; Zollhoefer, Michael; Samaras, Dimitris; Richard, Alexander

arXiv:2502.13133 (cs)

[Submitted on 18 Feb 2025]

Abstract: We introduce AV-Flow, an audio-visual generative model that animates photo-realistic 4D talking avatars given only text input. In contrast to prior work that assumes an existing speech signal, we synthesize speech and vision jointly. We demonstrate human-like speech synthesis, synchronized lip motion, lively facial expressions, and head pose, all generated from just text characters. The core premise of our approach lies in the architecture of our two parallel diffusion transformers. Intermediate highway connections ensure communication between the audio and visual modalities, and thus synchronized speech intonation and facial dynamics (e.g., eyebrow motion). Our model is trained with flow matching, leading to expressive results and fast inference. In the case of dyadic conversations, AV-Flow produces an always-on avatar that actively listens and reacts to the audio-visual input of a user. Through extensive experiments, we show that our method outperforms prior work, synthesizing natural-looking 4D talking avatars. Project page: this https URL
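
The abstract states only that the model is trained with flow matching; it does not give the exact formulation. As a rough illustration of the general technique, the following minimal sketch assumes a straight-line probability path between a noise sample and a data sample, for which the regression target is a constant velocity. All function names here are hypothetical and chosen for illustration; this is not the authors' implementation, and the real model predicts velocities with diffusion transformers rather than the stand-in callable used below.

```python
# Minimal flow matching sketch in pure Python (illustrative assumption:
# a linear path x_t = (1 - t) * x0 + t * x1 between noise x0 and data x1).

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the linear path: the constant velocity x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at x_t."""
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)

def euler_sample(predict_velocity, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        v = predict_velocity(x, i * dt)
        x = [a + dt * b for a, b in zip(x, v)]
    return x

# Toy usage: an "oracle" that already knows the target velocity.
noise = [0.0, 1.0, -2.0]
data = [3.0, -1.0, 0.5]
oracle = lambda x, t: target_velocity(noise, data)
loss = flow_matching_loss(oracle, noise, data, t=0.3)
sample = euler_sample(oracle, noise, num_steps=4)
```

With a perfect velocity predictor, the loss is zero and a handful of Euler steps carries the noise sample to the data sample. Being able to sample in few integration steps is one common reason flow matching is associated with fast inference, though the abstract does not specify the sampler used.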

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2502.13133 [cs.CV]
	(or arXiv:2502.13133v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2502.13133

Submission history

From: Alexander Richard
[v1] Tue, 18 Feb 2025 18:56:18 UTC (44,752 KB)
