AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers

Guan, Jiazhi; Wang, Kaisiyuan; Xu, Zhiliang; Yang, Quanwei; Sun, Yasheng; He, Shengyi; Liang, Borong; Cao, Yukang; Li, Yingying; Feng, Haocheng; Ding, Errui; Wang, Jingdong; Zhao, Youjian; Zhou, Hang; Liu, Ziwei

arXiv:2503.19824 (cs)

[Submitted on 25 Mar 2025]

Abstract: Despite recent progress in audio-driven video generation, existing methods mostly focus on driving facial movements, which leads to incoherent head and body dynamics. Moving forward, it is desirable yet challenging to generate holistic human videos with both accurate lip-sync and delicate co-speech gestures with respect to the given audio. In this work, we propose AudCast, a generalized audio-driven human video generation framework that adopts a cascaded Diffusion-Transformers (DiTs) paradigm and synthesizes holistic human videos from a reference image and a given audio clip. 1) First, an audio-conditioned Holistic Human DiT architecture is proposed to directly drive the movements of any human body with vivid gesture dynamics. 2) Then, to enhance hand and face details, which are notoriously difficult to handle, a Regional Refinement DiT leverages regional 3D fitting as a bridge to reform the signals, producing the final results. Extensive experiments demonstrate that our framework generates high-fidelity audio-driven holistic human videos with temporal coherence and fine facial and hand details. Resources can be found at this https URL.

Comments:	Accepted to IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. Project page: this https URL
Subjects:	Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Cite as:	arXiv:2503.19824 [cs.GR]
	(or arXiv:2503.19824v1 [cs.GR] for this version)
	https://doi.org/10.48550/arXiv.2503.19824

Submission history

From: Jiazhi Guan
[v1] Tue, 25 Mar 2025 16:38:23 UTC (5,260 KB)
