StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation

Min, Dongchan; Song, Minyoung; Hwang, Sung Ju

arXiv:2208.10922v1 (cs)

[Submitted on 23 Aug 2022 (this version), latest version 15 Mar 2024 (v2)]

Abstract: We propose StyleTalker, a novel audio-driven talking head generation model that can synthesize a video of a talking person from a single reference image with accurately audio-synced lip shapes, realistic head poses, and eye blinks. Specifically, by leveraging a pretrained image generator and an image encoder, we estimate the latent codes of a talking head video that faithfully reflect the given audio. This is made possible by several newly devised components: 1) a contrastive lip-sync discriminator for accurate lip synchronization; 2) a conditional sequential variational autoencoder that learns a latent motion space disentangled from the lip movements, so that motions and lip movements can be manipulated independently while preserving identity; and 3) an auto-regressive prior augmented with normalizing flow to learn a complex, multi-modal audio-to-motion latent space. Equipped with these components, StyleTalker can generate talking head videos not only in a motion-controllable way, when another motion source video is given, but also in a completely audio-driven manner, by inferring realistic motions from the input audio. Through extensive experiments and user studies, we show that our model synthesizes talking head videos with impressive perceptual quality that are accurately lip-synced with the input audio, largely outperforming state-of-the-art baselines.
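
To make the idea behind a contrastive lip-sync objective more concrete, the following is a minimal sketch of a generic InfoNCE-style loss over paired audio and lip embeddings. The abstract does not specify the discriminator's architecture or loss, so everything here is an assumption for illustration: the function names `cosine_similarity` and `contrastive_sync_loss`, the `temperature` value, and the toy two-dimensional embeddings are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_sync_loss(audio_embs, lip_embs, temperature=0.1):
    """Mean InfoNCE loss, assuming audio_embs[i] is paired with lip_embs[i].

    For each audio embedding, the matching lip embedding is the positive
    and every other lip embedding in the batch is a negative.
    (Hypothetical helper; not the paper's formulation.)
    """
    n = len(audio_embs)
    total = 0.0
    for i in range(n):
        logits = [
            cosine_similarity(audio_embs[i], lip_embs[j]) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        # Negative log-probability of the true pair under a softmax.
        total += log_sum_exp - logits[i]
    return total / n


# Toy embeddings: in-sync pairs point the same way, shuffled pairs do not.
audio = [[1.0, 0.0], [0.0, 1.0]]
in_sync_lips = [[1.0, 0.0], [0.0, 1.0]]
out_of_sync_lips = [[0.0, 1.0], [1.0, 0.0]]

loss_in_sync = contrastive_sync_loss(audio, in_sync_lips)
loss_out_of_sync = contrastive_sync_loss(audio, out_of_sync_lips)
```

In this toy setup the loss is small when each audio embedding lines up with its own lip embedding and large when the pairs are shuffled, which is the signal such an objective would provide for penalizing poorly synchronized lips.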

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV)
Cite as:	arXiv:2208.10922 [cs.CV]
	(or arXiv:2208.10922v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2208.10922

Submission history

From: Dongchan Min
[v1] Tue, 23 Aug 2022 12:49:01 UTC (3,877 KB)
[v2] Fri, 15 Mar 2024 08:48:04 UTC (3,166 KB)
