Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis

Mehta, Shivam; Deichler, Anna; O'Regan, Jim; Moëll, Birger; Beskow, Jonas; Henter, Gustav Eje; Alexanderson, Simon

arXiv:2404.19622 (cs)

[Submitted on 30 Apr 2024]

Abstract: Although humans engaged in face-to-face conversation simultaneously communicate both verbally and non-verbally, methods for joint and unified synthesis of speech audio and co-speech 3D gesture motion from text are a new and emerging field. These technologies hold great promise for more human-like, efficient, expressive, and robust synthetic communication, but are currently held back by the lack of suitably large datasets, as existing methods are trained on parallel data from all constituent modalities. Inspired by student-teacher methods, we propose a straightforward solution to the data shortage, by simply synthesising additional training material. Specifically, we use unimodal synthesis models trained on large datasets to create multimodal (but synthetic) parallel training data, and then pre-train a joint synthesis model on that material. In addition, we propose a new synthesis architecture that adds better and more controllable prosody modelling to the state-of-the-art method in the field. Our results confirm that pre-training on large amounts of synthetic data improves the quality of both the speech and the motion synthesised by the multimodal model, with the proposed architecture yielding further benefits when pre-trained on the synthetic data. See this https URL for example output.
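
The data-generation idea in the abstract can be illustrated with a minimal sketch; it is not the authors' implementation. Every name here (`ParallelExample`, `build_synthetic_corpus`, `two_stage_schedule`, `toy_tts`, `toy_gesture`) is hypothetical, and the toy functions merely stand in for real unimodal synthesis models. Two points are assumptions that the abstract does not state: that the gesture model is driven by the synthesised audio, and that pre-training on synthetic data is followed by further training on real parallel data.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Sequence, Tuple

Audio = List[float]         # stand-in for a waveform or acoustic features
Motion = List[List[float]]  # stand-in for per-frame pose vectors


@dataclass
class ParallelExample:
    """One text/speech/gesture triple (hypothetical container)."""
    text: str
    audio: Audio
    motion: Motion


def build_synthetic_corpus(
    texts: Sequence[str],
    tts: Callable[[str], Audio],
    gesture: Callable[[Audio], Motion],
) -> List[ParallelExample]:
    """Chain two unimodal models to produce synthetic parallel data."""
    corpus = []
    for text in texts:
        audio = tts(text)        # unimodal speech synthesis
        motion = gesture(audio)  # unimodal gesture synthesis (assumed audio-driven)
        corpus.append(ParallelExample(text, audio, motion))
    return corpus


def two_stage_schedule(
    synthetic: Sequence[ParallelExample],
    real: Sequence[ParallelExample],
) -> Iterator[Tuple[str, ParallelExample]]:
    """Yield synthetic examples first, then real ones (assumed second stage)."""
    for example in synthetic:
        yield ("pretrain", example)
    for example in real:
        yield ("finetune", example)


# Toy stand-ins so the sketch runs without any trained models.
def toy_tts(text: str) -> Audio:
    return [float(len(word)) for word in text.split()]


def toy_gesture(audio: Audio) -> Motion:
    return [[value, -value] for value in audio]


if __name__ == "__main__":
    synthetic = build_synthetic_corpus(["hello there", "hi"], toy_tts, toy_gesture)
    real = [ParallelExample("ok", [2.0], [[2.0, -2.0]])]
    for stage, example in two_stage_schedule(synthetic, real):
        print(stage, example.text, len(example.motion))
```

In a real setting the two callables would wrap pretrained text-to-speech and gesture models, and the scheduled examples would be consumed by the joint model's training loop.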

Comments:	13+1 pages, 2 figures, accepted at the Human Motion Generation workshop (HuMoGen) at CVPR 2024
Subjects:	Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Sound (cs.SD); Audio and Speech Processing (eess.AS)
MSC classes:	68T07 (Primary), 68T42 (Secondary)
ACM classes:	I.2.7; I.2.6; H.5
Cite as:	arXiv:2404.19622 [cs.HC]
	(or arXiv:2404.19622v1 [cs.HC] for this version)
	https://doi.org/10.48550/arXiv.2404.19622

Submission history

From: Shivam Mehta [view email]
[v1] Tue, 30 Apr 2024 15:22:19 UTC (1,527 KB)
