Can we use Common Voice to train a Multi-Speaker TTS system?

Ogun, Sewade; Colotte, Vincent; Vincent, Emmanuel

arXiv:2210.06370 (eess)

[Submitted on 12 Oct 2022]

Abstract: Training of multi-speaker text-to-speech (TTS) systems relies on curated datasets based on high-quality recordings or audiobooks. Such datasets often lack speaker diversity and are expensive to collect. As an alternative, recent studies have leveraged the availability of large, crowdsourced automatic speech recognition (ASR) datasets. A major problem with such datasets is the presence of noisy and/or distorted samples, which degrade TTS quality. In this paper, we propose to automatically select high-quality training samples using a non-intrusive mean opinion score (MOS) estimator, WV-MOS. We show the viability of this approach for training a multi-speaker GlowTTS model on the Common Voice English dataset. Our approach improves the overall quality of generated utterances by 1.26 MOS points with respect to training on all the samples and by 0.35 MOS points with respect to training on the LibriTTS dataset. This opens the door to automatic TTS dataset curation for a wider range of languages.
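
The selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Sample` record, the `select_high_quality` function, the `predict_mos` callable, and the threshold value are all hypothetical names introduced here, and the abstract does not say whether filtering is applied per sample or per speaker. The sketch assumes per-sample filtering against a fixed cutoff, with the MOS estimator (WV-MOS in the paper) passed in as a plain function so that no particular library API is assumed.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Sample:
    """One training utterance (hypothetical record layout)."""
    path: str        # location of the audio clip
    speaker_id: str  # speaker label used by the multi-speaker model
    text: str        # transcript of the clip


def select_high_quality(
    samples: Iterable[Sample],
    predict_mos: Callable[[str], float],
    threshold: float,
) -> List[Sample]:
    """Keep samples whose predicted MOS is at least `threshold`.

    `predict_mos` stands in for a non-intrusive MOS estimator: it takes
    an audio path and returns a quality score. The cutoff is an
    assumption, not a value taken from the paper.
    """
    return [s for s in samples if predict_mos(s.path) >= threshold]


if __name__ == "__main__":
    # Toy scores in place of a real estimator, for illustration only.
    fake_scores = {"a.wav": 4.2, "b.wav": 2.1, "c.wav": 3.5}
    data = [
        Sample("a.wav", "spk1", "hello"),
        Sample("b.wav", "spk2", "noisy clip"),
        Sample("c.wav", "spk1", "goodbye"),
    ]
    kept = select_high_quality(data, fake_scores.__getitem__, threshold=3.5)
    print([s.path for s in kept])
```

Passing the estimator as a callable keeps the filtering logic independent of any specific MOS predictor, so the same sketch applies to other crowdsourced corpora or other quality estimators.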

Comments:	To appear in Proc. SLT 2022, Jan 09-12, 2023, Doha, Qatar
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2210.06370 [eess.AS]
	(or arXiv:2210.06370v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2210.06370

Submission history

From: Sewade Ogun
[v1] Wed, 12 Oct 2022 16:20:54 UTC (62 KB)
