WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark

Ma, Linhan; Guo, Dake; Song, Kun; Jiang, Yuepeng; Wang, Shuai; Xue, Liumeng; Xu, Weiming; Zhao, Huan; Zhang, Binbin; Xie, Lei

arXiv:2406.05763 (eess)

[Submitted on 9 Jun 2024 (v1), last revised 19 Jun 2024 (this version, v3)]

Abstract: With the development of large text-to-speech (TTS) models and the scale-up of training data, state-of-the-art TTS systems have achieved impressive performance. In this paper, we present WenetSpeech4TTS, a multi-domain Mandarin corpus derived from the open-sourced WenetSpeech dataset. To tailor it for text-to-speech tasks, we refined WenetSpeech by adjusting segment boundaries, enhancing the audio quality, and eliminating speaker mixing within each segment. After a more accurate transcription process and quality-based data filtering, the resulting WenetSpeech4TTS corpus contains 12,800 hours of paired audio-text data. Furthermore, we have created subsets of varying sizes, categorized by segment quality scores, to allow for TTS model training and fine-tuning. VALL-E and NaturalSpeech 2 systems are trained and fine-tuned on these subsets to validate the usability of WenetSpeech4TTS, establishing baselines on the benchmark for fair comparison of TTS systems. The corpus and corresponding benchmarks are publicly available on Hugging Face.

Comments:	Accepted by INTERSPEECH 2024
Subjects:	Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2406.05763 [eess.AS]
	(or arXiv:2406.05763v3 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2406.05763

Submission history

From: Dake Guo
[v1] Sun, 9 Jun 2024 12:32:42 UTC (744 KB)
[v2] Tue, 11 Jun 2024 15:54:33 UTC (744 KB)
[v3] Wed, 19 Jun 2024 04:52:56 UTC (744 KB)
