A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech

Chen, Li-Wei; Watanabe, Shinji; Rudnicky, Alexander

arXiv:2302.04215 (eess)

[Submitted on 8 Feb 2023]

Abstract: Recent Text-to-Speech (TTS) systems trained on read-speech or acted corpora have achieved near human-level naturalness. The diversity of human speech, however, often goes beyond the coverage of these corpora. We believe the ability to handle such diversity is crucial for AI systems to achieve human-level communication. Our work explores the use of more abundant real-world data for building speech synthesizers. We train TTS systems using real-world speech from YouTube and podcasts. We observe a mismatch between training and inference alignments in mel-spectrogram-based autoregressive models, which leads to unintelligible synthesis, and demonstrate that learned discrete codes within multiple code groups effectively resolve this issue. We introduce our MQTTS system, whose architecture is designed for multiple code generation and monotonic alignment, along with the use of a clean silence prompt to improve synthesis quality. We conduct ablation analyses to assess the efficacy of our methods. We show that MQTTS outperforms existing TTS systems on several objective and subjective measures.
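As a rough illustration of what "discrete codes within multiple code groups" can mean, the sketch below splits one feature frame into equal chunks and quantizes each chunk against its own codebook by nearest-neighbour search. This is a generic group-wise vector quantization example, not the MQTTS quantizer: the abstract does not describe the actual architecture, and the codebooks here are fixed toy values rather than learned ones. All names (`nearest_code`, `quantize_groups`, `dequantize_groups`) and all numbers are hypothetical.

```python
# Illustrative sketch only: group-wise nearest-neighbour vector quantization.
# Codebooks, sizes, and function names are assumptions, not taken from the paper.
from typing import List, Sequence

Vector = Sequence[float]


def nearest_code(chunk: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to chunk (squared L2)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(chunk, codebook[i]))


def quantize_groups(frame: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Split frame into one equal-sized chunk per codebook and quantize each chunk."""
    n_groups = len(codebooks)
    if n_groups == 0 or len(frame) % n_groups != 0:
        raise ValueError("frame length must be divisible by the number of groups")
    size = len(frame) // n_groups
    return [
        nearest_code(frame[g * size:(g + 1) * size], codebooks[g])
        for g in range(n_groups)
    ]


def dequantize_groups(codes: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> List[float]:
    """Rebuild an approximate frame by concatenating the selected code vectors."""
    out: List[float] = []
    for g, idx in enumerate(codes):
        out.extend(codebooks[g][idx])
    return out


# Toy setup: 2 groups, each with a 2-entry codebook of 2-dimensional vectors.
codebooks = [
    [(0.0, 0.0), (1.0, 1.0)],  # group 0
    [(0.0, 1.0), (1.0, 0.0)],  # group 1
]
frame = [0.9, 1.2, 0.8, 0.1]
codes = quantize_groups(frame, codebooks)
reconstruction = dequantize_groups(codes, codebooks)
```

With G groups of K entries each, a frame is represented by G small indices, yet the number of distinct reconstructions is K^G. A sequence model working on such codes would then need to emit several codes per time step, which may be what "multiple code generation" in the abstract refers to.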

Comments:	Accepted to AAAI 2023
Subjects:	Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)
Cite as:	arXiv:2302.04215 [eess.AS]
	(or arXiv:2302.04215v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2302.04215

Submission history

From: Li-Wei Chen
[v1] Wed, 8 Feb 2023 17:34:32 UTC (7,860 KB)
