Towards Robust FastSpeech 2 by Modelling Residual Multimodality

Kögel, Fabian; Nguyen, Bac; Cardinaux, Fabien

doi:10.21437/Interspeech.2023-879

arXiv:2306.01442 (cs)

[Submitted on 2 Jun 2023]

Abstract: State-of-the-art non-autoregressive text-to-speech (TTS) models based on FastSpeech 2 can efficiently synthesise high-fidelity and natural speech. For expressive speech datasets, however, we observe characteristic audio distortions. We demonstrate that such artefacts are introduced into the vocoder reconstruction by over-smooth mel-spectrogram predictions, which are induced by the choice of mean-squared-error (MSE) loss for training the mel-spectrogram decoder. With MSE loss, FastSpeech 2 is limited to learning conditional averages of the training distribution, which may not lie close to any natural sample if the distribution remains multimodal after all conditioning signals are taken into account. To alleviate this problem, we introduce TVC-GMM, a mixture model of Trivariate-Chain Gaussian distributions, to model the residual multimodality. TVC-GMM reduces spectrogram smoothness and improves perceptual audio quality, in particular for expressive datasets, as shown by both objective and subjective evaluation.
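The claim that MSE training yields conditional averages which may fall far from any natural sample can be illustrated with a toy example. The sketch below is not the paper's TVC-GMM and uses no spectrogram data: it assumes a one-dimensional target with two equally likely modes at -2 and +2 (made-up values), computes the MSE-optimal constant prediction, and fits a plain univariate two-component Gaussian mixture with EM for comparison. All function names are hypothetical and chosen for illustration only.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def sample_bimodal(n, rng):
    """Toy target: two equally likely modes at -2 and +2 (assumed values)."""
    return [rng.gauss(-2.0 if rng.random() < 0.5 else 2.0, 0.3) for _ in range(n)]


def fit_gmm_1d(data, means, stds, weights, steps=50):
    """Plain EM for a univariate Gaussian mixture (illustrative, not TVC-GMM)."""
    means, stds, weights = list(means), list(stds), list(weights)
    k = len(means)
    for _ in range(steps):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in data:
            p = [weights[j] * gaussian_pdf(x, means[j], stds[j]) for j in range(k)]
            total = sum(p) or 1e-300
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means and standard deviations.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            stds[j] = max(math.sqrt(var), 1e-3)
            weights[j] = nj / len(data)
    return means, stds, weights


rng = random.Random(0)
data = sample_bimodal(2000, rng)

# The constant that minimises mean squared error is the sample mean.
mse_prediction = sum(data) / len(data)

means, stds, weights = fit_gmm_1d(
    data, means=[-1.0, 1.0], stds=[1.0, 1.0], weights=[0.5, 0.5]
)


def mixture_density(x):
    """Density of the fitted mixture at x."""
    return sum(w * gaussian_pdf(x, m, s) for m, s, w in zip(means, stds, weights))


print("MSE-optimal prediction:", round(mse_prediction, 3))
print("fitted modes:", [round(m, 2) for m in sorted(means)])
print("density at MSE prediction:", mixture_density(mse_prediction))
print("density at a fitted mode:", mixture_density(means[0]))
```

In this toy setting the MSE-optimal prediction lands between the two modes, where the fitted mixture assigns almost no density, whereas sampling from the mixture would return values near one of the modes. This is a simplified analogue of the over-smoothing argument in the abstract, under the stated assumptions.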

Comments:	Accepted at INTERSPEECH 2023
Subjects:	Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2306.01442 [cs.SD]
	(or arXiv:2306.01442v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2306.01442
Related DOI:	https://doi.org/10.21437/Interspeech.2023-879

Submission history

From: Fabian Kögel
[v1] Fri, 2 Jun 2023 11:03:26 UTC (477 KB)
