Autoregressive Speech Synthesis with Next-Distribution Prediction

Zhu, Xinfa; Tian, Wenjie; Xie, Lei

arXiv:2412.16846 (eess)

[Submitted on 22 Dec 2024]

Abstract: We introduce KALL-E, a novel autoregressive (AR) language modeling approach with next-distribution prediction for text-to-speech (TTS) synthesis. Unlike existing methods, KALL-E directly models and predicts the continuous speech distribution conditioned on text without relying on VAE- or diffusion-based components. Specifically, we use WaveVAE to extract continuous speech distributions from waveforms instead of using discrete speech tokens. A single AR language model predicts these continuous speech distributions from text, with a Kullback-Leibler divergence loss as the constraint. Experimental results show that KALL-E outperforms open-source implementations of YourTTS, VALL-E, NaturalSpeech 2, and CosyVoice in terms of naturalness and speaker similarity in zero-shot TTS scenarios. Moreover, KALL-E demonstrates exceptional zero-shot capabilities in emotion and accent cloning. Importantly, KALL-E presents a more straightforward and effective paradigm for using continuous speech representations in TTS. Audio samples are available at: this https URL.
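
As an illustration of what a Kullback-Leibler constraint on predicted distributions can look like, the sketch below computes a per-frame KL loss between target and predicted distributions. It is not the authors' implementation. It assumes each frame's speech distribution is a diagonal Gaussian given by a mean and log-variance vector, and it assumes the divergence is taken from the target to the prediction; the abstract specifies neither. The names gaussian_kl and sequence_kl_loss are hypothetical.

```python
import math


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(N(mu_q, var_q) || N(mu_p, var_p)) for diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        # Closed form per dimension:
        # 0.5 * (log(var_p / var_q) + (var_q + (mu_q - mu_p)^2) / var_p - 1)
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def sequence_kl_loss(target_frames, predicted_frames):
    """Mean per-frame KL from target to predicted distributions (direction is an assumption)."""
    losses = [
        gaussian_kl(t_mu, t_logvar, p_mu, p_logvar)
        for (t_mu, t_logvar), (p_mu, p_logvar) in zip(target_frames, predicted_frames)
    ]
    return sum(losses) / len(losses)


# Toy data: two frames, two latent dimensions, unit variance (log-variance 0).
target = [([0.0, 1.0], [0.0, 0.0]), ([0.5, -0.5], [0.0, 0.0])]
predicted = [([0.0, 1.0], [0.0, 0.0]), ([1.5, -0.5], [0.0, 0.0])]

loss = sequence_kl_loss(target, predicted)
print(loss)
```

In this toy case the first frame matches exactly and contributes zero, while the second frame differs only by a mean shift of 1.0 in one dimension, so the loss depends only on that shift.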

Comments:	Technical report, work in progress
Subjects:	Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Cite as:	arXiv:2412.16846 [eess.AS]
	(or arXiv:2412.16846v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2412.16846

Submission history

From: Xinfa Zhu
[v1] Sun, 22 Dec 2024 04:03:24 UTC (120 KB)
