Fine-grained Emotional Control of Text-To-Speech: Learning To Rank Inter- And Intra-Class Emotion Intensities

Wang, Shijun; Guðnason, Jón; Borth, Damian

arXiv:2303.01508 (cs)

[Submitted on 2 Mar 2023 (v1), last revised 11 Mar 2023 (this version, v2)]

Abstract: State-of-the-art Text-To-Speech (TTS) models are capable of producing high-quality speech. The generated speech, however, is usually neutral in emotional expression, whereas one often wants fine-grained emotional control at the level of words or phonemes. Although this remains challenging, the first TTS models that control voice through manually assigned emotion intensity have recently been proposed. Unfortunately, because they neglect intra-class distance, the intensity differences are often unrecognizable. In this paper, we propose a fine-grained controllable emotional TTS model that considers both inter- and intra-class distances and is able to synthesize speech with recognizable intensity differences. Our subjective and objective experiments demonstrate that our model exceeds two state-of-the-art controllable TTS models in controllability, emotion expressiveness and naturalness.

Comments:	Accepted by ICASSP2023
Subjects:	Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2303.01508 [cs.SD]
	(or arXiv:2303.01508v2 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2303.01508

Submission history

From: Shijun Wang
[v1] Thu, 2 Mar 2023 09:09:03 UTC (906 KB)
[v2] Sat, 11 Mar 2023 13:07:06 UTC (906 KB)
