SegINR: Segment-wise Implicit Neural Representation for Sequence Alignment in Neural Text-to-Speech

Kim, Minchan; Jeong, Myeonghun; Lee, Joun Yeop; Kim, Nam Soo

Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2410.04690 (eess)

[Submitted on 7 Oct 2024]

Abstract: We present SegINR, a novel approach to neural Text-to-Speech (TTS) that addresses sequence alignment without relying on an auxiliary duration predictor or on complex autoregressive (AR) or non-autoregressive (NAR) frame-level sequence modeling. SegINR simplifies the process by converting text sequences directly into frame-level features. It leverages an optimal text encoder to extract embeddings and transforms each embedding into a segment of frame-level features using a conditional implicit neural representation (INR). This method, named segment-wise INR (SegINR), models the temporal dynamics within each segment and autonomously defines segment boundaries, reducing computational costs. We integrate SegINR into a two-stage TTS framework, using it for semantic token prediction. Our experiments in zero-shot adaptive TTS scenarios demonstrate that SegINR outperforms conventional methods in speech quality while remaining computationally efficient.
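The segment-wise decoding described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how segment boundaries are determined, so the per-frame stop probability used here is an assumption, as are the sin/cos encoding of the frame index, the layer sizes, the untrained random weights, and every function name (`make_inr`, `decode_segment`, `seginr_decode`). The sketch only shows the overall shape of the idea: each text embedding conditions a small network that is queried at frame indices 0, 1, 2, ... until it signals the end of its segment, and the segments are concatenated without a separate duration predictor.

```python
import numpy as np


def make_inr(embed_dim, feat_dim, hidden=32, seed=0):
    """Build a toy conditional INR with random, untrained weights (hypothetical)."""
    rng = np.random.default_rng(seed)
    w1 = rng.normal(scale=0.5, size=(embed_dim + 2, hidden))
    w_feat = rng.normal(scale=0.5, size=(hidden, feat_dim))
    w_stop = rng.normal(scale=0.5, size=(hidden,))

    def inr(embedding, t):
        # Assumption: the frame index t is encoded as a single sin/cos pair.
        coord = np.array([np.sin(0.5 * t), np.cos(0.5 * t)])
        h = np.tanh(np.concatenate([embedding, coord]) @ w1)
        frame = h @ w_feat                                # frame-level feature
        stop_prob = 1.0 / (1.0 + np.exp(-(h @ w_stop)))   # assumed boundary signal
        return frame, stop_prob

    return inr


def decode_segment(inr, embedding, max_len=8, threshold=0.5):
    """Query the INR at t = 0, 1, ... until it signals a boundary or max_len is hit."""
    frames = []
    for t in range(max_len):
        frame, stop_prob = inr(embedding, t)
        frames.append(frame)
        if stop_prob > threshold:
            break
    return np.stack(frames)


def seginr_decode(inr, embeddings, max_len=8):
    """Decode every text embedding into its own segment and concatenate them."""
    segments = [decode_segment(inr, e, max_len) for e in embeddings]
    lengths = [len(s) for s in segments]
    return np.concatenate(segments, axis=0), lengths


# Usage: five random "text embeddings" of size 16, four-dimensional frame features.
embeddings = np.random.default_rng(1).normal(size=(5, 16))
inr = make_inr(embed_dim=16, feat_dim=4)
features, lengths = seginr_decode(inr, embeddings)
print(features.shape, lengths)
```

Because each segment is decoded independently of the others, the per-embedding calls could in principle run in parallel; the sequential loop above is kept only for readability.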

Comments:	This work has been submitted to the IEEE for possible publication
Subjects:	Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
Cite as:	arXiv:2410.04690 [eess.AS]
	(or arXiv:2410.04690v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2410.04690

Submission history

From: Minchan Kim
[v1] Mon, 7 Oct 2024 02:04:58 UTC (722 KB)
