Leveraging Unpaired Text Data for Training End-to-End Speech-to-Intent Systems

Huang, Yinghui; Kuo, Hong-Kwang; Thomas, Samuel; Kons, Zvi; Audhkhasi, Kartik; Kingsbury, Brian; Hoory, Ron; Picheny, Michael

arXiv:2010.04284 (cs)

[Submitted on 8 Oct 2020]

Abstract: Training an end-to-end (E2E) neural network speech-to-intent (S2I) system that directly extracts intents from speech requires large amounts of intent-labeled speech data, which is time-consuming and expensive to collect. Initializing the S2I model with an ASR model trained on copious speech data can alleviate data sparsity. In this paper, we attempt to leverage NLU text resources. We implemented a CTC-based S2I system that matches the performance of a state-of-the-art, traditional cascaded SLU system. We performed controlled experiments with varying amounts of speech and text training data. When only a tenth of the original data is available, intent classification accuracy degrades by 7.6% absolute. Assuming we have additional text-to-intent data (without speech) available, we investigated two techniques to improve the S2I system: (1) transfer learning, in which acoustic embeddings for intent classification are tied to fine-tuned BERT text embeddings; and (2) data augmentation, in which the text-to-intent data is converted into speech-to-intent data using a multi-speaker text-to-speech system. The proposed approaches recover 80% of the performance lost due to using limited intent-labeled speech.
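
As a rough illustration of technique (1), the sketch below combines an intent classification loss with a penalty that pulls an utterance-level acoustic embedding toward a text embedding of the same utterance. The abstract does not state how the embeddings are tied, so the mean-squared-error penalty, the `tie_weight` parameter, and all function names here are assumptions made for illustration, not the authors' implementation. Plain Python lists stand in for the outputs of a speech encoder and a fine-tuned BERT model.

```python
import math


def softmax(logits):
    """Convert raw intent scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability of the correct intent class."""
    return -math.log(softmax(logits)[label])


def mse(a, b):
    """Mean squared error between two equal-length embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def tied_loss(acoustic_emb, text_emb, intent_logits, intent_label, tie_weight=1.0):
    """Hypothetical joint objective: intent loss plus an embedding-tying penalty.

    acoustic_emb  : embedding from the speech branch (assumed input)
    text_emb      : embedding of the transcript from a text model (assumed input)
    intent_logits : unnormalized intent scores from the speech branch
    tie_weight    : assumed hyperparameter balancing the two terms
    """
    return cross_entropy(intent_logits, intent_label) + tie_weight * mse(acoustic_emb, text_emb)


# Toy usage: two intents, two-dimensional embeddings.
loss = tied_loss([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], 0, tie_weight=0.5)
print(round(loss, 4))
```

When the acoustic and text embeddings coincide, the penalty vanishes and only the intent loss remains, which is the sense in which the two embedding spaces are "tied" in this sketch.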

Comments:	5 pages, published in ICASSP 2020
Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
ACM classes:	I.2.7
Cite as:	arXiv:2010.04284 [cs.CL]
	(or arXiv:2010.04284v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2010.04284

Submission history

From: Hong-Kwang Kuo
[v1] Thu, 8 Oct 2020 22:16:26 UTC (157 KB)
