Generalizing RNN-Transducer to Out-Domain Audio via Sparse Self-Attention Layers

Kim, Juntae; Lee, Jeehye

arXiv:2108.10752 (eess)

[Submitted on 22 Aug 2021 (v1), last revised 17 Jun 2022 (this version, v2)]

Abstract: The recurrent neural network transducer (RNN-T) is an end-to-end speech recognition framework that converts input acoustic frames into a character sequence. The state-of-the-art encoder network for RNN-T is the Conformer, which effectively models local and global context through its convolution and self-attention layers. Although the Conformer RNN-T has shown outstanding performance, most studies have verified it only in settings where the training and test data are drawn from the same domain. The domain mismatch problem for the Conformer RNN-T, an important issue for product-level speech recognition systems, has not yet been intensively investigated. In this study, we identify that the fully connected self-attention layers in the Conformer cause high deletion errors, specifically on long-form out-domain utterances. To address this problem, we introduce sparse self-attention layers for Conformer-based encoder networks, which exploit local and generalized global information by pruning most of the in-domain fitted global connections. We also propose a state reset method that generalizes the prediction network to cope with long-form utterances. Applying the proposed methods to an out-domain test, we obtained a 27.6% relative character error rate (CER) reduction compared to Conformers based on fully connected self-attention layers.
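
To make the idea of pruning global connections concrete, the following is a minimal sketch of a sparse self-attention mask. The abstract does not specify the sparsity pattern, so everything here is an assumption for illustration: each frame keeps a local band of neighbors plus a small set of strided "anchor" frames as its only global connections. The names `sparse_attention_mask`, `masked_softmax`, `local_radius`, and `global_stride` are hypothetical and are not taken from the paper.

```python
import math


def sparse_attention_mask(num_frames, local_radius, global_stride):
    """Return mask[i][j] == True where frame i may attend to frame j.

    Assumed pattern (illustrative only, not the authors' method):
    a local band of +/- local_radius frames, plus every
    global_stride-th frame kept as a shared global anchor.
    """
    mask = []
    for i in range(num_frames):
        row = []
        for j in range(num_frames):
            is_local = abs(i - j) <= local_radius
            is_global = j % global_stride == 0
            row.append(is_local or is_global)
        mask.append(row)
    return mask


def masked_softmax(scores, allowed):
    """Softmax over one row of attention scores; pruned positions get weight 0."""
    kept = [s for s, a in zip(scores, allowed) if a]
    peak = max(kept)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) if a else 0.0 for s, a in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


def density(mask):
    """Fraction of query-key pairs that remain connected after pruning."""
    n = len(mask)
    return sum(sum(row) for row in mask) / (n * n)
```

With 8 frames, `local_radius=1`, and `global_stride=4`, frame 5 attends only to frames 0, 4, 5, and 6. A fully connected layer corresponds to a density of 1.0, and the number of connections kept per frame in this assumed pattern grows much more slowly with utterance length than in the fully connected case.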

Comments:	To be published in INTERSPEECH 2022
Subjects:	Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Cite as:	arXiv:2108.10752 [eess.AS]
	(or arXiv:2108.10752v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2108.10752

Submission history

From: Juntae Kim
[v1] Sun, 22 Aug 2021 08:06:15 UTC (187 KB)
[v2] Fri, 17 Jun 2022 12:14:13 UTC (1,782 KB)
