Cascaded encoders for fine-tuning ASR models on overlapped speech

Rose, Richard; Chang, Oscar; Siohan, Olivier

arXiv:2306.16398 (cs)

[Submitted on 28 Jun 2023]

Abstract: Multi-talker speech recognition (MT-ASR) has been shown to improve ASR performance on speech containing overlapping utterances from more than one speaker. Multi-talker models have typically been trained from scratch using simulated or actual overlapping speech datasets. On the other hand, the trend in ASR has been to train foundation models using massive datasets collected from a wide variety of task domains. Given the scale of these models and their ability to generalize well across a variety of domains, it makes sense to consider scenarios where a foundation model is augmented with multi-talker capability. This paper presents an MT-ASR model formed by combining a well-trained foundation model with a multi-talker mask model in a cascaded RNN-T encoder configuration. Experimental results show that the cascade configuration provides improved WER on overlapping speech utterances with respect to a baseline multi-talker model without sacrificing performance achievable by the foundation model on non-overlapping utterances.

Subjects:	Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2306.16398 [cs.SD]
	(or arXiv:2306.16398v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2306.16398

Submission history

From: Richard Rose
[v1] Wed, 28 Jun 2023 17:44:30 UTC (818 KB)
