Transducer-Llama: Integrating LLMs into Streamable Transducer-based Speech Recognition

Deng, Keqi; Guo, Jinxi; Ma, Yingyi; Moritz, Niko; Woodland, Philip C.; Kalinli, Ozlem; Seltzer, Mike

arXiv:2412.16464 (cs)

[Submitted on 21 Dec 2024]

Abstract: While large language models (LLMs) have been applied to automatic speech recognition (ASR), making the model streamable remains a challenge. This paper proposes a novel model architecture, Transducer-Llama, that integrates LLMs into a Factorized Transducer (FT) model, naturally enabling streaming capabilities. Furthermore, given that the large vocabulary of LLMs can cause data sparsity issues and increased training costs for spoken language systems, this paper introduces an efficient vocabulary adaptation technique to align LLMs with speech system vocabularies. The results show that directly optimizing the FT model with a strong pre-trained LLM-based predictor using the RNN-T loss yields only limited improvements over a smaller pre-trained LM predictor. Therefore, this paper proposes a weak-to-strong LM swap strategy, using a weak LM predictor during RNN-T loss training and then replacing it with a strong LLM. After LM replacement, the minimum word error rate (MWER) loss is employed to finetune the integration of the LLM predictor with the Transducer-Llama model. Experiments on the LibriSpeech and large-scale multi-lingual LibriSpeech corpora show that the proposed streaming Transducer-Llama approach gives a 17% relative WER reduction (WERR) over a strong FT baseline and a 32% WERR over an RNN-T baseline.
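
The abstract names two quantities that are easy to make concrete: the MWER loss used for finetuning and the relative WER reduction (WERR) used for reporting. The following is a minimal sketch of the commonly used N-best formulation of MWER (expected word errors under renormalized hypothesis probabilities, with the mean error subtracted as a baseline); the paper's exact formulation may differ. The function names `edit_distance`, `mwer_loss`, and `relative_wer_reduction` are hypothetical, and all numbers in the usage note are illustrative, not results from the paper.

```python
import math

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (r != h)))    # substitution or match
        prev = curr
    return prev[-1]

def mwer_loss(nbest_log_probs, nbest_hyps, reference):
    """Expected word errors over an N-best list, relative to the mean error.

    Assumption: probabilities are renormalized over the N-best list only,
    as in the usual N-best approximation of MWER.
    """
    # Softmax over the N-best log-probabilities (shifted for stability).
    m = max(nbest_log_probs)
    weights = [math.exp(lp - m) for lp in nbest_log_probs]
    total = sum(weights)
    probs = [w / total for w in weights]

    ref = reference.split()
    errors = [edit_distance(ref, h.split()) for h in nbest_hyps]
    mean_err = sum(errors) / len(errors)
    return sum(p * (e - mean_err) for p, e in zip(probs, errors))

def relative_wer_reduction(baseline_wer, new_wer):
    """WERR = (baseline - new) / baseline, as a fraction."""
    return (baseline_wer - new_wer) / baseline_wer
```

As a usage illustration with made-up values, `relative_wer_reduction(10.0, 8.3)` is approximately 0.17, which is the kind of figure a 17% WERR denotes. The loss in `mwer_loss` becomes negative when the model places most of its probability on hypotheses with fewer errors than the N-best average, which is the behavior the finetuning step rewards.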

Comments:	Accepted by ICASSP 2025
Subjects:	Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2412.16464 [cs.CL]
	(or arXiv:2412.16464v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2412.16464

Submission history

From: Keqi Deng
[v1] Sat, 21 Dec 2024 03:35:49 UTC (375 KB)
