Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition

Jain, Yash; Chan, David; Dheram, Pranav; Khare, Aparna; Shonibare, Olabanji; Ravichandran, Venkatesh; Ghosh, Shalini

Computer Science > Computation and Language

arXiv:2403.19822 (cs)

[Submitted on 28 Mar 2024]

Title: Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition

Authors: Yash Jain, David Chan, Pranav Dheram, Aparna Khare, Olabanji Shonibare, Venkatesh Ravichandran, Shalini Ghosh


Abstract: Recent advances in machine learning have demonstrated that multi-modal pre-training can improve automatic speech recognition (ASR) performance compared to randomly initialized models, even when models are fine-tuned on uni-modal tasks. Existing multi-modal pre-training methods for the ASR task have primarily focused on single-stage pre-training where a single unsupervised task is used for pre-training followed by fine-tuning on the downstream task. In this work, we introduce a novel method combining multi-modal and multi-task unsupervised pre-training with a translation-based supervised mid-training approach. We empirically demonstrate that such a multi-stage approach leads to relative word error rate (WER) improvements of up to 38.45% over baselines on both Librispeech and SUPERB. Additionally, we share several important findings for choosing pre-training methods and datasets.
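
The abstract reports results as a relative WER improvement. As a reading aid, the sketch below shows how that quantity is commonly computed, assuming the standard definitions: WER as word-level edit distance divided by reference length, and relative improvement as the drop in WER divided by the baseline WER. The function names and all sample values are illustrative assumptions, not taken from the paper, and the paper's own evaluation code may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Percentage reduction in WER relative to the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical values for illustration only (not results from the paper).
sample_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
sample_gain = relative_wer_improvement(baseline_wer=10.0, new_wer=8.0)
```

Under this definition, a relative improvement describes the fraction of the baseline's errors that were removed, so it is not the same as the absolute difference between two WER values.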

Comments:	Accepted in LREC-COLING 2024 - The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2403.19822 [cs.CL]
	(or arXiv:2403.19822v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2403.19822

Submission history

From: Yash Jain
[v1] Thu, 28 Mar 2024 20:23:39 UTC (1,354 KB)
