MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization

Fernandez-Lopez, Adriana; Chen, Honglie; Ma, Pingchuan; Yin, Lu; Xiao, Qiao; Petridis, Stavros; Liu, Shiwei; Pantic, Maja

arXiv:2406.17614 (cs)

[Submitted on 25 Jun 2024]

Abstract: Pre-trained models have been a foundational approach in speech recognition, albeit at additional cost. In this study, we propose a regularization technique that facilitates the training of visual and audio-visual speech recognition models (VSR and AVSR) from scratch. This approach, abbreviated as MSRS (Multimodal Speech Recognition from Scratch), introduces a sparse regularization that rapidly learns sparse structures within the dense model at the very beginning of training; these sparse structures receive healthier gradient flow than the dense equivalent. Once the sparse mask stabilizes, our method allows either transitioning to a dense model or keeping a sparse model by updating non-zero values. MSRS achieves competitive results in VSR and AVSR, with 21.1% and 0.9% WER respectively on the LRS3 benchmark, while reducing training time by at least 2x. We explore other sparse approaches and show that only MSRS enables training from scratch by implicitly masking the weights affected by vanishing gradients.
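
For readers unfamiliar with the metric, WER (word error rate) is the word-level edit distance between a hypothesis and its reference transcript, divided by the number of reference words. The following is a minimal sketch of that definition, not the evaluation script used in the paper: it assumes plain whitespace tokenization with no text normalization, and the function name `wer` is illustrative.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
    print(wer("the cat sat on the mat", "the cat sit on mat"))
```

Under this definition, a 0.9% WER corresponds to roughly one word error per 111 reference words, and 21.1% to roughly one error per 5 words.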

Comments:	Accepted at Interspeech 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Cite as:	arXiv:2406.17614 [cs.CV]
	(or arXiv:2406.17614v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2406.17614

Submission history

From: Adriana Fernandez Lopez
[v1] Tue, 25 Jun 2024 15:00:43 UTC (2,605 KB)
