MoHAVE: Mixture of Hierarchical Audio-Visual Experts for Robust Speech Recognition

Kim, Sungnyun; Jang, Kangwook; Bae, Sangmin; Cho, Sungwoo; Yun, Se-Young

arXiv:2502.10447 (eess)

[Submitted on 11 Feb 2025]

Abstract: Audio-visual speech recognition (AVSR) has become critical for enhancing speech recognition in noisy environments by integrating both auditory and visual modalities. However, existing AVSR systems struggle to scale up without compromising computational efficiency. In this study, we introduce MoHAVE (Mixture of Hierarchical Audio-Visual Experts), a novel robust AVSR framework designed to address these scalability constraints. By leveraging a Mixture-of-Experts (MoE) architecture, MoHAVE activates modality-specific expert groups, ensuring dynamic adaptation to various audio-visual inputs with minimal computational overhead. Key contributions of MoHAVE include: (1) a sparse MoE framework that efficiently scales AVSR model capacity, (2) a hierarchical gating mechanism that dynamically utilizes the expert groups based on input context, enhancing adaptability and robustness, and (3) remarkable performance across robust AVSR benchmarks, including LRS3 and MuAViC transcription and translation tasks, setting a new standard for scalable speech recognition systems.
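
The abstract names two mechanisms, sparse expert activation and hierarchical gating over modality-specific expert groups, without giving their exact form. The following is a minimal illustrative sketch of that general idea, not the authors' implementation. Everything in it is an assumption made for illustration: the class name `HierarchicalMoE`, the two groups labeled "audio" and "visual", the softmax inter-group router, the top-k intra-group routers, the single-matrix experts, and the default sizes.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class HierarchicalMoE:
    """Toy two-level MoE layer (hypothetical; all design choices are assumptions).

    Level 1: an inter-group router weights two expert groups per token.
    Level 2: within each group, a router keeps only the top-k experts.
    """

    def __init__(self, dim, experts_per_group=4, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.top_k = top_k
        self.groups = ("audio", "visual")  # assumed group labels
        n_groups = len(self.groups)
        # Inter-group router: maps a token to one logit per group.
        self.w_group = rng.normal(0.0, 0.1, (dim, n_groups))
        # Intra-group routers: one per group, one logit per expert.
        self.w_expert = rng.normal(0.0, 0.1, (n_groups, dim, experts_per_group))
        # Each expert is a single dim x dim matrix, to keep the sketch small.
        self.experts = rng.normal(0.0, 0.1, (n_groups, experts_per_group, dim, dim))

    def forward(self, x):
        """x has shape (tokens, dim); returns (output, group_weights)."""
        group_w = softmax(x @ self.w_group)  # (tokens, n_groups)
        out = np.zeros_like(x)
        for g in range(len(self.groups)):
            logits = x @ self.w_expert[g]  # (tokens, experts)
            # Indices of the k largest logits per token (sparse activation).
            top = np.argsort(logits, axis=-1)[:, -self.top_k:]
            gate = softmax(np.take_along_axis(logits, top, axis=-1))
            for t in range(x.shape[0]):
                for j in range(self.top_k):
                    e = top[t, j]
                    # Scale the expert output by group weight and expert gate.
                    out[t] += group_w[t, g] * gate[t, j] * (x[t] @ self.experts[g, e])
        return out, group_w


if __name__ == "__main__":
    layer = HierarchicalMoE(dim=8)
    tokens = np.random.default_rng(1).normal(size=(5, 8))
    y, weights = layer.forward(tokens)
    print(y.shape, weights.sum(axis=1))
```

In this sketch each token runs through only `top_k` experts per group rather than all of them, so adding experts grows the parameter count without growing per-token compute. The group weights give the layer a way to lean on one group more than the other for a given input, for example when the audio is noisy.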

Comments:	Preliminary work
Subjects:	Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2502.10447 [eess.AS]
	(or arXiv:2502.10447v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2502.10447

Submission history

From: Sungnyun Kim
[v1] Tue, 11 Feb 2025 11:01:05 UTC (5,710 KB)
