mHuBERT-147: A Compact Multilingual HuBERT Model

Boito, Marcely Zanon; Iyer, Vivek; Lagos, Nikolaos; Besacier, Laurent; Calapodescu, Ioan

arXiv:2406.06371 (cs)

[Submitted on 10 Jun 2024 (v1), last revised 23 Aug 2024 (this version, v4)]

Abstract: We present mHuBERT-147, the first general-purpose massively multilingual HuBERT speech representation model trained on 90K hours of clean, open-license data. To scale up the multi-iteration HuBERT approach, we use faiss-based clustering, achieving 5.2x faster label assignment than the original method. We also apply a new multilingual batching up-sampling strategy, leveraging both language and dataset diversity. After 3 training iterations, our compact 95M parameter mHuBERT-147 outperforms larger models trained on substantially more data. We rank second and first on the ML-SUPERB 10min and 1h leaderboards, with state-of-the-art (SOTA) scores for 3 tasks. Across ASR/LID tasks, our model consistently surpasses XLS-R (300M params; 436K hours) and demonstrates strong competitiveness against the much larger MMS (1B params; 491K hours). Our findings indicate that mHuBERT-147 is a promising model for multilingual speech tasks, offering an unprecedented balance between high performance and parameter efficiency.
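
The "label assignment" step mentioned in the abstract can be illustrated with a minimal sketch. HuBERT-style training derives discrete targets by mapping each feature frame to its nearest k-means centroid. The code below is not the authors' faiss pipeline: it is a brute-force, standard-library stand-in, and the function name `assign_labels` and the toy values are assumptions made purely for illustration.

```python
from math import inf


def assign_labels(frames, centroids):
    """Return, for each feature frame, the index of its nearest centroid.

    Hypothetical helper for illustration only; a faiss index would
    replace this brute-force search in a large-scale setting.
    """
    labels = []
    for frame in frames:
        best_index, best_distance = -1, inf
        for index, centroid in enumerate(centroids):
            # Squared Euclidean distance; skipping the square root
            # does not change which centroid is closest.
            distance = sum((f - c) ** 2 for f, c in zip(frame, centroid))
            if distance < best_distance:
                best_index, best_distance = index, distance
        labels.append(best_index)
    return labels


# Toy data: three 2-D "frames" and two centroids (all values are made up).
toy_centroids = [[0.0, 0.0], [10.0, 10.0]]
toy_frames = [[1.0, 0.5], [9.0, 11.0], [4.0, 4.0]]
toy_labels = assign_labels(toy_frames, toy_centroids)
print(toy_labels)
```

The search cost here grows with the number of frames times the number of centroids, which is why an optimized similarity-search library becomes attractive at the 90K-hour scale the abstract describes.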

Comments:	Extended version of the Interspeech 2024 paper of the same name
Subjects:	Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2406.06371 [cs.CL]
	(or arXiv:2406.06371v4 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2406.06371

Submission history

From: Marcely Zanon Boito
[v1] Mon, 10 Jun 2024 15:32:42 UTC (1,121 KB)
[v2] Tue, 11 Jun 2024 14:19:42 UTC (1,121 KB)
[v3] Thu, 27 Jun 2024 07:56:48 UTC (1,121 KB)
[v4] Fri, 23 Aug 2024 13:55:50 UTC (1,121 KB)
