Towards Robust Speech Representation Learning for Thousands of Languages

Chen, William; Zhang, Wangyou; Peng, Yifan; Li, Xinjian; Tian, Jinchuan; Shi, Jiatong; Chang, Xuankai; Maiti, Soumi; Livescu, Karen; Watanabe, Shinji

arXiv:2407.00837 (cs)

[Submitted on 30 Jun 2024 (v1), last revised 2 Jul 2024 (this version, v2)]

Abstract: Self-supervised learning (SSL) has helped extend speech technologies to more languages by reducing the need for labeled data. However, models are still far from supporting the world's 7000+ languages. We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages, extending the language coverage of SSL models 4-fold. We combine 1 million hours of speech from existing publicly accessible corpora with a newly created corpus of 7400+ hours from 4057 languages, which will be publicly released. To handle the diverse conditions of multilingual speech data, we augment the typical SSL masked prediction approach with a novel dereverberation objective, increasing robustness. We evaluate XEUS on several benchmarks, and show that it consistently outperforms or achieves comparable results to state-of-the-art (SOTA) SSL models across a variety of tasks. XEUS sets a new SOTA on the ML-SUPERB benchmark: it outperforms MMS 1B and w2v-BERT 2.0 v2 by 0.8% and 4.4% respectively, despite having fewer parameters or less pre-training data. Checkpoints, code, and data are found in this https URL.

Comments:	Updated affiliations; 20 pages
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2407.00837 [cs.CL]
	(or arXiv:2407.00837v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2407.00837

Submission history

From: William Chen
[v1] Sun, 30 Jun 2024 21:40:26 UTC (442 KB)
[v2] Tue, 2 Jul 2024 17:23:44 UTC (442 KB)
