SambaLingo: Teaching Large Language Models New Languages

Csaki, Zoltan; Li, Bo; Li, Jonathan; Xu, Qiantong; Pawakapan, Pian; Zhang, Leon; Du, Yun; Zhao, Hengyu; Hu, Changran; Thakker, Urmish

arXiv:2404.05829 (cs)

[Submitted on 8 Apr 2024 (v1), last revised 17 Jul 2024 (this version, v2)]

Abstract: Despite the widespread availability of LLMs, there remains a substantial gap in their capabilities and availability across diverse languages. One approach to address these issues has been to take an existing pre-trained LLM and continue to train it on new languages. While prior works have experimented with language adaptation, many questions around best practices and methodology have not been covered. In this paper, we present a comprehensive investigation into the adaptation of LLMs to new languages. Our study covers the key components in this process, including vocabulary extension, direct preference optimization and the data scarcity problem for human alignment in low-resource languages. We scale these experiments across 9 languages and 2 parameter scales (7B and 70B). We compare our models against Llama 2, Aya-101, XGLM, BLOOM and existing language experts, outperforming all prior published baselines. Additionally, all evaluation code and checkpoints are made public to facilitate future research.

Comments:	23 pages
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2404.05829 [cs.CL]
	(or arXiv:2404.05829v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2404.05829

Submission history

From: Zoltan Csaki
[v1] Mon, 8 Apr 2024 19:48:36 UTC (3,417 KB)
[v2] Wed, 17 Jul 2024 20:30:56 UTC (3,471 KB)
