Pipeline Analysis for Developing Instruct LLMs in Low-Resource Languages: A Case Study on Basque

Corral, Ander; Sarasua, Ixak; Saralegi, Xabier

Computer Science > Computation and Language

arXiv:2412.13922 (cs)

[Submitted on 18 Dec 2024]

Abstract: Large language models (LLMs) are typically optimized for resource-rich languages like English, exacerbating the gap between high-resource and underrepresented languages. This work presents a detailed analysis of strategies for developing a model capable of following instructions in a low-resource language, specifically Basque, by focusing on three key stages: pre-training, instruction tuning, and alignment with human preferences. Our findings demonstrate that continual pre-training with a high-quality Basque corpus of around 600 million words improves natural language understanding (NLU) of the foundational model by over 12 points. Moreover, instruction tuning and human preference alignment using automatically translated datasets proved highly effective, resulting in a 24-point improvement in instruction-following performance. The resulting models, Llama-eus-8B and Llama-eus-8B-instruct, establish a new state-of-the-art for Basque in the sub-10B parameter category.

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2412.13922 [cs.CL]
	(or arXiv:2412.13922v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2412.13922

Submission history

From: Xabier Saralegi
[v1] Wed, 18 Dec 2024 15:05:59 UTC (120 KB)
