Latxa: An Open Language Model and Evaluation Suite for Basque

Etxaniz, Julen; Sainz, Oscar; Perez, Naiara; Aldabe, Itziar; Rigau, German; Agirre, Eneko; Ormazabal, Aitor; Artetxe, Mikel; Soroa, Aitor

arXiv:2403.20266 (cs)

[Submitted on 29 Mar 2024 (v1), last revised 20 Sep 2024 (this version, v2)]

Abstract: We introduce Latxa, a family of large language models for Basque ranging from 7 to 70 billion parameters. Latxa is based on Llama 2, which we continue pretraining on a new Basque corpus comprising 4.3M documents and 4.2B tokens. Addressing the scarcity of high-quality benchmarks for Basque, we further introduce four multiple-choice evaluation datasets: EusProficiency, comprising 5,169 questions from official language proficiency exams; EusReading, comprising 352 reading comprehension questions; EusTrivia, comprising 1,715 trivia questions from 5 knowledge areas; and EusExams, comprising 16,774 questions from public examinations. In our extensive evaluation, Latxa outperforms all previous open models we compare to by a large margin. In addition, it is competitive with GPT-4 Turbo in language proficiency and understanding, despite lagging behind in reading comprehension and knowledge-intensive tasks. The Latxa family of models, our new pretraining corpora, and our evaluation datasets are all publicly available under open licenses. Our suite enables reproducible research on methods to build LLMs for low-resource languages.

Comments:	ACL 2024
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2403.20266 [cs.CL]
	(or arXiv:2403.20266v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2403.20266
Journal reference:	Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14952--14972. 2024

Submission history

From: Naiara Pérez Miguel
[v1] Fri, 29 Mar 2024 16:16:48 UTC (317 KB)
[v2] Fri, 20 Sep 2024 11:52:52 UTC (8,356 KB)
