BgGPT 1.0: Extending English-centric LLMs to other languages

Alexandrov, Anton; Raychev, Veselin; Dimitrov, Dimitar I.; Zhang, Ce; Vechev, Martin; Toutanova, Kristina

arXiv:2412.10893 (cs)

[Submitted on 14 Dec 2024]

Abstract: We present BgGPT-Gemma-2-27B-Instruct and BgGPT-Gemma-2-9B-Instruct: continually pretrained and fine-tuned versions of Google's Gemma-2 models, specifically optimized for Bulgarian language understanding and generation. Leveraging Gemma-2's multilingual capabilities and over 100 billion tokens of Bulgarian and English text data, our models demonstrate strong performance in Bulgarian language tasks, setting a new standard for language-specific AI models. Our approach maintains the robust capabilities of the original Gemma-2 models, ensuring that the English language performance remains intact. To preserve the base model capabilities, we incorporate continual learning strategies based on recent Branch-and-Merge techniques as well as thorough curation and selection of training data. We provide detailed insights into our methodology, including the release of model weights with a commercial-friendly license, enabling broader adoption by researchers, companies, and hobbyists. Further, we establish a comprehensive set of benchmarks based on non-public educational data sources to evaluate models on Bulgarian language tasks as well as safety and chat capabilities. Our findings demonstrate the effectiveness of fine-tuning state-of-the-art models like Gemma-2 to enhance language-specific AI applications while maintaining cross-lingual capabilities.

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2412.10893 [cs.CL]
	(or arXiv:2412.10893v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2412.10893

Submission history

From: Anton Alexandrov
[v1] Sat, 14 Dec 2024 16:49:52 UTC (5,260 KB)
