LlamAr & GemmAr: Enhancing LLMs Through Arabic Instruction-Tuning

Chouikhi, Hasna; Aloui, Manel; Hammou, Cyrine Ben; Chaabane, Ghaith; Kchaou, Haithem; Dhaouadi, Chehir

Abstract:Large language models (LLMs) have greatly impacted the natural language processing (NLP) field, particularly for the English language. These models have demonstrated capabilities in understanding and generating human-like text. The success of language models largely depends on the availability of high-quality instruction datasets, which consist of detailed task descriptions and corresponding responses that are essential for training the models to accurately address a variety of prompts. However, the availability and quality of these resources vary by language. While models perform well in English, they often struggle with languages like Arabic, due to the lack of datasets for fine-tuning Arabic-specific tasks. To address this issue, we introduce InstAr-500k, a new Arabic instruction dataset created by generating and collecting content that covers several domains and instruction types. We then assess this dataset by fine-tuning two open-source models, Llama-3-8B-Instruct and Gemma-7B-IT, on several downstream tasks to scale improvements in their functionality. Based on multiple evaluations, our fine-tuned models achieve state-of-the-art performance on several Arabic NLP benchmarks. These outcomes emphasize the effectiveness of our dataset in elevating the capabilities of language models for Arabic. Our instruction dataset bridges the performance gap between English and Arabic language models by providing resources that amplify Arabic NLP development. Building on this foundation, we developed two state-of-the-art models, LlamAr-8B and GemmAr-7B, which are specifically tuned to excel at a wide range of Arabic NLP tasks.

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2407.02147 [cs.CL]
	(or arXiv:2407.02147v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2407.02147

✅2024-10-01: arxiv.org is back to normal.✅

Computer Science > Computation and Language

Title:LlamAr & GemmAr: Enhancing LLMs Through Arabic Instruction-Tuning

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators