Tamil-Llama: A New Tamil Language Model Based on Llama 2

Balachandran, Abhinand

arXiv:2311.05845 (cs)

[Submitted on 10 Nov 2023]

Abstract: Language modeling has witnessed remarkable advancements in recent years, with Large Language Models (LLMs) like ChatGPT setting unparalleled benchmarks in human-like text generation. However, a prevailing limitation is the underrepresentation of languages like Tamil in these cutting-edge models, leading to suboptimal performance in diverse linguistic contexts. This paper addresses this lacuna, enhancing the open-source LLaMA model with an addition of 16,000 Tamil tokens, aiming to achieve superior text generation and comprehension in the Tamil language. We strategically employ the LoRA methodology for efficient model training on a comprehensive Tamil corpus, ensuring computational feasibility and model robustness. Moreover, we introduce a Tamil-translated version of the Alpaca dataset and a subset of the OpenOrca dataset tailored for instruction fine-tuning. Our results showcase significant performance improvements in Tamil text generation, with potential implications for the broader landscape of LLMs in Indian languages. We further underscore our commitment to open research by making our models, datasets, and code publicly accessible, fostering further innovations in language modeling.
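
The two techniques named in the abstract, vocabulary extension and LoRA, can be put in rough numerical terms. The sketch below is illustrative arithmetic only and is not the paper's code. It assumes the Llama 2 base vocabulary of 32,000 tokens and the 7B model's hidden size of 4,096, assumes the 16,000 added Tamil tokens do not overlap with the base vocabulary, and uses a hypothetical LoRA rank of 64 that is not taken from the abstract. The helper names `embedding_params` and `lora_params` are invented for this example.

```python
# Illustrative arithmetic only. The 16,000 added tokens come from the abstract;
# every other constant is an assumption stated in the comments.

def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Number of weights in a vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Weights in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank

BASE_VOCAB = 32_000    # Llama 2 tokenizer size
ADDED_TOKENS = 16_000  # stated in the abstract
HIDDEN = 4_096         # hidden size of Llama 2 7B (assumed base model)
RANK = 64              # hypothetical LoRA rank, chosen for illustration

# Vocabulary extension: new rows are appended to the embedding matrix.
# Assumes no overlap between the added tokens and the base vocabulary.
new_vocab = BASE_VOCAB + ADDED_TOKENS
extra_embedding = (
    embedding_params(new_vocab, HIDDEN) - embedding_params(BASE_VOCAB, HIDDEN)
)

# LoRA: instead of updating a full HIDDEN x HIDDEN projection W,
# train a low-rank update B @ A and keep W frozen.
full = HIDDEN * HIDDEN
adapter = lora_params(HIDDEN, HIDDEN, RANK)

print(f"vocabulary size after extension: {new_vocab}")
print(f"extra embedding weights:         {extra_embedding}")
print(f"full projection weights:         {full}")
print(f"LoRA adapter weights:            {adapter}")
print(f"adapter / full ratio:            {adapter / full}")
```

Under these assumptions the adapter for a single square projection holds 1/32 of the weights of the full matrix, which is the kind of saving that makes LoRA training computationally feasible.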

Comments:	19 pages, 10 figures
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2311.05845 [cs.CL]
	(or arXiv:2311.05845v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2311.05845

Submission history

From: Abhinand Balachandran
[v1] Fri, 10 Nov 2023 03:02:39 UTC (366 KB)
