Development of Pre-Trained Transformer-based Models for the Nepali Language

Thapa, Prajwal; Nyachhyon, Jinu; Sharma, Mridul; Bal, Bal Krishna

arXiv:2411.15734 (cs)

[Submitted on 24 Nov 2024]

Abstract: Transformer-based pre-trained language models have dominated the field of Natural Language Processing (NLP) for quite some time now. However, the Nepali language, spoken by approximately 32 million people worldwide, remains significantly underrepresented in this domain. This underrepresentation is primarily attributed to the scarcity of monolingual data corpora and limited available resources for the Nepali language. While existing efforts have predominantly concentrated on basic encoder-based models, there is a notable gap in the exploration of decoder-based architectures. To address this gap, we have collected 27.5 GB of Nepali text data, approximately 2.4x larger than any previously available Nepali language corpus. Leveraging this data, we pre-trained three different models, namely BERT, RoBERTa, and GPT-2, exclusively for the Nepali language. Furthermore, we performed instruction tuning and explored its potential for monolingual Nepali data, providing a foundation for future research. Our models outperformed the existing best model by 2 points on the Nep-gLUE benchmark, scoring 95.60, and also outperformed existing models on text generation tasks, demonstrating improvements in both understanding and generating Nepali text.

Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2411.15734 [cs.CL]
	(or arXiv:2411.15734v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2411.15734

Submission history

From: Mridul Sharma
[v1] Sun, 24 Nov 2024 06:38:24 UTC (341 KB)
