Matina: A Large-Scale 73B Token Persian Text Corpus

Hosseinbeigi, Sara Bourbour; Taherinezhad, Fatemeh; Faili, Heshaam; Baghbani, Hamed; Nadi, Fatemeh; Amiri, Mostafa

arXiv:2502.09188 (cs)

[Submitted on 13 Feb 2025]

Abstract: Text corpora are essential for training models for tasks such as summarization and translation, and for training large language models (LLMs). While many monolingual and multilingual datasets have been collected for other languages, Persian has often been underrepresented because of limited resources for data collection and preprocessing. Existing Persian datasets are typically small and lack content diversity, consisting mainly of weblogs and news articles. This shortage of high-quality, varied data has slowed the development of NLP models and open-source LLMs for Persian. Since model performance depends heavily on the quality of training data, we address this gap by introducing the Matina corpus, a new Persian dataset of 72.9B tokens, carefully preprocessed and deduplicated to ensure high data quality. We further assess its effectiveness by training and evaluating transformer-based models on key NLP tasks. Both the dataset and the preprocessing code are publicly available, enabling researchers to build on and improve this resource for future Persian NLP work.

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2502.09188 [cs.CL]
	(or arXiv:2502.09188v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2502.09188

Submission history

From: Fatemeh Taherinezhad
[v1] Thu, 13 Feb 2025 11:22:19 UTC (280 KB)
