Improving Estonian Text Simplification through Pretrained Language Models and Custom Datasets

Barbu, Eduard; Muru, Meeri-Ly; Malva, Sten Marcus

arXiv:2501.15624 (cs)

[Submitted on 26 Jan 2025]

Abstract: This study introduces an approach to Estonian text simplification using two model architectures: a neural machine translation model and a fine-tuned large language model (LLaMA). Given the limited resources for Estonian, we developed a new dataset, the Estonian Simplification Dataset, combining translated data and GPT-4.0-generated simplifications. We benchmarked OpenNMT, a neural machine translation model that frames text simplification as a translation task, and fine-tuned the LLaMA model on our dataset to tailor it specifically for Estonian simplification. Manual evaluations on the test set show that the LLaMA model consistently outperforms OpenNMT in readability, grammaticality, and meaning preservation. These findings underscore the potential of large language models for low-resource languages and provide a basis for further research in Estonian text simplification.
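The abstract describes two ways of using the same complex/simple sentence pairs: as a translation task (complex text as source, simplified text as target) and as fine-tuning data for a language model. The sketch below shows one way such pairs could be laid out for each setup. It is illustrative only: the example sentences, the instruction wording, the record keys (`prompt`, `response`), and both function names are assumptions made here, not the authors' actual data format or pipeline.

```python
from typing import Dict, List, Tuple

# Invented example pairs (complex, simple). These are NOT taken from the
# Estonian Simplification Dataset; they only illustrate the data shape.
PAIRS: List[Tuple[str, str]] = [
    ("Vaatamata tugevale vihmale otsustasid nad matka jätkata.",
     "Sadas tugevat vihma. Nad jätkasid matka."),
    ("Koosolek lükati edasi, kuna mitu osalejat haigestus.",
     "Mitu osalejat jäi haigeks. Koosolek toimub hiljem."),
]

# Hypothetical instruction text for the fine-tuning records
# ("Simplify the following sentence.").
INSTRUCTION = "Lihtsusta järgmine lause."


def to_parallel_lines(pairs: List[Tuple[str, str]]) -> Tuple[List[str], List[str]]:
    """Translation framing: line-aligned source and target lists.

    Line i of the source list is the complex sentence and line i of the
    target list is its simplification. Whitespace is collapsed so every
    example stays on a single line.
    """
    src, tgt = [], []
    for complex_sent, simple_sent in pairs:
        src.append(" ".join(complex_sent.split()))
        tgt.append(" ".join(simple_sent.split()))
    return src, tgt


def to_instruction_records(
    pairs: List[Tuple[str, str]], instruction: str = INSTRUCTION
) -> List[Dict[str, str]]:
    """Fine-tuning framing: one prompt/response record per pair."""
    return [
        {"prompt": f"{instruction}\n\n{complex_sent}", "response": simple_sent}
        for complex_sent, simple_sent in pairs
    ]


if __name__ == "__main__":
    source_lines, target_lines = to_parallel_lines(PAIRS)
    for s, t in zip(source_lines, target_lines):
        print(f"{s}\t{t}")
    for record in to_instruction_records(PAIRS):
        print(record)
```

In the translation framing the model only ever sees aligned source and target lines, whereas the fine-tuning framing wraps each pair in an explicit instruction; which prompt template the authors used is not stated in the abstract.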

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2501.15624 [cs.CL]
	(or arXiv:2501.15624v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2501.15624

Submission history

From: Eduard Barbu
[v1] Sun, 26 Jan 2025 18:10:20 UTC (72 KB)
