Prune or Retrain: Optimizing the Vocabulary of Multilingual Models for Estonian

Dorkin, Aleksei; Purason, Taido; Sirts, Kairit


arXiv:2501.02631 (cs)

[Submitted on 5 Jan 2025]



Abstract: Adapting multilingual language models to specific languages can enhance both their efficiency and performance. In this study, we explore how modifying the vocabulary of a multilingual encoder model to better suit the Estonian language affects its downstream performance on the Named Entity Recognition (NER) task. The motivations for adjusting the vocabulary are twofold: practical benefits affecting the computational cost, such as reducing the input sequence length and the model size, and performance enhancements by tailoring the vocabulary to the particular language. We evaluate the effectiveness of two vocabulary adaptation approaches -- retraining the tokenizer and pruning unused tokens -- and assess their impact on the model's performance, particularly after continual training. While retraining the tokenizer degraded performance on the NER task, suggesting that longer embedding tuning might be needed, we observed no negative effects from pruning.
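
To make the pruning idea concrete, the following is a minimal sketch of removing unused tokens from a vocabulary and its embedding table. It is an illustration only, not the authors' implementation: the function name `prune_vocabulary`, the toy vocabulary, and the use of plain Python lists in place of a real model's embedding matrix are all assumptions made for this example.

```python
def prune_vocabulary(vocab, embeddings, corpus_token_ids, always_keep=()):
    """Keep only tokens seen in the corpus (plus protected tokens).

    vocab: dict mapping token string -> integer id (hypothetical toy vocabulary)
    embeddings: list of embedding rows, indexed by token id
    corpus_token_ids: iterable of token ids observed in target-language text
    always_keep: tokens to retain even if unseen (e.g. special tokens)
    """
    used_ids = set(corpus_token_ids) | {vocab[tok] for tok in always_keep}
    kept_ids = sorted(used_ids)  # preserve the original relative order

    # Map every surviving old id to a new, contiguous id.
    old_to_new = {old: new for new, old in enumerate(kept_ids)}

    id_to_token = {idx: tok for tok, idx in vocab.items()}
    new_vocab = {id_to_token[old]: new for old, new in old_to_new.items()}

    # Copy the surviving embedding rows unchanged; pruning does not retrain them.
    new_embeddings = [embeddings[old] for old in kept_ids]
    return new_vocab, new_embeddings, old_to_new


# Toy data: a six-token "multilingual" vocabulary with 2-dimensional embeddings.
vocab = {"<pad>": 0, "<unk>": 1, "▁tere": 2, "▁hello": 3, "▁maailm": 4, "▁bonjour": 5}
embeddings = [[float(i), float(i) + 0.5] for i in range(len(vocab))]

# Token ids observed when tokenizing some Estonian text (assumed for illustration).
corpus_token_ids = [2, 4, 2]

new_vocab, new_embeddings, old_to_new = prune_vocabulary(
    vocab, embeddings, corpus_token_ids, always_keep=("<pad>", "<unk>")
)

# Existing token id sequences must be remapped to the new ids.
remapped_ids = [old_to_new[i] for i in corpus_token_ids]
```

Because the surviving embedding rows are copied as they are, the pruned model represents every retained token exactly as before; only the size of the embedding table shrinks. A retrained tokenizer, by contrast, introduces tokens that have no existing embeddings and need further tuning.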

Comments:	Published in the Proceedings of the 9th International Workshop on Computational Linguistics for Uralic Languages
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2501.02631 [cs.CL]
	(or arXiv:2501.02631v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2501.02631

Submission history

From: Aleksei Dorkin
[v1] Sun, 5 Jan 2025 19:21:45 UTC (27 KB)
