Vocabulary Expansion of Chat Models with Unlabeled Target Language Data

Yamaguchi, Atsuki; Morishita, Terufumi; Villavicencio, Aline; Aletras, Nikolaos

arXiv:2412.11704 (cs)

[Submitted on 16 Dec 2024 (v1), last revised 18 Dec 2024 (this version, v2)]

Abstract: Chat models (i.e., language models trained to follow instructions through conversation with humans) outperform base models (i.e., models trained solely on unlabeled data) in both conversation and general task-solving abilities. These models are generally English-centric and require further adaptation for languages that are underrepresented in or absent from their training data. A common technique for adapting base models is vocabulary expansion (VE): extending the model's vocabulary with target language tokens and then continually pre-training it on language-specific data. Chat data is ideal for adapting chat models, but it often either does not exist or is costly to construct. Adapting chat models with unlabeled data is a possible alternative, but it could result in catastrophic forgetting. In this paper, we investigate for the first time the impact of using unlabeled target language data for VE on chat models. We first show that off-the-shelf VE generally performs well across target language tasks and models, in 71% of cases, though it underperforms in scenarios where source chat models are already strong. To further improve adapted models, we propose post-hoc techniques that inject information from the source model without requiring any further training. Experiments reveal the effectiveness of our methods, which help the adapted models achieve performance improvements in 87% of cases.
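
As a rough illustration of the vocabulary-extension step mentioned in the abstract, the toy sketch below adds new tokens to a vocabulary and gives each one an embedding row. It is not the authors' implementation. The function name expand_vocabulary, the character-level tokenizer, and the choice to initialize each new row as the mean of the embeddings of its source-tokenizer pieces are all assumptions made for this example, and the continual pre-training step that follows VE is not shown.

```python
from statistics import fmean


def expand_vocabulary(vocab, embeddings, new_tokens, tokenize):
    """Return copies of vocab and embeddings extended with new_tokens.

    Assumed heuristic (not taken from the paper): each new embedding row
    is the mean of the rows of the pieces that the source tokenizer
    produces for the new token.
    """
    vocab = dict(vocab)                          # leave the inputs untouched
    embeddings = [row[:] for row in embeddings]
    for token in new_tokens:
        if token in vocab:
            continue                             # already in the source vocabulary
        pieces = tokenize(token)
        rows = [embeddings[vocab[piece]] for piece in pieces]
        new_row = [fmean(column) for column in zip(*rows)]
        vocab[token] = len(embeddings)           # next free token id
        embeddings.append(new_row)
    return vocab, embeddings


# Hypothetical toy setup: three character tokens with 2-d embeddings.
source_vocab = {"a": 0, "b": 1, "c": 2}
source_embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]


def char_tokenize(text):
    """Stand-in for the source tokenizer: split into characters."""
    return list(text)


new_vocab, new_embeddings = expand_vocabulary(
    source_vocab, source_embeddings, ["ab", "abc", "a"], char_tokenize
)
```

In this toy setup the tokens "ab" and "abc" receive ids 3 and 4, while "a" is skipped because the source vocabulary already contains it. In a real model the same idea would apply to the input embedding matrix and, where it is untied, the output projection.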

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2412.11704 [cs.CL]
	(or arXiv:2412.11704v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2412.11704

Submission history

From: Atsuki Yamaguchi
[v1] Mon, 16 Dec 2024 12:26:28 UTC (511 KB)
[v2] Wed, 18 Dec 2024 12:29:11 UTC (511 KB)
