Incorporating Context into Subword Vocabularies

Yehezkel, Shaked; Pinter, Yuval

arXiv:2210.07095 (cs)

[Submitted on 13 Oct 2022 (v1), last revised 10 Feb 2023 (this version, v2)]

Abstract: Most current popular subword tokenizers are trained based on word frequency statistics over a corpus, without considering information about co-occurrence or context. Nevertheless, the resulting vocabularies are used in language models' highly contextualized settings. We present SaGe, a tokenizer that tailors subwords for their downstream use by baking in the contextualized signal at the vocabulary creation phase. We show that SaGe does a better job than current widespread tokenizers in keeping token contexts cohesive, while not incurring a large price in terms of encoding efficiency or domain robustness. SaGe improves performance on English GLUE classification tasks as well as on NER, and on Inference and NER in Turkish, demonstrating its robustness to language properties such as morphological exponence and agglutination.

Comments:	EACL 2023
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2210.07095 [cs.CL]
	(or arXiv:2210.07095v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2210.07095

Submission history

From: Yuval Pinter
[v1] Thu, 13 Oct 2022 15:22:59 UTC (130 KB)
[v2] Fri, 10 Feb 2023 12:48:37 UTC (311 KB)
