Should you marginalize over possible tokenizations?

Chirkova, Nadezhda; Kruszewski, Germán; Rozen, Jos; Dymetman, Marc

arXiv:2306.17757 (cs)

[Submitted on 30 Jun 2023]

Abstract: Autoregressive language models (LMs) map token sequences to probabilities. The usual practice for computing the probability of any character string (e.g. English sentences) is to first transform it into a sequence of tokens that is scored by the model. However, there are exponentially many token sequences that represent any given string. To truly compute the probability of a string one should marginalize over all tokenizations, which is typically intractable. Here, we analyze whether the practice of ignoring the marginalization is justified. To this end, we devise an importance-sampling-based algorithm that allows us to compute estimates of the marginal probabilities and compare them to the default procedure in a range of state-of-the-art models and datasets. Our results show that the gap in log-likelihood is no larger than 0.5% in most cases, but that it becomes more pronounced for data with long complex words.
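
To make the quantities in the abstract concrete, the following is a minimal sketch on a toy problem. It is not the authors' algorithm or code. It assumes a hypothetical three-token vocabulary, a unigram "language model" with made-up probabilities, a greedy longest-match rule standing in for a default tokenizer, and a simple uniform proposal for the importance sampler. All names (`TOKEN_PROB`, `seq_prob`, `sample_tokenization`, and so on) are illustrative. The toy string is short enough that the exact marginal can be enumerated and compared with both the default score and the sampled estimate.

```python
import math
import random

# Hypothetical toy unigram "LM": token probabilities (including <eos>) sum to 1.
TOKEN_PROB = {"a": 0.2, "b": 0.2, "ab": 0.4, "<eos>": 0.2}
VOCAB = [t for t in TOKEN_PROB if t != "<eos>"]


def seq_prob(tokens):
    """Probability the toy model assigns to one token sequence followed by <eos>."""
    p = TOKEN_PROB["<eos>"]
    for t in tokens:
        p *= TOKEN_PROB[t]
    return p


def all_tokenizations(text):
    """Yield every way to split `text` into vocabulary tokens (exponential in general)."""
    if not text:
        yield []
        return
    for tok in VOCAB:
        if text.startswith(tok):
            for rest in all_tokenizations(text[len(tok):]):
                yield [tok] + rest


def default_tokenization(text):
    """Greedy longest-match split; an assumed stand-in for a real tokenizer's output."""
    tokens = []
    while text:
        tok = max((t for t in VOCAB if text.startswith(t)), key=len)
        tokens.append(tok)
        text = text[len(tok):]
    return tokens


def sample_tokenization(text, rng):
    """Sample a split from proposal q: choose uniformly among tokens matching the prefix.

    Assumes every character of `text` is itself a token, so sampling never dead-ends.
    Returns the sampled tokens and their proposal probability q(tokens).
    """
    tokens, q = [], 1.0
    while text:
        options = [t for t in VOCAB if text.startswith(t)]
        tok = rng.choice(options)
        q /= len(options)
        tokens.append(tok)
        text = text[len(tok):]
    return tokens, q


def importance_estimate(text, n_samples, seed=0):
    """Average of p(tokens) / q(tokens): an unbiased estimate of the marginal probability."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        tokens, q = sample_tokenization(text, rng)
        total += seq_prob(tokens) / q
    return total / n_samples


text = "abab"
exact = sum(seq_prob(t) for t in all_tokenizations(text))
default = seq_prob(default_tokenization(text))
estimate = importance_estimate(text, n_samples=20000)

print("default tokenization:", default_tokenization(text))
print("default probability: ", default)
print("exact marginal:      ", exact)
print("importance estimate: ", estimate)
print("log-likelihood gap:  ", math.log(exact) - math.log(default))
```

In this toy setting the string has four tokenizations, the default one scores 0.032, and the exact marginal is about 0.0387, so the default score is a lower bound on the true string probability. For a real model the enumeration is infeasible, which is why a sampling-based estimate is needed. A proposal that tracks the model more closely than the uniform one assumed here would typically give lower variance.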

Comments:	Accepted to ACL 2023
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2306.17757 [cs.CL]
	(or arXiv:2306.17757v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2306.17757

Submission history

From: Nadezhda Chirkova
[v1] Fri, 30 Jun 2023 16:09:01 UTC (6,983 KB)
