Understanding and Mitigating Tokenization Bias in Language Models

Phan, Buu; Havasi, Marton; Muckley, Matthew; Ullrich, Karen

arXiv:2406.16829 (cs)

[Submitted on 24 Jun 2024 (v1), last revised 5 Jul 2024 (this version, v2)]

Abstract: State-of-the-art language models are autoregressive and operate on subword units known as tokens. Specifically, one must encode the conditioning string into a list of tokens before passing it to the language model for next-token prediction. We show that popular encoding schemes, such as maximum prefix encoding (MPE) and byte-pair-encoding (BPE), induce a sampling bias that cannot be mitigated with more training or data. To counter this universal problem, for each encoding scheme above, we propose a novel algorithm to obtain unbiased estimates from any language model trained on tokenized data. Our methods do not require finetuning the model, and the complexity, defined as the number of model runs, scales linearly with the sequence length in the case of MPE. As a result, we show that one can simulate token-free behavior from a tokenized language model. We empirically verify the correctness of our method through a Markov-chain setup, where it accurately recovers the transition probabilities, as opposed to the conventional method of directly prompting tokens into the language model.
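The bias described in the abstract can be illustrated with a toy experiment. The sketch below is not the paper's algorithm; it only shows how maximum prefix encoding can distort next-token statistics. The three-entry vocabulary, the transition probabilities, and all function names are assumptions chosen for illustration. With the vocabulary {"AA", "A", "B"}, the greedy encoder emits the token "A" only when the next character is not "A", so the token that follows "A" is always "B", even though the underlying character chain often moves from "A" to "A".

```python
import random
from collections import Counter

# Hypothetical toy vocabulary (an assumption for this sketch).
VOCAB = ["AA", "A", "B"]


def mpe_encode(text, vocab=VOCAB):
    """Greedy maximum prefix encoding: at each position, take the
    longest vocabulary entry that is a prefix of the remaining text."""
    by_length = sorted(vocab, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        for tok in by_length:
            if text.startswith(tok, i):
                tokens.append(tok)
                i += len(tok)
                break
        else:
            raise ValueError(f"no vocabulary entry matches at position {i}")
    return tokens


def sample_chain(n, p_stay_a=0.6, p_stay_b=0.7, seed=0):
    """Sample n characters from a two-state Markov chain over {A, B}.
    The stay probabilities are illustrative assumptions."""
    rng = random.Random(seed)
    state = "A"
    chars = [state]
    for _ in range(n - 1):
        stay = p_stay_a if state == "A" else p_stay_b
        if rng.random() >= stay:
            state = "B" if state == "A" else "A"
        chars.append(state)
    return "".join(chars)


def followers(seq, context):
    """Count which item follows each occurrence of `context` in `seq`."""
    return Counter(nxt for cur, nxt in zip(seq, seq[1:]) if cur == context)


text = sample_chain(20000)
tokens = mpe_encode(text)

char_after_a = followers(text, "A")     # character-level statistics
token_after_a = followers(tokens, "A")  # token-level statistics

print("characters following 'A':", dict(char_after_a))
print("tokens following 'A':    ", dict(token_after_a))
```

At the character level, "A" is followed by both "A" and "B", but at the token level only "B" ever follows the token "A". A model trained on the token stream would learn that skewed conditional no matter how much data it sees, which is consistent with the abstract's claim that the bias cannot be removed by more training or data.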

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2406.16829 [cs.CL]
	(or arXiv:2406.16829v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2406.16829

Submission history

From: Buu Phan
[v1] Mon, 24 Jun 2024 17:38:02 UTC (454 KB)
[v2] Fri, 5 Jul 2024 21:49:08 UTC (936 KB)
