Unsupervised Tokenization Learning

Kolonin, Anton; Ramesh, Vignav

Computer Science > Computation and Language

arXiv:2205.11443 (cs)

[Submitted on 23 May 2022 (v1), last revised 15 Dec 2022 (this version, v4)]

Title: Unsupervised Tokenization Learning

Authors: Anton Kolonin, Vignav Ramesh

Abstract: In the presented study, we find that the so-called "transition freedom" metric appears superior for unsupervised tokenization in comparison to statistical metrics such as mutual information and conditional probability, providing F-measure scores in the range from 0.71 to 1.0 across the explored multilingual corpora. We find that different languages require different offshoots of that metric (such as derivative, variance, and "peak values") for successful tokenization. Larger training corpora do not necessarily result in better tokenization quality, while compressing the models by eliminating statistically weak evidence tends to improve performance. The proposed unsupervised tokenization technique provides quality better than or comparable to lexicon-based ones, depending on the language.
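
The abstract names the "transition freedom" metric without defining it. As a rough illustration only, the sketch below assumes that the transition freedom of an n-gram is the number of distinct characters observed immediately after it in the training corpus, and that a token boundary is placed after an n-gram whose freedom is a strict local peak (one reading of the "peak values" offshoot). The function names, the toy corpus, and the set-based F-measure are hypothetical choices made for this example; this is not the authors' implementation.

```python
from collections import defaultdict


def forward_transition_freedom(corpus, n=1):
    """Map each n-gram to the number of distinct characters seen right after it.

    Assumption: this count is what "transition freedom" means here.
    """
    successors = defaultdict(set)
    for text in corpus:
        for i in range(len(text) - n):
            successors[text[i:i + n]].add(text[i + n])
    return {gram: len(chars) for gram, chars in successors.items()}


def split_at_freedom_peaks(text, freedom, n=1):
    """Cut the text right after every n-gram whose freedom is a strict local peak."""
    # profile[k] is the freedom of the n-gram text[k:k + n]; unseen n-grams get 0.
    profile = [freedom.get(text[k:k + n], 0) for k in range(len(text) - n + 1)]
    tokens, start = [], 0
    for k in range(1, len(profile) - 1):
        if profile[k] > profile[k - 1] and profile[k] > profile[k + 1]:
            cut = k + n  # the boundary falls just after the peaking n-gram
            tokens.append(text[start:cut])
            start = cut
    tokens.append(text[start:])
    return tokens


def token_f_measure(predicted, reference):
    """F1 over the sets of distinct predicted and reference tokens (an assumed scoring)."""
    pred, ref = set(predicted), set(reference)
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Toy corpus (made up for illustration): the space is followed by three
# different letters, so it has the highest transition freedom.
corpus = ["a cat", "a dog", "a pig"]
freedom = forward_transition_freedom(corpus)
tokens = split_at_freedom_peaks("a dog", freedom)
score = token_f_measure(tokens, ["a", " ", "dog"])
```

On this toy corpus the only peak is at the space, so the sketch cuts right after it and the space stays attached to the preceding token. A fuller treatment would presumably also use transitions in the backward direction and the other offshoots the abstract lists (derivative, variance).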

Comments:	16 pages, 9 figures; Paper accepted to the EMNLP 2022 conference
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
Cite as:	arXiv:2205.11443 [cs.CL]
	(or arXiv:2205.11443v4 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2205.11443

Submission history

From: Anton Kolonin Dr.
[v1] Mon, 23 May 2022 16:33:41 UTC (2,360 KB)
[v2] Sun, 9 Oct 2022 15:23:03 UTC (5,448 KB)
[v3] Thu, 13 Oct 2022 18:48:25 UTC (5,448 KB)
[v4] Thu, 15 Dec 2022 17:26:00 UTC (5,448 KB)
