Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling

Huang, Hongzhi; Zhu, Defa; Wu, Banggu; Zeng, Yutao; Wang, Ya; Min, Qiyang; Zhou, Xun

arXiv:2501.16975 (cs)

[Submitted on 28 Jan 2025]

Abstract: Tokenization is a fundamental component of large language models (LLMs), yet its influence on model scaling and performance is not fully explored. In this paper, we introduce Over-Tokenized Transformers, a novel framework that decouples input and output vocabularies to improve language modeling performance. Specifically, our approach scales up input vocabularies to leverage multi-gram tokens. Through extensive experiments, we uncover a log-linear relationship between input vocabulary size and training loss, demonstrating that larger input vocabularies consistently enhance model performance, regardless of model size. Using a large input vocabulary, we achieve performance comparable to double-sized baselines with no additional cost. Our findings highlight the importance of tokenization in scaling laws and provide practical insight for tokenizer design, paving the way for more efficient and powerful LLMs.
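
The idea of decoupling the input vocabulary from the output vocabulary can be illustrated with a small sketch. The abstract does not say how multi-gram tokens are indexed or combined, so everything below is an assumption made for illustration: n-gram ids are folded into fixed-size tables by modulo hashing, the per-n embeddings are summed, and the names `ngram_ids`, `make_table`, and `input_embeddings` are hypothetical. It is not the authors' implementation.

```python
import random

def ngram_ids(token_ids, n, base_vocab, table_size):
    """Return one hashed id per position for the n-gram ending there.

    Assumption: positions without enough history are left-padded with token 0.
    """
    ids = []
    for i in range(len(token_ids)):
        gram_id = 0
        for j in range(i - n + 1, i + 1):
            tok = token_ids[j] if j >= 0 else 0
            gram_id = gram_id * base_vocab + tok  # exact index in base_vocab**n
        ids.append(gram_id % table_size)          # fold into a finite table
    return ids

def make_table(rows, dim, seed):
    """Randomly initialised embedding table, stored as a list of rows."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rows)]

def input_embeddings(token_ids, tables, base_vocab):
    """Sum the 1-gram, 2-gram, ... embeddings at each position."""
    dim = len(tables[0][0])
    out = [[0.0] * dim for _ in token_ids]
    for n, table in enumerate(tables, start=1):
        rows = ngram_ids(token_ids, n, base_vocab, len(table))
        for pos, row in enumerate(rows):
            for d in range(dim):
                out[pos][d] += table[row][d]
    return out

BASE_VOCAB = 100  # the output (prediction) vocabulary keeps this size
DIM = 8
tables = [
    make_table(BASE_VOCAB, DIM, seed=1),  # 1-gram: ordinary token embedding
    make_table(5003, DIM, seed=2),        # 2-gram table, hashed
    make_table(5003, DIM, seed=3),        # 3-gram table, hashed
]
tokens = [7, 42, 42, 99]
embedded = input_embeddings(tokens, tables, BASE_VOCAB)
```

In this sketch only the input side grows: the effective input vocabulary covers 1-, 2-, and 3-grams, while the model would still predict over the `BASE_VOCAB` single tokens. Each extra table adds one lookup per position, so the added per-token cost stays small.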

Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as: arXiv:2501.16975 [cs.CL] (or arXiv:2501.16975v1 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2501.16975

Submission history

From: Hongzhi Huang
[v1] Tue, 28 Jan 2025 14:15:42 UTC (4,389 KB)
