Overtrained Language Models Are Harder to Fine-Tune

Springer, Jacob Mitchell; Goyal, Sachin; Wen, Kaiyue; Kumar, Tanishq; Yue, Xiang; Malladi, Sadhika; Neubig, Graham; Raghunathan, Aditi


arXiv:2503.19206 (cs)

[Submitted on 24 Mar 2025 (v1), last revised 28 Mar 2025 (this version, v2)]


Abstract: Large language models are pre-trained on ever-growing token budgets under the assumption that better pre-training performance translates to improved downstream models. In this work, we challenge this assumption and show that extended pre-training can make models harder to fine-tune, leading to degraded final performance. We term this phenomenon catastrophic overtraining. For example, the instruction-tuned OLMo-1B model pre-trained on 3T tokens leads to over 2% worse performance on multiple standard LLM benchmarks than its 2.3T token counterpart. Through controlled experiments and theoretical analysis, we show that catastrophic overtraining arises from a systematic increase in the broad sensitivity of pre-trained parameters to modifications, including but not limited to fine-tuning. Our findings call for a critical reassessment of pre-training design that considers the downstream adaptability of the model.

Comments:	72 pages, 65 figures, 6 tables
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2503.19206 [cs.CL]
	(or arXiv:2503.19206v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2503.19206

Submission history

From: Jacob Springer
[v1] Mon, 24 Mar 2025 23:11:56 UTC (2,960 KB)
[v2] Fri, 28 Mar 2025 02:10:05 UTC (2,960 KB)
