Warmstarting for Scaling Language Models

Mallik, Neeratyoy; Janowski, Maciej; Hog, Johannes; Rakotoarison, Herilalaina; Klein, Aaron; Grabocka, Josif; Hutter, Frank

arXiv:2411.07340 (cs)

[Submitted on 11 Nov 2024]

Abstract: Scaling model sizes to scale performance has worked remarkably well for the current large language model paradigm. The research and empirical findings of various scaling studies have led to novel scaling results and laws that guide subsequent research. High training costs for contemporary scales of data and models result in a lack of thorough understanding of how to tune and arrive at such training setups. One direction to ameliorate the cost of pretraining large models is to warmstart the large-scale training from smaller models that are cheaper to tune. In this work, we attempt to understand if the behavior of optimal hyperparameters can be retained under warmstarting for scaling. We explore simple operations that allow the application of theoretically motivated methods of zero-shot transfer of optimal hyperparameters using µTransfer. We investigate the aspects that contribute to the speedup in convergence and the preservation of stable training dynamics under warmstarting with µTransfer. We find that shrinking smaller model weights, zero-padding, and perturbing the resulting larger model with scaled initialization from µP enable effective warmstarting of µTransfer.
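The three operations named at the end of the abstract (shrink, zero-pad, perturb) can be illustrated on a single weight matrix. What follows is only a minimal sketch under stated assumptions, not the authors' implementation: the helper name `warmstart_weight`, the shrink factor of 0.5, the base standard deviation of 0.02, and the 1/sqrt(fan-in) scaling of the perturbation (a common µP-style rule for hidden weights) are all hypothetical choices that the abstract does not specify. Plain Python lists are used so the example has no dependencies.

```python
import math
import random


def warmstart_weight(w_small, out_dim, in_dim, shrink=0.5, base_std=0.02, seed=0):
    """Hypothetical helper: grow a small weight matrix into a larger one.

    w_small  : list of rows (list of lists of floats) from the smaller model
    shrink   : assumed factor applied to the small model's weights
    base_std : assumed init std at the small model's width
    """
    rng = random.Random(seed)
    small_out, small_in = len(w_small), len(w_small[0])
    if out_dim < small_out or in_dim < small_in:
        raise ValueError("target shape must be at least as large as the source shape")

    # Assumption: muP-style hidden-weight init, std shrinking as 1/sqrt(fan_in).
    std = base_std * math.sqrt(small_in / in_dim)

    w_large = []
    for i in range(out_dim):
        row = []
        for j in range(in_dim):
            # Steps 1 and 2: shrunk copy inside the old block, zero padding elsewhere.
            inside = i < small_out and j < small_in
            base = shrink * w_small[i][j] if inside else 0.0
            # Step 3: perturb every entry with the width-scaled Gaussian noise.
            row.append(base + rng.gauss(0.0, std))
        w_large.append(row)
    return w_large


if __name__ == "__main__":
    small = [[1.0, 2.0], [3.0, 4.0]]
    large = warmstart_weight(small, out_dim=4, in_dim=4)
    for row in large:
        print(["%.3f" % v for v in row])
```

Setting `base_std=0.0` in this sketch disables the perturbation, which makes it easy to check that the top-left block holds the shrunk weights and the padded entries are exactly zero.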

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2411.07340 [cs.LG]
	(or arXiv:2411.07340v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2411.07340

Submission history

From: Neeratyoy Mallik
[v1] Mon, 11 Nov 2024 20:02:29 UTC (28,100 KB)
