The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws

Jin, Tian; Humayun, Ahmed Imtiaz; Evci, Utku; Subramanian, Suvinay; Yazdanbakhsh, Amir; Alistarh, Dan; Dziugaite, Gintare Karolina

arXiv:2501.12486 (cs)

[Submitted on 21 Jan 2025]


Abstract: Pruning eliminates unnecessary parameters in neural networks; it offers a promising solution to the growing computational demands of large language models (LLMs). While many focus on post-training pruning, sparse pre-training, which combines pruning and pre-training into a single phase, provides a simpler alternative. In this work, we present the first systematic exploration of optimal sparse pre-training configurations for LLMs through an examination of 80 unique pruning schedules across different sparsity levels and training durations. We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss. These findings provide valuable insights for efficient and effective sparse pre-training of LLMs. Furthermore, we propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training. Through empirical and theoretical validation, we demonstrate that this modified scaling law accurately models evaluation loss for both sparsely and densely pre-trained LLMs, unifying scaling laws across pre-training paradigms. Our findings indicate that while sparse pre-training achieves the same final model quality as dense pre-training for equivalent compute budgets, it provides substantial benefits through reduced model size, enabling significant potential computational savings during inference.
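
The idea of substituting an average parameter count into a Chinchilla-style law can be illustrated with a small sketch. Everything below is an assumption made for illustration, not the paper's fitted law: the abstract does not give the functional form or coefficients, so the sketch uses the standard Chinchilla form L = E + A/N^alpha + B/D^beta with the coefficients reported by Hoffmann et al. (2022) as placeholders, a linear pruning ramp, and a schedule expressed in fractions of training steps rather than compute. The function names are hypothetical.

```python
def param_count_at(frac, n_dense, final_sparsity, start=0.25, end=0.75):
    """Parameter count at training fraction `frac` (0..1).

    Assumption: the model is dense until `start`, is pruned linearly
    between `start` and `end`, and stays at the final sparsity afterwards.
    """
    if frac <= start:
        return n_dense
    if frac >= end:
        return n_dense * (1.0 - final_sparsity)
    progress = (frac - start) / (end - start)
    return n_dense * (1.0 - final_sparsity * progress)


def average_param_count(n_dense, final_sparsity, start=0.25, end=0.75, steps=10_000):
    """Average parameter count over training, via the midpoint rule."""
    total = 0.0
    for i in range(steps):
        frac = (i + 0.5) / steps
        total += param_count_at(frac, n_dense, final_sparsity, start, end)
    return total / steps


def chinchilla_loss(n_params, n_tokens,
                    E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Chinchilla-form loss; coefficients are placeholders, not this paper's fit."""
    return E + A / n_params ** alpha + B / n_tokens ** beta


if __name__ == "__main__":
    n_dense = 1e9      # hypothetical dense parameter count
    n_tokens = 20e9    # hypothetical number of training tokens
    n_avg = average_param_count(n_dense, final_sparsity=0.5)
    print(f"average parameter count: {n_avg:.3e}")
    print(f"predicted loss at average count: {chinchilla_loss(n_avg, n_tokens):.4f}")
```

Under these assumptions, a run that ends at 50% sparsity has an average parameter count of 75% of the dense count, so its predicted loss falls between that of the dense model and that of a model trained at the final sparse size throughout.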

Comments:	17 pages
Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL)
Cite as:	arXiv:2501.12486 [cs.LG]
	(or arXiv:2501.12486v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2501.12486

Submission history

From: Tian Jin
[v1] Tue, 21 Jan 2025 20:23:22 UTC (521 KB)
