Compact Language Models via Pruning and Knowledge Distillation

Muralidharan, Saurav; Sreenivas, Sharath Turuvekere; Joshi, Raviraj; Chochowski, Marcin; Patwary, Mostofa; Shoeybi, Mohammad; Catanzaro, Bryan; Kautz, Jan; Molchanov, Pavlo

arXiv:2407.14679 (cs)

[Submitted on 19 Jul 2024 (v1), last revised 4 Nov 2024 (this version, v2)]

Abstract: Large language models (LLMs) targeting different deployment scales and sizes are currently produced by training each variant from scratch; this is extremely compute-intensive. In this paper, we investigate whether pruning an existing LLM and then re-training it with a fraction (<3%) of the original training data can be a suitable alternative to repeated, full retraining. To this end, we develop a set of practical and effective compression best practices for LLMs that combine depth, width, attention and MLP pruning with knowledge distillation-based retraining; we arrive at these best practices through a detailed empirical exploration of pruning strategies for each axis, methods to combine axes, distillation strategies, and search techniques for arriving at optimal compressed architectures. We use this guide to compress the Nemotron-4 family of LLMs by a factor of 2-4x, and compare the performance of the compressed models to similarly-sized models on a variety of language modeling tasks. Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch; this results in compute cost savings of 1.8x for training the full model family (15B, 8B, and 4B). The resulting Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch, perform comparably to other community models such as Mistral 7B, Gemma 7B and Llama-3 8B, and outperform state-of-the-art compression techniques from the literature. We have open-sourced Minitron model weights on Hugging Face, with corresponding supplementary material including example code available on GitHub.
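
The abstract does not spell out the distillation objective, so the following is only a minimal sketch of the general idea behind knowledge distillation-based retraining: the pruned student is trained to match the teacher's output distribution. It assumes a forward KL divergence on per-token logits with an optional temperature; the function names, the temperature parameter, and the toy logits are illustrative assumptions and are not taken from the paper.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum_exp for x in scaled]


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Forward KL(teacher || student) for a single token position.

    Assumption: this particular loss form is a common choice for logit
    distillation, not necessarily the one used by the authors.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher log-probs
    log_q = log_softmax(student_logits, temperature)  # student log-probs
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Toy logits over a three-token vocabulary (hypothetical values).
teacher = [2.0, 0.5, -1.0]
student = [1.5, 0.7, -0.5]

loss = distillation_loss(teacher, student)
same = distillation_loss(teacher, teacher)
```

In this sketch the loss is zero when the student reproduces the teacher's distribution exactly and positive otherwise, so minimizing it over the retraining data pulls the pruned model's predictions toward those of the original model.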

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2407.14679 [cs.CL]
	(or arXiv:2407.14679v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2407.14679

Submission history

From: Saurav Muralidharan
[v1] Fri, 19 Jul 2024 21:47:57 UTC (2,126 KB)
[v2] Mon, 4 Nov 2024 17:36:38 UTC (1,187 KB)
