Warmup-Distill: Bridge the Distribution Mismatch between Teacher and Student before Knowledge Distillation

Sun, Zengkui; Liu, Yijin; Meng, Fandong; Chen, Yufeng; Xu, Jinan; Zhou, Jie

Abstract: The widespread deployment of Large Language Models (LLMs) is hindered by their high computational demands, making knowledge distillation (KD) crucial for developing compact, smaller models. However, conventional KD methods suffer from a distribution mismatch between the teacher and student models, which leads to poor distillation performance. For instance, the widely used KL-based methods suffer from mode-averaging and mode-collapsing problems because of the mismatched probability distributions of the two models. Previous studies mainly address this issue by using different distance measures between the distributions of the two models. Unfortunately, the distribution mismatch still exists in the early stage of distillation. Hence, to reduce the impact of distribution mismatch, we propose a simple yet efficient method, named Warmup-Distill, which aligns the distribution of the student to that of the teacher in advance of distillation. Specifically, we first detect the distribution of the student model in practical scenarios with its internal knowledge, and then modify the knowledge with low probability, using the teacher as the checker. Consequently, Warmup-Distill aligns the student's internal knowledge to that of the teacher, which expands the distribution of the student with the teacher's and helps the student model learn better in the subsequent distillation. Experiments on seven benchmarks demonstrate that Warmup-Distill provides a warmed-up student more suitable for distillation, which outperforms the vanilla student by at least +0.4 average score across all benchmarks. Notably, with the assistance of Warmup-Distill, distillation on the math task yields a further improvement of up to +1.9% accuracy.
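As a toy illustration of the mode-averaging and mode-collapsing behaviour that the abstract attributes to KL-based objectives, the sketch below compares forward KL, KL(teacher || student), with reverse KL, KL(student || teacher), on a five-token vocabulary. The distributions and the names `teacher`, `student_spread`, and `student_single` are illustrative assumptions and do not come from the paper. The sketch is not the Warmup-Distill procedure. It only shows why a student that cannot match a bimodal teacher is pushed towards different compromises by the two directions.

```python
import math


def kl(a, b):
    """KL(a || b) in nats for two discrete distributions on the same support."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)


# Hypothetical bimodal teacher: almost all mass on the first and last token.
teacher = [0.49, 0.005, 0.01, 0.005, 0.49]

# Two hypothetical limited-capacity students (assumed for illustration only).
student_spread = [0.2, 0.2, 0.2, 0.2, 0.2]       # covers both modes, wastes mass between them
student_single = [0.96, 0.01, 0.01, 0.01, 0.01]  # commits to one mode, ignores the other

students = {"spread": student_spread, "single": student_single}

# Forward KL penalises the student for missing teacher mass (mode-averaging pressure).
forward = {name: kl(teacher, q) for name, q in students.items()}
# Reverse KL penalises the student for placing mass where the teacher has little
# (mode-collapsing pressure).
reverse = {name: kl(q, teacher) for name, q in students.items()}

for name in students:
    print(f"{name:>6}: forward KL = {forward[name]:.3f}, reverse KL = {reverse[name]:.3f}")
```

Under these assumed numbers, forward KL scores the spread-out student as closer to the teacher, while reverse KL scores the single-mode student as closer. That is the usual sense in which the two directions are called mode-averaging and mode-collapsing.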

Comments:	11 Pages, 4 figures, Code at this https URL
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2502.11766 [cs.CL]
	(or arXiv:2502.11766v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2502.11766

