TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models

Shing, Makoto; Misaki, Kou; Bao, Han; Yokoi, Sho; Akiba, Takuya

arXiv:2501.16937 (cs)

[Submitted on 28 Jan 2025 (v1), last revised 29 Jan 2025 (this version, v2)]

Abstract: Causal language models have demonstrated remarkable capabilities, but their size poses significant challenges for deployment in resource-constrained environments. Knowledge distillation, a widely used technique for transferring knowledge from a large teacher model to a small student model, presents a promising approach for model compression. A significant remaining issue lies in the major differences between teacher and student models, namely the substantial capacity gap, mode averaging, and mode collapse, which pose barriers during distillation. To address these issues, we introduce $\textit{Temporally Adaptive Interpolated Distillation (TAID)}$, a novel knowledge distillation approach that dynamically interpolates student and teacher distributions through an adaptive intermediate distribution, gradually shifting from the student's initial distribution towards the teacher's distribution. We provide a theoretical analysis demonstrating TAID's ability to prevent mode collapse and empirically show its effectiveness in addressing the capacity gap while balancing mode averaging and mode collapse. Our comprehensive experiments demonstrate TAID's superior performance across various model sizes and architectures in both instruction tuning and pre-training scenarios. Furthermore, we showcase TAID's practical impact by developing two state-of-the-art compact foundation models: $\texttt{TAID-LLM-1.5B}$ for language tasks and $\texttt{TAID-VLM-2B}$ for vision-language tasks. These results demonstrate TAID's effectiveness in creating high-performing and efficient models, advancing the development of more accessible AI technologies.
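The interpolation idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the intermediate distribution is formed or how the interpolation weight adapts, so the code below assumes a blend of student and teacher logits and substitutes a plain linear schedule for the adaptive rule. All function names (`interpolate`, `linear_schedule`, and so on) and the toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def interpolate(student_logits, teacher_logits, t):
    """Intermediate target for weight t in [0, 1].

    Assumption: the blend is taken over logits. t = 0 gives the
    student's own distribution, t = 1 gives the teacher's.
    """
    mixed = [(1.0 - t) * s + t * p
             for s, p in zip(student_logits, teacher_logits)]
    return softmax(mixed)

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def linear_schedule(step, total_steps, t_start=0.0, t_end=1.0):
    """Stand-in for the adaptive update, which the abstract leaves unspecified."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return t_start + (t_end - t_start) * frac

# Toy next-token logits over a three-word vocabulary (made-up values).
student_logits = [2.0, 0.5, -1.0]
teacher_logits = [0.0, 3.0, 1.0]
student_probs = softmax(student_logits)

# Distance from the (frozen) student to the moving target as t goes 0 -> 1.
losses = []
for step in range(5):
    t = linear_schedule(step, 4)
    target = interpolate(student_logits, teacher_logits, t)
    losses.append(kl_divergence(target, student_probs))
```

With the student held fixed, the target starts at the student's own distribution, so the first loss is zero, and it moves steadily away as t grows. In real training the student would be updated at every step to follow the moving target, which is what keeps the gap it has to close at any one time small.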

Comments:	To appear at the 13th International Conference on Learning Representations (ICLR 2025)
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2501.16937 [cs.LG]
	(or arXiv:2501.16937v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2501.16937

Submission history

From: Makoto Shing
[v1] Tue, 28 Jan 2025 13:31:18 UTC (415 KB)
[v2] Wed, 29 Jan 2025 05:51:25 UTC (415 KB)
