Less is More: Task-aware Layer-wise Distillation for Language Model Compression

Liang, Chen; Zuo, Simiao; Zhang, Qingru; He, Pengcheng; Chen, Weizhu; Zhao, Tuo

arXiv:2210.01351 (cs)

[Submitted on 4 Oct 2022 (v1), last revised 5 Jun 2023 (this version, v3)]

Abstract: Layer-wise distillation is a powerful tool to compress large models (i.e., teacher models) into small ones (i.e., student models). The student distills knowledge from the teacher by mimicking the hidden representations of the teacher at every intermediate layer. However, layer-wise distillation is difficult. Since the student has a smaller model capacity than the teacher, it is often under-fitted. Furthermore, the hidden representations of the teacher contain redundant information that the student does not necessarily need to learn the target task. To address these challenges, we propose a novel Task-aware layEr-wise Distillation (TED). TED designs task-aware filters to align the hidden representations of the student and the teacher at each layer. The filters select the knowledge that is useful for the target task from the hidden representations. As such, TED reduces the knowledge gap between the two models and helps the student fit the target task better. We evaluate TED in two scenarios: continual pre-training and fine-tuning. TED demonstrates significant and consistent improvements over existing distillation methods in both scenarios. Code is available at this https URL.
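
The alignment step described in the abstract can be illustrated with a minimal sketch. It assumes that each filter is a plain linear map, that the per-layer discrepancy is a mean squared error, and that student and teacher layers have already been paired one to one; none of these choices is stated in the abstract. All function and variable names are hypothetical, and the procedure that makes the filters task-aware is not shown.

```python
def matvec(W, h):
    """Apply a linear filter W (a list of rows) to a hidden vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def layerwise_filtered_loss(student_hiddens, teacher_hiddens,
                            student_filters, teacher_filters):
    """Sum, over paired layers, of the MSE between the filtered student
    state and the filtered teacher state.

    Assumption: the filters are already trained and fixed; only the
    alignment loss is computed here.
    """
    n_layers = len(student_hiddens)
    if not (len(teacher_hiddens) == len(student_filters)
            == len(teacher_filters) == n_layers):
        raise ValueError("expected one hidden state and one filter per paired layer")
    total = 0.0
    for h_s, h_t, g_s, g_t in zip(student_hiddens, teacher_hiddens,
                                  student_filters, teacher_filters):
        total += mse(matvec(g_s, h_s), matvec(g_t, h_t))
    return total


# One paired layer: a 4-dimensional teacher state and a 2-dimensional student state.
teacher_hiddens = [[1.0, 2.0, 3.0, 4.0]]
student_hiddens = [[0.0, 0.0]]

# Hypothetical teacher filter that keeps only the first two coordinates,
# standing in for "select the knowledge that is useful for the target task".
teacher_filters = [[[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0]]]
# Hypothetical student filter: the identity map.
student_filters = [[[1.0, 0.0],
                    [0.0, 1.0]]]

loss = layerwise_filtered_loss(student_hiddens, teacher_hiddens,
                               student_filters, teacher_filters)
print(loss)  # 2.5
```

In this toy setting the student is penalized only for the two coordinates the teacher filter keeps, so the discarded teacher coordinates (3.0 and 4.0) never contribute to the loss. Because both filters map into the same 2-dimensional space, the student and teacher hidden sizes do not need to match.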

Comments:	Proceedings of ICML 2023
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2210.01351 [cs.CL]
	(or arXiv:2210.01351v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2210.01351

Submission history

From: Chen Liang
[v1] Tue, 4 Oct 2022 03:36:53 UTC (682 KB)
[v2] Wed, 5 Oct 2022 14:48:14 UTC (682 KB)
[v3] Mon, 5 Jun 2023 22:40:20 UTC (690 KB)
