TinyBERT: Distilling BERT for Natural Language Understanding

Jiao, Xiaoqi; Yin, Yichun; Shang, Lifeng; Jiang, Xin; Chen, Xiao; Li, Linlin; Wang, Fang; Liu, Qun

arXiv:1909.10351v3 (cs)

[Submitted on 23 Sep 2019 (v1), revised 3 Dec 2019 (this version, v3), latest version 16 Oct 2020 (v5)]

Abstract: Language model pre-training, as exemplified by BERT, has significantly improved performance on many natural language processing tasks. However, pre-trained language models are usually computationally expensive and memory intensive, which makes them difficult to run efficiently on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel transformer distillation method, a knowledge distillation (KD) method designed specifically for transformer-based models. With this new KD method, the rich knowledge encoded in a large teacher BERT can be transferred effectively to a small student TinyBERT. Moreover, we introduce a new two-stage learning framework for TinyBERT, which performs transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT captures both the general-domain and the task-specific knowledge of the teacher BERT. TinyBERT is empirically effective and achieves results comparable to BERT on the GLUE datasets while being 7.5x smaller and 9.4x faster at inference. TinyBERT also significantly outperforms state-of-the-art baselines, even with only about 28% of their parameters and 31% of their inference time.
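
The abstract names knowledge distillation from a teacher to a student but gives no formulas, so the following is only a minimal sketch of a generic distillation objective, not the paper's exact method. It assumes a soft cross-entropy between teacher and student output distributions plus a mean-squared-error term on one intermediate representation. The function names, the `temperature` and `hidden_weight` parameters, and the assumption that the student's hidden vector has already been projected to the teacher's size are all illustrative choices, not taken from the abstract.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - peak - log_total for x in scaled]


def soft_cross_entropy(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy of the student's predictions against teacher soft targets."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_log_probs = log_softmax(student_logits, temperature)
    return -sum(t * s for t, s in zip(teacher_probs, student_log_probs))


def mse(student_vec, teacher_vec):
    """Mean squared error between two equal-length vectors."""
    if len(student_vec) != len(teacher_vec):
        raise ValueError("vectors must have the same length")
    n = len(student_vec)
    return sum((s - t) ** 2 for s, t in zip(student_vec, teacher_vec)) / n


def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      temperature=1.0, hidden_weight=1.0):
    """Illustrative combined loss: output-level term plus one hidden-level term.

    Assumption: student_hidden is already mapped to the teacher's dimension.
    """
    prediction_term = soft_cross_entropy(student_logits, teacher_logits, temperature)
    hidden_term = mse(student_hidden, teacher_hidden)
    return prediction_term + hidden_weight * hidden_term
```

In this sketch the loss is smallest when the student reproduces both the teacher's output distribution and its hidden vector. Under the two-stage framework the abstract describes, a loss of this kind would be minimized once on general-domain data and again on task-specific data.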

Comments:	code: this https URL; 13 pages, 2 figures, 9 tables
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:1909.10351 [cs.CL]
	(or arXiv:1909.10351v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.1909.10351

Submission history

From: Yichun Yin
[v1] Mon, 23 Sep 2019 13:05:35 UTC (1,272 KB)
[v2] Tue, 24 Sep 2019 12:39:36 UTC (1,274 KB)
[v3] Tue, 3 Dec 2019 01:29:39 UTC (3,110 KB)
[v4] Wed, 4 Dec 2019 01:50:34 UTC (3,110 KB)
[v5] Fri, 16 Oct 2020 02:12:46 UTC (875 KB)
