Self-training Improves Pre-training for Natural Language Understanding

Du, Jingfei; Grave, Edouard; Gunel, Beliz; Chaudhary, Vishrav; Celebi, Onur; Auli, Michael; Stoyanov, Ves; Conneau, Alexis

arXiv:2010.02194 (cs)

[Submitted on 5 Oct 2020]

Abstract: Unsupervised pre-training has led to much recent progress in natural language understanding. In this paper, we study self-training as another way to leverage unlabeled data through semi-supervised learning. To obtain additional data for a specific task, we introduce SentAugment, a data augmentation method which computes task-specific query embeddings from labeled data to retrieve sentences from a bank of billions of unlabeled sentences crawled from the web. Unlike previous semi-supervised methods, our approach does not require in-domain unlabeled data and is therefore more generally applicable. Experiments show that self-training is complementary to strong RoBERTa baselines on a variety of tasks. Our augmentation approach leads to scalable and effective self-training with improvements of up to 2.6% on standard text classification benchmarks. Finally, we also show strong gains on knowledge-distillation and few-shot learning.
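
The retrieve-then-pseudo-label idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It rests on several assumptions: the sentence encoder is replaced by a toy bag-of-words vector, the task query is taken to be the mean embedding of all labeled sentences (one possible choice), and the teacher is a nearest-centroid classifier standing in for a fine-tuned RoBERTa model. All function names (`embed`, `retrieve`, `pseudo_label`) and the example sentences are hypothetical.

```python
import math
from collections import Counter


def tokenize(sentence):
    return sentence.lower().split()


def embed(sentence, vocab):
    # Toy unit-normalised bag-of-words vector (assumption: stands in for a
    # learned sentence encoder).
    counts = Counter(tokenize(sentence))
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def dot(u, v):
    # For unit-normalised vectors the dot product equals cosine similarity.
    return sum(a * b for a, b in zip(u, v))


def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def retrieve(query_vec, bank, bank_vecs, k):
    # Rank the unlabeled bank by similarity to the task query, keep the top k.
    ranked = sorted(range(len(bank)),
                    key=lambda i: dot(query_vec, bank_vecs[i]),
                    reverse=True)
    return [bank[i] for i in ranked[:k]]


def pseudo_label(labeled, bank, k):
    vocab = sorted({w for s, _ in labeled for w in tokenize(s)}
                   | {w for s in bank for w in tokenize(s)})
    labeled_vecs = [(embed(s, vocab), y) for s, y in labeled]
    bank_vecs = [embed(s, vocab) for s in bank]

    # Task-specific query embedding: mean of all labeled embeddings (assumption).
    query_vec = mean_vector([v for v, _ in labeled_vecs])
    retrieved = retrieve(query_vec, bank, bank_vecs, k)

    # Teacher stand-in: assign the label whose centroid is most similar.
    by_label = {}
    for v, y in labeled_vecs:
        by_label.setdefault(y, []).append(v)
    centroids = {y: mean_vector(vs) for y, vs in by_label.items()}

    return {s: max(centroids, key=lambda y: dot(centroids[y], embed(s, vocab)))
            for s in retrieved}


labeled = [("great movie loved it", "pos"),
           ("terrible movie hated it", "neg")]
bank = ["loved this great film",
        "hated this terrible film",
        "the stock market fell today",
        "it was a great great movie"]

# Retrieved sentences with teacher-assigned labels.
augmented = pseudo_label(labeled, bank, k=3)
```

In a self-training loop, a student model would then be trained on the teacher-labeled sentences in `augmented`. Only the retrieval and pseudo-labeling steps are sketched here.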

Comments:	8 pages
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2010.02194 [cs.CL]
	(or arXiv:2010.02194v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2010.02194

Submission history

From: Alexis Conneau
[v1] Mon, 5 Oct 2020 17:52:25 UTC (2,782 KB)
