Detecting Label Errors by using Pre-Trained Language Models

Chong, Derek; Hong, Jenny; Manning, Christopher D.

arXiv:2205.12702 (cs)

[Submitted on 25 May 2022 (v1), last revised 15 Dec 2022 (this version, v3)]

Abstract: We show that large pre-trained language models are inherently highly capable of identifying label errors in natural language datasets: simply examining out-of-sample data points in descending order of fine-tuned task loss significantly outperforms more complex error-detection mechanisms proposed in previous work.
To this end, we contribute a novel method for introducing realistic, human-originated label noise into existing crowdsourced datasets such as SNLI and TweetNLP. We show that this noise has similar properties to real, hand-verified label errors, and is harder to detect than existing synthetic noise, creating challenges for model robustness. We argue that human-originated noise is a better standard for evaluation than synthetic noise.
Finally, we use crowdsourced verification to evaluate the detection of real errors on IMDB, Amazon Reviews, and Recon, and confirm that pre-trained models perform at a 9-36% higher absolute Area Under the Precision-Recall Curve than existing models.
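
The ranking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that out-of-sample class probabilities from some fine-tuned model are already available (here a small hand-written array), and it uses average precision as one common estimator of the area under the precision-recall curve; the paper's exact computation may differ. The function names and the toy data are hypothetical.

```python
import numpy as np

def per_example_loss(probs, labels):
    """Cross-entropy of each example's *given* label under the model's
    out-of-sample predicted probabilities (assumed to be precomputed)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    eps = 1e-12  # guard against log(0)
    return -np.log(probs[np.arange(len(labels)), labels] + eps)

def rank_by_loss(probs, labels):
    """Indices of examples in descending order of loss: the examples a
    reviewer would inspect first as candidate label errors."""
    losses = per_example_loss(probs, labels)
    return np.argsort(-losses, kind="stable")

def average_precision(ranking, is_error):
    """Average precision of a ranking against verified error flags,
    used here as an estimate of the area under the precision-recall curve."""
    flags = np.asarray(is_error, dtype=bool)[ranking]
    if not flags.any():
        raise ValueError("need at least one verified error")
    hits = np.cumsum(flags)
    precision_at_k = hits / np.arange(1, len(flags) + 1)
    return float(precision_at_k[flags].sum() / flags.sum())

# Hypothetical toy data: 4 examples, 2 classes.
probs = [[0.90, 0.10],
         [0.20, 0.80],
         [0.05, 0.95],   # model strongly prefers class 1 ...
         [0.60, 0.40]]
labels = [0, 1, 0, 1]    # ... but the given label of example 2 is class 0
is_error = [False, False, True, False]  # hand-verified flags (assumed)

ranking = rank_by_loss(probs, labels)
score = average_precision(ranking, is_error)
```

In this toy case the example whose given label the model finds least plausible is ranked first. On real data the probabilities would come from a model fine-tuned on folds that exclude the example being scored, so that the loss is genuinely out-of-sample.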

Comments:	18 pages, 10 figures. Accepted to EMNLP 2022; typesetting of this version slightly differs from conference version
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2205.12702 [cs.CL]
	(or arXiv:2205.12702v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2205.12702

Submission history

From: Derek Chong
[v1] Wed, 25 May 2022 11:59:39 UTC (786 KB)
[v2] Tue, 11 Oct 2022 20:47:06 UTC (2,423 KB)
[v3] Thu, 15 Dec 2022 16:01:49 UTC (2,221 KB)
