An Empirical Exploration in Quality Filtering of Text Data

Gao, Leo

arXiv:2109.00698 (cs)

[Submitted on 2 Sep 2021 (v1), last revised 6 Oct 2021 (this version, v2)]

Abstract: While conventional wisdom suggests that more aggressively filtering data from low-quality sources like Common Crawl always monotonically improves the quality of training data, we find that aggressive filtering can in fact lead to a decrease in model quality on a wide array of downstream tasks for a GPT-like language model. We speculate that this is because optimizing sufficiently strongly for a proxy metric harms performance on the true objective, suggesting a need for more robust filtering objectives when attempting to filter more aggressively. We hope this work leads to detailed analysis of the effects of dataset filtering design choices on downstream model performance in future work.

Comments:	corrected typo in citation
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2109.00698 [cs.CL]
	(or arXiv:2109.00698v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2109.00698

Submission history

From: Leo Gao
[v1] Thu, 2 Sep 2021 04:02:51 UTC (6,141 KB)
[v2] Wed, 6 Oct 2021 23:28:47 UTC (6,141 KB)
