FinerWeb-10BT: Refining Web Data with LLM-Based Line-Level Filtering

Henriksson, Erik; Tarkka, Otto; Ginter, Filip

arXiv:2501.07314 (cs)

[Submitted on 13 Jan 2025]

Abstract: Data quality is crucial for training Large Language Models (LLMs). Traditional heuristic filters often miss low-quality text or mistakenly remove valuable content. In this paper, we introduce an LLM-based line-level filtering method to enhance training data quality. We use GPT-4o mini to label a 20,000-document sample from FineWeb at the line level, allowing the model to create descriptive labels for low-quality lines. These labels are grouped into nine main categories, and we train a DeBERTa-v3 classifier to scale the filtering to a 10B-token subset of FineWeb. To test the impact of our filtering, we train GPT-2 models on both the original and the filtered datasets. The results show that models trained on the filtered data achieve higher accuracy on the HellaSwag benchmark and reach their performance targets faster, even with up to 25% less data. This demonstrates that LLM-based line-level filtering can significantly improve data quality and training efficiency for LLMs. We release our quality-annotated dataset, FinerWeb-10BT, and the codebase to support further work in this area.
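The line-level filtering step described in the abstract can be pictured roughly as follows. This is a minimal sketch, not the authors' released code: the names `filter_document` and `toy_score`, the threshold values, and the example text are all hypothetical, and `toy_score` is a crude character-ratio heuristic standing in for the trained DeBERTa-v3 classifier, which the abstract does not specify in detail.

```python
from typing import Callable


def filter_document(
    text: str,
    score_line: Callable[[str], float],
    threshold: float = 0.5,
) -> str:
    """Keep only the lines whose quality score reaches the threshold.

    `score_line` is any callable mapping a line to a score in [0, 1];
    in a real pipeline it would wrap a trained line-quality classifier.
    """
    kept = [line for line in text.split("\n") if score_line(line) >= threshold]
    return "\n".join(kept)


def toy_score(line: str) -> float:
    """Hypothetical stand-in scorer: share of letters and spaces in the line.

    This is NOT the paper's classifier, only a placeholder so the
    sketch runs without any model or network access.
    """
    stripped = line.strip()
    if not stripped:
        return 0.0
    clean_chars = sum(ch.isalpha() or ch.isspace() for ch in stripped)
    return clean_chars / len(stripped)


# Assumed example document: two clean lines around one line of web debris.
doc = (
    "Data quality matters for language model training.\n"
    ">>> 404 | 2025 | ### <<<\n"
    "Line-level filtering keeps the clean lines."
)

cleaned = filter_document(doc, toy_score, threshold=0.8)
print(cleaned)
```

Filtering per line rather than per document means a page with a few lines of debris keeps its usable text instead of being dropped whole, which is the motivation the abstract gives for working at the line level.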

Comments:	11 pages, 4 figures, 4 tables. To be published in NoDaLiDa/Baltic-HLT 2025 proceedings
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2501.07314 [cs.CL]
	(or arXiv:2501.07314v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2501.07314

Submission history

From: Otto Tarkka
[v1] Mon, 13 Jan 2025 13:26:50 UTC (192 KB)
