AutoPureData: Automated Filtering of Web Data for LLM Fine-tuning

Vadlapati, Praneeth


arXiv:2406.19271 (cs)

[Submitted on 27 Jun 2024]



Abstract: Up-to-date and reliable Large Language Models (LLMs) are consistently sought after. Typically, LLMs are trained on a fixed dataset and then deployed. However, the training data continually becomes outdated. Enabling automatic training of AI using web data involves significant concerns regarding data quality and safety due to bias, spam, and other unsafe or unwanted text. Pure data is essential for producing reliable models. Training a model on impure data may result in undesirable outcomes. This research proposes a system that collects web data and automatically filters out unwanted text with the assistance of existing trusted AI models. In the experiment, a small sample of web data was collected and filtered, demonstrating the system's effectiveness in purifying the data.
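The collect-then-filter flow described in the abstract can be illustrated with a minimal sketch. This is not the author's implementation: the names `Verdict`, `placeholder_classifier`, and `filter_rows` are hypothetical, and the keyword check is only a runnable stand-in for the trusted AI model that would make the keep-or-reject decision in a real system.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Verdict:
    keep: bool    # True if the text may enter the fine-tuning set
    reason: str   # short label explaining the decision


def placeholder_classifier(text: str) -> Verdict:
    """Hypothetical stand-in for a trusted AI model.

    A real system would query a moderation or instruction-tuned model here.
    This keyword check exists only so the sketch runs without external services.
    """
    blocked_markers = {"buy now": "spam", "click here": "spam"}
    lowered = text.lower()
    for marker, label in blocked_markers.items():
        if marker in lowered:
            return Verdict(keep=False, reason=label)
    if not lowered.strip():
        return Verdict(keep=False, reason="empty")
    return Verdict(keep=True, reason="ok")


def filter_rows(
    rows: Iterable[str],
    classify: Callable[[str], Verdict],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Split collected web text into kept rows and (row, reason) rejections."""
    kept: List[str] = []
    rejected: List[Tuple[str, str]] = []
    for row in rows:
        verdict = classify(row)
        if verdict.keep:
            kept.append(row)
        else:
            rejected.append((row, verdict.reason))
    return kept, rejected


if __name__ == "__main__":
    sample = ["The Moon orbits Earth.", "Click here to BUY NOW!!!", "   "]
    kept_rows, rejected_rows = filter_rows(sample, placeholder_classifier)
    print("kept:", kept_rows)
    print("rejected:", rejected_rows)
```

Passing the classifier in as a callable keeps the filtering loop independent of whichever trusted model is used, and recording a reason for each rejection leaves an audit trail for reviewing what was removed.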

Comments:	Initial version
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2406.19271 [cs.CL]
	(or arXiv:2406.19271v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2406.19271

Submission history

From: Praneeth Vadlapati
[v1] Thu, 27 Jun 2024 15:37:57 UTC (7 KB)

