Leveraging Web-Crawled Data for High-Quality Fine-Tuning

Zhou, Jing; Jiang, Chenglin; Shen, Wei; Zhou, Xiao; He, Xiaonan

arXiv:2408.08003 (cs)

[Submitted on 15 Aug 2024]

Abstract: Most large language models are fine-tuned using either expensive human-annotated data or GPT-4-generated data, which cannot guarantee performance in certain domains. We argue that although web-crawled data often has formatting errors that cause semantic inaccuracies, it can still serve as a valuable source for high-quality supervised fine-tuning in specific domains without relying on advanced models like GPT-4. To this end, we automatically create a paired training dataset by aligning web-crawled data with a smaller set of high-quality data. By training a language model on this dataset, we can convert irregularly formatted web data into high-quality data. Our experiments show that training with the model-transformed data yields better results, surpassing training with only high-quality data by an average score of 9.4% on Chinese math problems. Additionally, our 7B model outperforms several open-source models larger than 32B and surpasses well-known closed-source models such as GPT-3.5, highlighting the efficacy of our approach.

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2408.08003 [cs.CL]
	(or arXiv:2408.08003v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2408.08003

Submission history

From: Jing Zhou
[v1] Thu, 15 Aug 2024 08:12:52 UTC (412 KB)
