Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction

Jiang, Yuxin; Wang, Yufei; Wu, Chuhan; Dai, Xinyi; Xu, Yan; Gan, Weinan; Wang, Yasheng; Jiang, Xin; Shang, Lifeng; Tang, Ruiming; Wang, Wei

arXiv:2504.15573 (cs)

[Submitted on 22 Apr 2025]

Abstract: The improvement of LLMs' instruction-following capabilities depends critically on the availability of high-quality instruction-response pairs. While existing automatic data synthesis methods alleviate the burden of manual curation, they often rely heavily on either the quality of seed data or strong assumptions about the structure and content of web documents. To tackle these challenges, we propose Web Reconstruction (WebR), a fully automated framework for synthesizing high-quality instruction-tuning (IT) data directly from raw web documents with minimal assumptions. Leveraging the inherent diversity of raw web content, we conceptualize web reconstruction as an instruction-tuning data synthesis task via a novel dual-perspective paradigm, Web as Instruction and Web as Response, in which each web document is designated as either an instruction or a response to trigger the reconstruction process. Comprehensive experiments show that datasets generated by WebR outperform state-of-the-art baselines by up to 16.65% across four instruction-following benchmarks. Notably, WebR demonstrates superior compatibility, data efficiency, and scalability, enabling enhanced domain adaptation with minimal effort. The data and code are publicly available at this https URL.

Comments:	15 pages, 11 figures, 9 tables
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2504.15573 [cs.CL]
	(or arXiv:2504.15573v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2504.15573

Submission history

From: Yuxin Jiang
[v1] Tue, 22 Apr 2025 04:07:13 UTC (1,478 KB)
