Data-Juicer: A One-Stop Data Processing System for Large Language Models

Chen, Daoyuan; Huang, Yilun; Ma, Zhijian; Chen, Hesen; Pan, Xuchen; Ge, Ce; Gao, Dawei; Xie, Yuexiang; Liu, Zhaoyang; Gao, Jinyang; Li, Yaliang; Ding, Bolin; Zhou, Jingren

arXiv:2309.02033v1 (cs)

[Submitted on 5 Sep 2023 (this version), latest version 20 Dec 2023 (v3)]

Abstract: The rapid evolution of Large Language Models (LLMs) has underscored the importance of massive, diverse, and high-quality data. Despite this, existing open-source tools for LLM data processing remain limited and are mostly tailored to specific datasets, emphasizing the reproducibility of released data over adaptability and usability, which inhibits potential applications. In response, we propose Data-Juicer, a one-stop LLM data processing system that is powerful yet flexible and user-friendly. Our system offers over 50 built-in versatile operators and pluggable tools, which combine modularity, composability, and extensibility to serve diverse LLM data processing needs. By incorporating visualized and automatic evaluation capabilities, Data-Juicer enables a timely feedback loop that accelerates data processing and yields data insights. To enhance usability, Data-Juicer provides out-of-the-box components for users with various backgrounds, along with abundant data recipes for LLM pre-training and post-tuning. Further, we employ multi-facet system optimization and seamlessly integrate Data-Juicer with both LLM and distributed computing ecosystems to enable efficient and scalable data processing. Empirical validation of the generated data recipes reveals considerable improvements in LLaMA performance across various pre-training and post-tuning cases, with up to 7.45% relative improvement in average score across 16 LLM benchmarks and a 16.25% higher win rate under pair-wise GPT-4 evaluation. The system's efficiency and scalability are also validated: up to 88.7% reduction in single-machine processing time, 77.1% and 73.1% less memory and CPU usage respectively, and 7.91x processing acceleration when utilizing distributed computing ecosystems. Our system, data recipes, and multiple tutorial demos are released, calling for broader research centered on LLM data.
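
To make the abstract's notion of modular, composable operators concrete, the following is a minimal sketch in plain Python. It is not the Data-Juicer API: every name here (`as_mapper`, `as_filter`, `normalize_whitespace`, `min_words`, `dedup_exact`, `run_recipe`) is hypothetical, and the split into per-sample edits, keep/drop predicates, and a deduplication step is an illustrative assumption. A "recipe" is assumed to be simply an ordered list of operators applied to a list of text samples.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical types for this sketch; not taken from Data-Juicer.
Sample = Dict[str, str]                         # one record, e.g. {"text": "..."}
Operator = Callable[[List[Sample]], List[Sample]]


def as_mapper(fn: Callable[[Sample], Sample]) -> Operator:
    """Lift a per-sample edit into a dataset-level operator."""
    return lambda samples: [fn(s) for s in samples]


def as_filter(keep: Callable[[Sample], bool]) -> Operator:
    """Lift a per-sample keep/drop predicate into a dataset-level operator."""
    return lambda samples: [s for s in samples if keep(s)]


def normalize_whitespace(sample: Sample) -> Sample:
    """Collapse runs of whitespace in the text field into single spaces."""
    return {**sample, "text": " ".join(sample["text"].split())}


def min_words(n: int) -> Callable[[Sample], bool]:
    """Build a predicate that keeps samples with at least n words."""
    return lambda sample: len(sample["text"].split()) >= n


def dedup_exact(samples: List[Sample]) -> List[Sample]:
    """Drop samples whose text exactly matches an earlier sample."""
    seen = set()
    unique = []
    for s in samples:
        if s["text"] not in seen:
            seen.add(s["text"])
            unique.append(s)
    return unique


def run_recipe(samples: Iterable[Sample], recipe: List[Operator]) -> List[Sample]:
    """Apply each operator in order; the output of one feeds the next."""
    data = list(samples)
    for op in recipe:
        data = op(data)
    return data


# A recipe is just an ordered list, so operators can be added, removed,
# or reordered without touching the others.
recipe = [as_mapper(normalize_whitespace), as_filter(min_words(2)), dedup_exact]
raw = [
    {"text": "hello   world  again"},
    {"text": "hello world again"},
    {"text": "hi"},
]
cleaned = run_recipe(raw, recipe)
print(cleaned)
```

Because every operator shares one signature in this sketch, extending the pipeline means writing one more function and appending it to the list; order matters, since normalizing whitespace before deduplication lets the first two samples be recognized as identical.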

Comments: Under continuous maintenance and updating; the system, refined data recipes, and demos are at this https URL
Subjects: Machine Learning (cs.LG); Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as: arXiv:2309.02033 [cs.LG] (or arXiv:2309.02033v1 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2309.02033

Submission history

From: Yaliang Li
[v1] Tue, 5 Sep 2023 08:22:07 UTC (1,815 KB)
[v2] Sun, 8 Oct 2023 14:28:58 UTC (1,830 KB)
[v3] Wed, 20 Dec 2023 08:27:40 UTC (2,327 KB)
