Scaling Laws of Synthetic Data for Language Models

Qin, Zeyu; Dong, Qingxiu; Zhang, Xingxing; Dong, Li; Huang, Xiaolong; Yang, Ziyi; Khademi, Mahmoud; Zhang, Dongdong; Awadalla, Hany Hassan; Fung, Yi R.; Chen, Weizhu; Cheng, Minhao; Wei, Furu

arXiv:2503.19551 (cs)

[Submitted on 25 Mar 2025 (v1), last revised 26 Mar 2025 (this version, v2)]

Abstract: Large language models (LLMs) achieve strong performance across diverse tasks, largely driven by high-quality web data used in pre-training. However, recent studies indicate this data source is rapidly depleting. Synthetic data emerges as a promising alternative, but it remains unclear whether synthetic datasets exhibit predictable scalability comparable to raw pre-training data. In this work, we systematically investigate the scaling laws of synthetic data by introducing SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm. Key findings from our extensive mathematical experiments on SynthLLM include: (1) SynthLLM generates synthetic data that reliably adheres to the rectified scaling law across various model sizes; (2) Performance improvements plateau near 300B tokens; and (3) Larger models approach optimal performance with fewer training tokens. For instance, an 8B model peaks at 1T tokens, while a 3B model requires 4T. Moreover, comparisons with existing synthetic data generation and augmentation methods demonstrate that SynthLLM achieves superior performance and scalability. Our findings highlight synthetic data as a scalable and reliable alternative to organic pre-training corpora, offering a viable path toward continued improvement in model performance.

Comments:	work in progress
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2503.19551 [cs.CL]
	(or arXiv:2503.19551v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2503.19551

Submission history

From: Xingxing Zhang
[v1] Tue, 25 Mar 2025 11:07:12 UTC (1,831 KB)
[v2] Wed, 26 Mar 2025 11:23:44 UTC (1,984 KB)
