DataDecide: How to Predict Best Pretraining Data with Small Experiments

Magnusson, Ian; Tai, Nguyen; Bogin, Ben; Heineman, David; Hwang, Jena D.; Soldaini, Luca; Bhagia, Akshita; Liu, Jiacheng; Groeneveld, Dirk; Tafjord, Oyvind; Smith, Noah A.; Koh, Pang Wei; Dodge, Jesse


arXiv:2504.11393 (cs)

[Submitted on 15 Apr 2025]



Abstract: Because large language models are expensive to pretrain on different datasets, using smaller-scale experiments to decide on data is crucial for reducing costs. Which benchmarks and methods of making decisions from observed performance at small scale most accurately predict the datasets that yield the best large models? To empower open exploration of this question, we release models, data, and evaluations in DataDecide -- the most extensive open suite of models over differences in data and scale. We conduct controlled pretraining experiments across 25 corpora with differing sources, deduplication, and filtering up to 100B tokens, model sizes up to 1B parameters, and 3 random seeds. We find that the ranking of models at a single, small size (e.g., 150M parameters) is a strong baseline for predicting best models at our larger target scale (1B) (~80% of comparisons correct). No scaling law methods among 8 baselines exceed the compute-decision frontier of single-scale predictions, but DataDecide can measure improvement in future scaling laws. We also identify that using continuous likelihood metrics as proxies in small experiments makes benchmarks including MMLU, ARC, HellaSwag, MBPP, and HumanEval >80% predictable at the target 1B scale with just 0.01% of the compute.
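The "~80% of comparisons correct" figure can be read as a pairwise agreement rate: for each pair of pretraining corpora, check whether the corpus that scores higher at the small scale also scores higher at the target scale. The sketch below illustrates that reading only. The abstract does not give the exact definition, so the function name `decision_accuracy`, the tie handling, and all scores are assumptions made up for illustration and are not results from the paper.

```python
from itertools import combinations


def decision_accuracy(small_scores, target_scores):
    """Fraction of corpus pairs ranked the same way at both scales.

    Both arguments map a corpus name to a benchmark score (higher is better).
    Pairs tied at the target scale are skipped (an assumption of this sketch),
    and a tie at the small scale counts as an incorrect decision.
    """
    names = sorted(set(small_scores) & set(target_scores))
    agree = 0
    total = 0
    for a, b in combinations(names, 2):
        small_diff = small_scores[a] - small_scores[b]
        target_diff = target_scores[a] - target_scores[b]
        if target_diff == 0:
            continue  # no winner at the target scale, nothing to predict
        total += 1
        if small_diff * target_diff > 0:  # same sign means same winner
            agree += 1
    if total == 0:
        raise ValueError("need at least one pair with a target-scale winner")
    return agree / total


# Hypothetical scores for four corpora; not taken from the paper.
small = {"corpus_a": 0.31, "corpus_b": 0.29, "corpus_c": 0.35, "corpus_d": 0.30}
target = {"corpus_a": 0.42, "corpus_b": 0.44, "corpus_c": 0.50, "corpus_d": 0.40}

accuracy = decision_accuracy(small, target)
print(f"decision accuracy: {accuracy:.2f}")
```

With these made-up scores, four of the six corpus pairs keep the same winner at both scales. Counting pairs rather than comparing only the top-ranked corpus gives a smoother signal when several corpora perform similarly.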

Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL)
Cite as:	arXiv:2504.11393 [cs.LG]
	(or arXiv:2504.11393v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2504.11393

Submission history

From: Ian Magnusson
[v1] Tue, 15 Apr 2025 17:02:15 UTC (1,301 KB)
