Data Shapley in One Training Run

Wang, Jiachen T.; Mittal, Prateek; Song, Dawn; Jia, Ruoxi

arXiv:2406.11011 (cs)

[Submitted on 16 Jun 2024 (v1), last revised 29 Jun 2024 (this version, v2)]

Abstract: Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts. However, existing approaches require re-training models on different data subsets, which is computationally intensive, foreclosing their application to large-scale models. Furthermore, they produce the same attribution score for any models produced by running the learning algorithm, meaning they cannot perform targeted attribution towards a specific model obtained from a single run of the algorithm. This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest. In its most efficient implementation, our technique incurs negligible additional runtime compared to standard model training. This dramatic efficiency improvement makes it possible to perform data attribution for the foundation model pretraining stage for the first time. We present several case studies that offer fresh insights into pretraining data's contribution and discuss their implications for copyright in generative AI and pretraining data curation.
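
To illustrate why the retraining-based approaches mentioned in the abstract are costly, the following is a minimal sketch of the classic exact Shapley computation over data points. It is not the In-Run Data Shapley method, which the abstract does not specify. The names `exact_data_shapley` and `toy_utility` are hypothetical, and the utility is a stand-in for "retrain on the subset, then evaluate": here the "model" is simply the subset mean, scored by negative squared error against an assumed target of 1.0.

```python
from itertools import combinations
from math import comb


def toy_utility(subset):
    """Hypothetical stand-in for 'retrain on subset, then evaluate'.

    The 'model' is the mean of the subset (0.0 for the empty set), and the
    utility is the negative squared error against an assumed target of 1.0.
    """
    prediction = sum(subset) / len(subset) if subset else 0.0
    return -(prediction - 1.0) ** 2


def exact_data_shapley(points, utility):
    """Exact Shapley value of each point under a set-valued utility.

    Every subset of the other points is evaluated with and without point i,
    so the number of utility calls (retrainings) grows as 2^n.
    """
    n = len(points)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):  # k = size of the subset that excludes point i
            weight = 1.0 / (n * comb(n - 1, k))
            for subset in combinations(others, k):
                base = [points[j] for j in subset]
                marginal = utility(base + [points[i]]) - utility(base)
                values[i] += weight * marginal
    return values


points = [1.0, 1.0, 5.0]  # two points on target, one outlier
values = exact_data_shapley(points, toy_utility)
print(values)
```

In this toy setting the outlier receives a lower score than the two on-target points, and the scores sum to the utility of the full dataset minus the utility of the empty set. With real models each `utility` call would be a full training run, which is the cost the abstract refers to.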

Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Cite as:	arXiv:2406.11011 [cs.LG]
	(or arXiv:2406.11011v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2406.11011

Submission history

From: Jiachen T. Wang
[v1] Sun, 16 Jun 2024 17:09:24 UTC (7,588 KB)
[v2] Sat, 29 Jun 2024 23:05:32 UTC (7,588 KB)
