Data Valuation with Gradient Similarity

Evans, Nathaniel J.; Mills, Gordon B.; Wu, Guanming; Song, Xubo; McWeeney, Shannon

Computer Science > Machine Learning

arXiv:2405.08217 (cs)

[Submitted on 13 May 2024]

Abstract: High-quality data is crucial for accurate machine learning and actionable analytics; however, mislabeled or noisy data is a common problem in many domains. Distinguishing low- from high-quality data can be challenging, often requiring expert knowledge and considerable manual intervention. Data Valuation algorithms are a class of methods that seek to quantify the value of each sample in a dataset based on its contribution or importance to a given predictive task. These data values have shown an impressive ability to identify mislabeled observations, and filtering low-value data can boost machine learning performance. In this work, we present a simple alternative to existing methods, termed Data Valuation with Gradient Similarity (DVGS). This approach can be easily applied to any gradient descent learning algorithm, scales well to large datasets, and performs comparably to or better than baseline valuation methods for tasks such as corrupted label discovery and noise quantification. We evaluate the DVGS method on tabular, image, and RNA expression datasets to show the effectiveness of the method across domains. Our approach can rapidly and accurately identify low-quality data, which can reduce the need for expert knowledge and manual intervention in data cleaning tasks.
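The abstract does not spell out the DVGS procedure, so the following is only a minimal sketch of one plausible reading of "gradient similarity": while a model is trained by gradient descent, each source sample is scored by the cosine similarity between its own loss gradient and the mean gradient on a small trusted target set, and the scores are averaged over training steps. The logistic-regression model, the choice of cosine similarity, the synthetic data, and every function and variable name below are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def per_sample_grads(w, X, y):
    # Gradient of the logistic loss for each row: (sigmoid(x.w) - y) * x
    return (sigmoid(X @ w) - y)[:, None] * X

def gradient_similarity_values(X_src, y_src, X_tgt, y_tgt,
                               steps=200, lr=0.5, batch_size=32, seed=0):
    """Hypothetical helper: average cosine similarity between each source
    sample's gradient and the mean target gradient over SGD steps."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X_src.shape[1])
    values = np.zeros(len(X_src))
    for _ in range(steps):
        g_src = per_sample_grads(w, X_src, y_src)               # shape (n, d)
        g_tgt = per_sample_grads(w, X_tgt, y_tgt).mean(axis=0)  # shape (d,)
        denom = np.linalg.norm(g_src, axis=1) * np.linalg.norm(g_tgt) + 1e-12
        values += (g_src @ g_tgt) / denom
        # Plain mini-batch SGD step on the (possibly noisy) source data
        batch = rng.choice(len(X_src), size=batch_size, replace=False)
        w -= lr * g_src[batch].mean(axis=0)
    return values / steps

# Synthetic demonstration (assumed data, not from the paper)
rng = np.random.default_rng(42)
w_true = rng.normal(size=5)
X_src = rng.normal(size=(400, 5))
y_src = (X_src @ w_true > 0).astype(float)
n_flipped = 40
y_src[:n_flipped] = 1.0 - y_src[:n_flipped]   # corrupt the first 40 labels
X_tgt = rng.normal(size=(200, 5))
y_tgt = (X_tgt @ w_true > 0).astype(float)    # clean target set

values = gradient_similarity_values(X_src, y_src, X_tgt, y_tgt)
print("mean value, corrupted labels:", values[:n_flipped].mean())
print("mean value, clean labels:    ", values[n_flipped:].mean())
```

Under these assumptions, samples with flipped labels tend to produce gradients that oppose the target gradient, so ranking by value and inspecting the lowest-scoring samples is one way such scores could be used for corrupted label discovery.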

Subjects: Machine Learning (cs.LG); Genomics (q-bio.GN); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
Cite as: arXiv:2405.08217 [cs.LG] (or arXiv:2405.08217v1 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2405.08217

Submission history

From: Nathaniel Evans
[v1] Mon, 13 May 2024 22:10:00 UTC (5,359 KB)
