Mitigating Dataset Harms Requires Stewardship: Lessons from 1000 Papers

Peng, Kenny; Mathur, Arunesh; Narayanan, Arvind

arXiv:2108.02922 (cs)

[Submitted on 6 Aug 2021 (v1), last revised 21 Nov 2021 (this version, v2)]

Abstract: Machine learning datasets have elicited concerns about privacy, bias, and unethical applications, leading to the retraction of prominent datasets such as DukeMTMC, MS-Celeb-1M, and Tiny Images. In response, the machine learning community has called for higher ethical standards in dataset creation. To help inform these efforts, we studied three influential but ethically problematic face and person recognition datasets -- Labeled Faces in the Wild (LFW), MS-Celeb-1M, and DukeMTMC -- by analyzing nearly 1000 papers that cite them. We found that the creation of derivative datasets and models, broader technological and social change, the lack of clarity of licenses, and dataset management practices can introduce a wide range of ethical concerns. We conclude by suggesting a distributed approach to harm mitigation that considers the entire life cycle of a dataset.

Subjects:	Machine Learning (cs.LG); Computers and Society (cs.CY)
Cite as:	arXiv:2108.02922 [cs.LG]
	(or arXiv:2108.02922v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2108.02922

Submission history

From: Kenny Peng
[v1] Fri, 6 Aug 2021 02:52:36 UTC (785 KB)
[v2] Sun, 21 Nov 2021 17:58:58 UTC (97 KB)

