Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?

Ren, Richard; Basart, Steven; Khoja, Adam; Gatti, Alice; Phan, Long; Yin, Xuwang; Mazeika, Mantas; Pan, Alexander; Mukobi, Gabriel; Kim, Ryan H.; Fitz, Stephen; Hendrycks, Dan

arXiv:2407.21792 (cs)

[Submitted on 31 Jul 2024]

Abstract: As artificial intelligence systems grow more powerful, there has been increasing interest in "AI safety" research to address emerging and future risks. However, the field of AI safety remains poorly defined and inconsistently measured, leading to confusion about how researchers can contribute. This lack of clarity is compounded by the unclear relationship between AI safety benchmarks and upstream general capabilities (e.g., general knowledge and reasoning). To address these issues, we conduct a comprehensive meta-analysis of AI safety benchmarks, empirically analyzing their correlation with general capabilities across dozens of models and providing a survey of existing directions in AI safety. Our findings reveal that many safety benchmarks highly correlate with upstream model capabilities, potentially enabling "safetywashing" -- where capability improvements are misrepresented as safety advancements. Based on these findings, we propose an empirical foundation for developing more meaningful safety metrics and define AI safety in a machine learning research context as a set of clearly delineated research goals that are empirically separable from generic capabilities advancements. In doing so, we aim to provide a more rigorous framework for AI safety research, advancing the science of safety evaluations and clarifying the path towards measurable progress.
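As a rough illustration of the kind of check the abstract describes, the sketch below correlates a safety benchmark's scores with a general-capabilities score across a set of models. The abstract does not say which correlation statistic or capabilities measure is used; Spearman rank correlation, the helper names (`ranks`, `pearson`, `spearman`), and all scores here are assumptions made for this example, and the numbers are invented rather than taken from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Invented scores for five hypothetical models (not data from the paper).
capabilities = [0.31, 0.45, 0.58, 0.66, 0.79]  # assumed general-capabilities score
safety_a = [0.40, 0.52, 0.61, 0.70, 0.83]      # rises in step with capabilities
safety_b = [0.55, 0.48, 0.62, 0.44, 0.57]      # largely unrelated to capabilities

corr_a = spearman(capabilities, safety_a)
corr_b = spearman(capabilities, safety_b)
print(f"benchmark A vs capabilities: {corr_a:.2f}")
print(f"benchmark B vs capabilities: {corr_b:.2f}")
```

In this toy data, benchmark A ranks the models exactly as the capabilities score does, so gains on it could be explained by capability gains alone, which is the situation the abstract calls "safetywashing". Benchmark B is only weakly related to capabilities, which is closer to what the abstract means by "empirically separable".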

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Cite as:	arXiv:2407.21792 [cs.LG]
	(or arXiv:2407.21792v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2407.21792

Submission history

From: Dan Hendrycks
[v1] Wed, 31 Jul 2024 17:59:24 UTC (675 KB)
