CALM: Curiosity-Driven Auditing for Large Language Models

Zheng, Xiang; Wang, Longxiang; Liu, Yi; Ma, Xingjun; Shen, Chao; Wang, Cong

arXiv:2501.02997 (cs)

[Submitted on 6 Jan 2025]

Abstract: Auditing Large Language Models (LLMs) is a crucial and challenging task. In this study, we focus on auditing black-box LLMs, where we have no access to the model parameters and can only query the provided service. We treat this type of auditing as a black-box optimization problem in which the goal is to automatically uncover input-output pairs of the target LLM that exhibit illegal, immoral, or unsafe behaviors. For instance, we may seek a non-toxic input to which the target LLM responds with a toxic output, or an input that induces a hallucinated response from the target LLM that mentions politically sensitive individuals. This black-box optimization is challenging due to the scarcity of feasible points, the discrete nature of the prompt space, and the large search space. To address these challenges, we propose Curiosity-Driven Auditing for Large Language Models (CALM), which uses intrinsically motivated reinforcement learning to finetune an LLM as the auditor agent to uncover potentially harmful and biased input-output pairs of the target LLM. CALM successfully identifies derogatory completions involving celebrities and uncovers inputs that elicit specific names under the black-box setting. This work offers a promising direction for auditing black-box LLMs. Our code is available at this https URL.
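
The search problem described in the abstract can be illustrated with a toy sketch. This is not the authors' CALM implementation: the vocabulary, the `toy_target` rule, the `is_flagged` check, and the count-based novelty bonus are all hypothetical stand-ins, and a simple per-position preference table replaces the RL-finetuned auditor LLM. The sketch only shows the general shape of the loop: query a black box, combine a sparse task reward with an intrinsic exploration bonus, and reinforce the prompts that score well.

```python
import math
import random
from collections import Counter

# Hypothetical toy vocabulary and prompt length (not from the paper).
VOCAB = ["the", "actor", "is", "very", "kind", "tall", "famous", "quiet"]
PROMPT_LEN = 3


def toy_target(prompt):
    """Hypothetical black box: misbehaves only on a rare token combination."""
    if "actor" in prompt and "very" in prompt:
        return "awful"
    return "fine"


def is_flagged(output):
    """Hypothetical stand-in for a toxicity or policy classifier."""
    return output == "awful"


def audit(steps=2000, beta=0.5, lr=0.5, seed=0):
    """Search for prompts whose outputs are flagged, using only queries."""
    rng = random.Random(seed)
    # One preference weight per (position, token): a crude stand-in policy.
    prefs = [{tok: 1.0 for tok in VOCAB} for _ in range(PROMPT_LEN)]
    visits = Counter()  # how often each full prompt has been tried
    found = set()       # prompts that triggered a flagged output
    for _ in range(steps):
        prompt = tuple(
            rng.choices(VOCAB, weights=[prefs[i][t] for t in VOCAB])[0]
            for i in range(PROMPT_LEN)
        )
        output = toy_target(prompt)  # the only access to the target
        visits[prompt] += 1
        extrinsic = 1.0 if is_flagged(output) else 0.0
        # Assumed count-based novelty bonus: rarely tried prompts score higher.
        intrinsic = 1.0 / math.sqrt(visits[prompt])
        reward = extrinsic + beta * intrinsic
        for i, tok in enumerate(prompt):
            prefs[i][tok] += lr * reward  # reinforce the sampled tokens
        if extrinsic:
            found.add(prompt)
    return found, visits


if __name__ == "__main__":
    found, visits = audit()
    print(len(found), "flagged prompts among", len(visits), "distinct prompts")
```

In this toy setting only 42 of the 512 possible prompts trigger the flagged output, which mirrors the scarcity of feasible points at a very small scale. The `beta` weight trades off exploration against exploitation; setting it to 0 reduces the loop to reward-only search.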

Comments:	Accepted by AAAI 2025 AI Alignment Track
Subjects:	Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2501.02997 [cs.AI]
	(or arXiv:2501.02997v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2501.02997

Submission history

From: Xiang Zheng
[v1] Mon, 6 Jan 2025 13:14:34 UTC (1,201 KB)
