GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis

Xie, Yueqi; Fang, Minghong; Pi, Renjie; Gong, Neil

arXiv:2402.13494 (cs)

[Submitted on 21 Feb 2024 (v1), last revised 29 May 2024 (this version, v2)]

Abstract: Large Language Models (LLMs) face threats from jailbreak prompts. Existing methods for detecting jailbreak prompts are primarily online moderation APIs or finetuned LLMs. These strategies, however, often require extensive and resource-intensive data collection and training. In this study, we propose GradSafe, which detects jailbreak prompts by scrutinizing the gradients of safety-critical parameters in LLMs. Our method is grounded in a pivotal observation: the gradients of an LLM's loss for jailbreak prompts paired with a compliance response exhibit similar patterns on certain safety-critical parameters, whereas safe prompts lead to different gradient patterns. Building on this observation, GradSafe analyzes the gradients from prompts (paired with compliance responses) to detect jailbreak prompts. We show that GradSafe, applied to Llama-2 without further training, outperforms Llama Guard in detecting jailbreak prompts, even though Llama Guard was extensively finetuned on a large dataset. This advantage holds in both zero-shot and adaptation scenarios, as evidenced by our evaluations on ToxicChat and XSTest. The source code is available at this https URL.
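
The observation in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it replaces the LLM with a hypothetical four-parameter logistic scorer, treats "paired with a compliance response" as a fixed positive target, and assumes cosine similarity against an averaged reference gradient as the comparison measure, which the abstract does not specify. All names, feature vectors, the `CRITICAL` index set, and the 0.5 threshold are made up for illustration.

```python
import math

def compliance_loss_grad(weights, features):
    """Gradient of -log(sigmoid(w . x)) with respect to w (toy stand-in for
    the LLM loss of a prompt paired with a compliance response)."""
    z = sum(w * x for w, x in zip(weights, features))
    p = 1.0 / (1.0 + math.exp(-z))
    return [-(1.0 - p) * x for x in features]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def restrict(grad, indices):
    """Keep only the gradient entries of the assumed safety-critical parameters."""
    return [grad[i] for i in indices]

def reference_gradient(weights, unsafe_examples, indices):
    """Average the restricted gradients of a few known-unsafe examples."""
    grads = [restrict(compliance_loss_grad(weights, x), indices)
             for x in unsafe_examples]
    return [sum(col) / len(grads) for col in zip(*grads)]

def unsafe_score(weights, features, reference, indices):
    """Similarity between a prompt's restricted gradient and the reference."""
    grad = restrict(compliance_loss_grad(weights, features), indices)
    return cosine(grad, reference)

def is_flagged(score, threshold=0.5):
    """Flag the prompt when its gradient resembles the unsafe reference."""
    return score >= threshold

# Hypothetical toy data: none of these values come from the paper.
WEIGHTS = [0.1, -0.2, 0.3, 0.05]
CRITICAL = [0, 2]  # assumed indices of "safety-critical" parameters
UNSAFE_REFS = [[1.0, 0.3, 0.9, 0.0], [0.8, -0.1, 1.1, 0.2]]
REFERENCE = reference_gradient(WEIGHTS, UNSAFE_REFS, CRITICAL)

score_unsafe_like = unsafe_score(WEIGHTS, [0.9, 0.0, 1.0, 0.5], REFERENCE, CRITICAL)
score_safe_like = unsafe_score(WEIGHTS, [-1.0, 0.4, 0.2, 0.1], REFERENCE, CRITICAL)
```

In this toy setting, the input whose restricted gradient points the same way as the reference receives a high score and is flagged, while the dissimilar input is not. With a real LLM the gradients would come from backpropagating the loss on the compliance response, and the choice of safety-critical parameters would have to follow the paper rather than a hand-picked index list.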

Comments:	Accepted to ACL 2024 Main
Subjects:	Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Cite as:	arXiv:2402.13494 [cs.CL]
	(or arXiv:2402.13494v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2402.13494

Submission history

From: Yueqi Xie
[v1] Wed, 21 Feb 2024 03:09:21 UTC (3,319 KB)
[v2] Wed, 29 May 2024 21:45:35 UTC (10,201 KB)
