Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models

Gao, Lang; Zhang, Xiangliang; Nakov, Preslav; Chen, Xiuying


arXiv:2412.17034 (cs)

[Submitted on 22 Dec 2024]



Abstract: Jailbreaking in Large Language Models (LLMs) is a major security concern, as it can deceive LLMs into generating harmful text. Yet there is still insufficient understanding of how jailbreaking works, which makes it hard to develop effective defense strategies. We aim to shed more light on this issue: we conduct a detailed large-scale analysis of seven different jailbreak methods and find that these disagreements stem from insufficient observation samples. In particular, we introduce the safety boundary, and we find that jailbreaks shift harmful activations outside that safety boundary, where LLMs are less sensitive to harmful information. We also find that the low and middle layers are critical in such shifts, while deeper layers have less impact. Leveraging these insights, we propose a novel defense called Activation Boundary Defense (ABD), which adaptively constrains the activations within the safety boundary. We further use Bayesian optimization to selectively apply the defense method to the low and middle layers. Our experiments on several benchmarks show that ABD achieves an average Defense Success Rate (DSR) of over 98% against various forms of jailbreak attacks, with less than 2% impact on the model's general capabilities.
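
The abstract does not spell out how activations are constrained, so the following is only a minimal sketch of the general idea and not the authors' implementation. It assumes, purely for illustration, that the safety boundary at one layer can be approximated by a ball around the centroid of reference activations. The names `fit_boundary` and `constrain` are hypothetical, and the paper's adaptive constraint and Bayesian layer selection are not reproduced here.

```python
import math


def fit_boundary(reference_activations):
    """Estimate a hypothetical spherical boundary from reference activations.

    Assumption for illustration only: the boundary is the smallest ball,
    centered on the centroid, that contains every reference vector.
    """
    n = len(reference_activations)
    dim = len(reference_activations[0])
    center = [sum(v[i] for v in reference_activations) / n for i in range(dim)]
    radius = max(math.dist(v, center) for v in reference_activations)
    return center, radius


def constrain(activation, center, radius):
    """Keep an activation inside the ball.

    Activations already inside are returned unchanged; activations outside
    are projected onto the surface along the line to the center.
    """
    distance = math.dist(activation, center)
    if distance <= radius:
        return list(activation)
    scale = radius / distance
    return [c + (a - c) * scale for a, c in zip(activation, center)]


if __name__ == "__main__":
    # Toy 2-D "activations"; real hidden states have thousands of dimensions.
    reference = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
    center, radius = fit_boundary(reference)
    print(constrain([0.5, 0.0], center, radius))  # inside: left as-is
    print(constrain([3.0, 4.0], center, radius))  # outside: pulled to the surface
```

In this toy version only activations that fall outside the ball are modified, which loosely mirrors the "adaptive" behaviour described above: inputs whose activations already lie within the boundary are not altered.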

Comments:	17 pages, 9 figures
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2412.17034 [cs.CL]
	(or arXiv:2412.17034v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2412.17034

Submission history

From: Lang Gao [view email]
[v1] Sun, 22 Dec 2024 14:18:39 UTC (6,743 KB)
