Self-Instruct Few-Shot Jailbreaking: Decompose the Attack into Pattern and Behavior Learning

Hua, Jiaqi; Wei, Wanxu


arXiv:2501.07959 (cs)

[Submitted on 14 Jan 2025]

Abstract: Recently, several works have studied jailbreaking Large Language Models (LLMs) with few-shot malicious demos. In particular, Zheng et al. (2024) focus on improving the efficiency of Few-Shot Jailbreaking (FSJ) by injecting special tokens into the demos and employing demo-level random search. Nevertheless, this method lacks generality because it relies on a specified instruction-response structure. Moreover, the reason why inserting special tokens is effective in inducing harmful behaviors has only been discussed empirically. In this paper, we take a closer look at the mechanism of special token injection and propose Self-Instruct Few-Shot Jailbreaking (Self-Instruct-FSJ), facilitated by demo-level greedy search. This framework decomposes the FSJ attack into pattern learning and behavior learning to exploit the model's vulnerabilities in a more general and efficient way. We conduct extensive experiments to evaluate our method on common open-source models and compare it with baseline algorithms. Our code is available at this https URL.

Subjects:	Artificial Intelligence (cs.AI)
Cite as:	arXiv:2501.07959 [cs.AI]
	(or arXiv:2501.07959v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2501.07959

Submission history

From: Jiaqi Hua
[v1] Tue, 14 Jan 2025 09:23:30 UTC (293 KB)
