Rewrite to Jailbreak: Discover Learnable and Transferable Implicit Harmfulness Instruction

Huang, Yuting; Liu, Chengyuan; Feng, Yifeng; Wu, Chao; Wu, Fei; Kuang, Kun

arXiv:2502.11084 (cs)

[Submitted on 16 Feb 2025]

Abstract: As Large Language Models (LLMs) are widely applied across domains, their safety is attracting increasing attention, with the goal of preventing misuse of their powerful capabilities. Existing jailbreak methods either create a forced instruction-following scenario or search, manually or automatically, for adversarial prompts with prefix or suffix tokens that achieve a specific representation. However, they suffer from low efficiency and explicit jailbreak patterns, which leaves them far from realistic large-scale attacks on LLMs. In this paper, we point out that simply rewriting the original instruction can achieve a jailbreak, and we find that this rewriting approach is learnable and transferable. We propose Rewrite to Jailbreak (R2J), a transferable black-box jailbreak method that attacks LLMs by iteratively exploring their weaknesses and automatically improving the attack strategy. The jailbreak is more efficient and harder to identify because no additional features are introduced. Extensive experiments and analysis demonstrate the effectiveness of R2J, and we find that the jailbreak also transfers to multiple datasets and various types of models with only a few queries. We hope our work motivates further investigation of LLM safety.

Comments:	21 pages, 10 figures
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2502.11084 [cs.CL]
	(or arXiv:2502.11084v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2502.11084

Submission history

From: Yuting Huang
[v1] Sun, 16 Feb 2025 11:43:39 UTC (4,980 KB)
