Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking

Wu, Yu-Hang; Xiong, Yu-Jie; Jie-Zhang

arXiv:2504.05652 (cs)

[Submitted on 8 Apr 2025]

Abstract: Large Language Models (LLMs) have become increasingly integral to a wide range of applications. However, they remain vulnerable to jailbreak attacks, in which attackers craft prompts that make the models produce malicious outputs. Analyzing jailbreak methods can help us understand the weaknesses of LLMs and improve them. In this paper, we reveal a vulnerability in LLMs, which we term Defense Threshold Decay (DTD), by analyzing how much attention the model's output pays to the input and how much its later output pays to its earlier output: as the model generates substantial benign content, its attention weights shift from the input to the prior output, making it more susceptible to jailbreak attacks. To demonstrate the exploitability of DTD, we propose a novel jailbreak attack method, Sugar-Coated Poison (SCP), which induces the model to generate substantial benign content through benign input and adversarial reasoning, and subsequently to produce malicious content. To mitigate such attacks, we introduce a simple yet effective defense strategy, POSD, which significantly reduces jailbreak success rates while preserving the model's generalization capabilities.
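
The attention shift described in the abstract can be illustrated with a small measurement sketch. This is not the authors' code: the function name `attention_split`, the input layout (one row of attention weights per generated token, covering the input tokens followed by the previously generated tokens), and the toy weights are all assumptions made for illustration. The sketch only reports, for each generated token, what fraction of its attention mass falls on the input versus on the prior output.

```python
from typing import List, Sequence, Tuple


def attention_split(
    attn_rows: Sequence[Sequence[float]], n_input: int
) -> List[Tuple[float, float]]:
    """Return (input_share, prior_output_share) for each generated token.

    attn_rows[t] holds the attention weights of generated token t over all
    positions visible to it: the n_input prompt tokens first, then the
    tokens generated before it. This layout is an assumption of the sketch.
    """
    shares = []
    for row in attn_rows:
        total = sum(row)
        if total <= 0:
            raise ValueError("attention row must have positive mass")
        input_mass = sum(row[:n_input])
        # Normalize so the two shares sum to 1 even if the row does not.
        shares.append((input_mass / total, (total - input_mass) / total))
    return shares


# Toy, made-up weights for a prompt of 2 tokens and 3 generated tokens.
rows = [
    [0.5, 0.5],            # first generated token sees only the prompt
    [0.3, 0.3, 0.4],       # second token also sees one prior output token
    [0.1, 0.1, 0.4, 0.4],  # third token sees two prior output tokens
]
shares = attention_split(rows, n_input=2)
for step, (on_input, on_output) in enumerate(shares):
    print(f"step {step}: input={on_input:.2f} prior_output={on_output:.2f}")
```

In this toy data the input share falls as generation proceeds, which is the pattern the abstract attributes to DTD; real measurements would require attention weights exported from an actual model, typically averaged over heads and layers.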

Subjects:	Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Cite as:	arXiv:2504.05652 [cs.CR]
	(or arXiv:2504.05652v1 [cs.CR] for this version)
	https://doi.org/10.48550/arXiv.2504.05652

Submission history

From: Yuhang Wu
[v1] Tue, 8 Apr 2025 03:57:09 UTC (1,885 KB)
