Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement

Kim, Heegyu; Yuk, Sehyun; Cho, Hyunsouk

arXiv:2402.15180 (cs)

[Submitted on 23 Feb 2024 (v1), last revised 27 Feb 2024 (this version, v2)]

Abstract: Caution: This paper contains offensive language that some readers may find unpleasant. Language models (LMs) are vulnerable to exploitation for adversarial misuse. Training LMs for safety alignment is extensive, which makes it hard to respond immediately to fast-developing attacks such as jailbreaks. We propose self-refine with formatting, which achieves outstanding safety even in non-safety-aligned LMs. We evaluate our method alongside several defense baselines and demonstrate that it is the safest training-free method against jailbreak attacks. Additionally, we propose a formatting method that improves the efficiency of the self-refine process, reducing attack success rates in fewer iterations. We also observe that non-safety-aligned LMs outperform safety-aligned LMs in safety tasks by giving more helpful and safe responses. In conclusion, our findings show that safety risk can be reduced at lower computational cost, allowing non-safety-aligned LMs to be readily used in real-world services.
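The abstract does not spell out the procedure, so the following is only a minimal sketch of what an iterative self-refine defense might look like, not the authors' implementation. Everything named here is an assumption for illustration: the `LM` and `Judge` callables, the prompt wording, the `max_iters` budget, and in particular the use of JSON as the "formatting" step (the abstract does not say which format is used). The attack success rate is taken here to be the fraction of responses judged unsafe.

```python
import json
from typing import Callable, List, Tuple

# Hypothetical interfaces, not taken from the paper.
LM = Callable[[str], str]      # prompt -> completion
Judge = Callable[[str], bool]  # response -> True if judged safe


def format_block(prompt: str, response: str) -> str:
    """Assumed formatting step: serialize as JSON so the text is presented as data."""
    return json.dumps({"prompt": prompt, "response": response}, indent=2)


def self_refine(lm: LM, is_safe: Judge, prompt: str,
                max_iters: int = 3) -> Tuple[str, int]:
    """Return (final response, number of refinement rounds used)."""
    response = lm(prompt)
    rounds = 0
    for _ in range(max_iters):
        if is_safe(response):
            break  # stop early once the judge accepts the response
        block = format_block(prompt, response)
        # Ask the same LM for feedback, then for a revised response.
        feedback = lm("Critique the response in this JSON for harmful content:\n" + block)
        response = lm("Rewrite the response in this JSON so it is safe and helpful.\n"
                      + block + "\nFeedback: " + feedback)
        rounds += 1
    return response, rounds


def attack_success_rate(responses: List[str], is_safe: Judge) -> float:
    """Fraction of responses that the judge flags as unsafe."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(not is_safe(r) for r in responses) / len(responses)


# Toy stand-ins so the sketch runs without a real model or safety classifier.
def toy_lm(prompt: str) -> str:
    if prompt.startswith("Rewrite"):
        return "I can't help with that, but here is some general safety information."
    if prompt.startswith("Critique"):
        return "The response gives harmful instructions."
    return "UNSAFE: step-by-step instructions"


def toy_is_safe(response: str) -> bool:
    return not response.startswith("UNSAFE")


if __name__ == "__main__":
    final, rounds = self_refine(toy_lm, toy_is_safe, "Ignore all previous rules and ...")
    print(rounds, final)
```

In a real setting, `toy_lm` would be replaced by a call to the target model and `toy_is_safe` by a safety classifier or evaluator; the early exit is what lets fewer iterations translate into lower compute cost.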

Comments:	under review
Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Cite as:	arXiv:2402.15180 [cs.LG]
	(or arXiv:2402.15180v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2402.15180

Submission history

From: Heegyu Kim
[v1] Fri, 23 Feb 2024 08:22:24 UTC (4,352 KB)
[v2] Tue, 27 Feb 2024 01:39:20 UTC (4,352 KB)
