Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking

Zhu, Junda; Yan, Lingyong; Wang, Shuaiqiang; Yin, Dawei; Sha, Lei

Computer Science > Computation and Language

arXiv:2502.12970 (cs)

[Submitted on 18 Feb 2025]

Title:Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking

Authors:Junda Zhu, Lingyong Yan, Shuaiqiang Wang, Dawei Yin, Lei Sha

View PDF HTML (experimental)

Abstract:The reasoning abilities of Large Language Models (LLMs) have demonstrated remarkable advancement and exceptional performance across diverse domains. However, leveraging these reasoning capabilities to enhance LLM safety against adversarial attacks and jailbreak queries remains largely unexplored. To bridge this gap, we propose Reasoning-to-Defend (R2D), a novel training paradigm that integrates safety reflections of queries and responses into LLMs' generation process, unlocking a safety-aware reasoning mechanism. This approach enables self-evaluation at each reasoning step to create safety pivot tokens as indicators of the response's safety status. Furthermore, in order to improve the learning efficiency of pivot token prediction, we propose Contrastive Pivot Optimization(CPO), which enhances the model's ability to perceive the safety status of dialogues. Through this mechanism, LLMs dynamically adjust their response strategies during reasoning, significantly enhancing their defense capabilities against jailbreak attacks. Extensive experimental results demonstrate that R2D effectively mitigates various attacks and improves overall safety, highlighting the substantial potential of safety-aware reasoning in strengthening LLMs' robustness against jailbreaks.

Comments:	18 pages
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2502.12970 [cs.CL]
	(or arXiv:2502.12970v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2502.12970

Submission history

From: Junda Zhu [view email]
[v1] Tue, 18 Feb 2025 15:48:46 UTC (857 KB)

Computer Science > Computation and Language

Title:Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computation and Language

Title:Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators