Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks

Yi, Xin; Li, Yue; Wang, Linlin; Wang, Xiaoling; He, Liang

arXiv:2501.10639 (cs)

[Submitted on 18 Jan 2025]

Abstract: Ensuring safety alignment has become a critical requirement for large language models (LLMs), particularly given their widespread deployment in real-world applications. However, LLMs remain susceptible to jailbreak attacks, which exploit system vulnerabilities to bypass safety measures and generate harmful outputs. Although numerous defense mechanisms based on adversarial training have been proposed, a persistent challenge is that they exacerbate over-refusal behaviors, which compromises the overall utility of the model. To address this challenge, we propose a Latent-space Adversarial Training with Post-aware Calibration (LATPC) framework. During the adversarial training phase, LATPC compares harmful and harmless instructions in the latent space and extracts safety-critical dimensions to construct a refusal feature attack, precisely simulating attack-type-agnostic jailbreak behavior that requires adversarial mitigation. At the inference stage, an embedding-level calibration mechanism is employed to alleviate over-refusal behaviors with minimal computational overhead. Experimental results demonstrate that, compared to various defense methods across five types of jailbreak attacks, the LATPC framework achieves a superior balance between safety and utility. Moreover, our analysis underscores the effectiveness of extracting safety-critical dimensions from the latent space for constructing robust refusal feature attacks.
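
The abstract does not spell out how the refusal feature attack is computed, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes the refusal feature is the difference between the mean hidden states of harmful and harmless instructions, that "safety-critical dimensions" are the k coordinates where that difference is largest in magnitude, and that the attack subtracts the feature on those coordinates. All function names (`refusal_feature`, `top_k_dimensions`, `refusal_feature_attack`) and the toy vectors are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Column-wise mean of a batch of hidden-state vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_feature(harmful: Sequence[Sequence[float]],
                    harmless: Sequence[Sequence[float]]) -> Vector:
    """Assumed definition: mean(harmful) - mean(harmless), per dimension."""
    mean_harmful = mean_vector(harmful)
    mean_harmless = mean_vector(harmless)
    return [a - b for a, b in zip(mean_harmful, mean_harmless)]


def top_k_dimensions(feature: Sequence[float], k: int) -> List[int]:
    """Assumed selection rule: the k dimensions with the largest |difference|."""
    order = sorted(range(len(feature)), key=lambda i: abs(feature[i]), reverse=True)
    return sorted(order[:k])


def refusal_feature_attack(hidden: Sequence[float],
                           feature: Sequence[float],
                           dims: Sequence[int]) -> Vector:
    """Assumed attack: remove the refusal feature on the selected dimensions only."""
    perturbed = list(hidden)
    for i in dims:
        perturbed[i] -= feature[i]
    return perturbed


# Toy 4-dimensional hidden states (made-up numbers, for illustration only).
harmful_states = [[2.0, 0.1, -1.0, 0.0], [4.0, -0.1, -3.0, 0.2]]
harmless_states = [[0.0, 0.1, 0.0, 0.0], [0.0, -0.1, 0.0, 0.2]]

feature = refusal_feature(harmful_states, harmless_states)
critical_dims = top_k_dimensions(feature, k=2)
attacked = refusal_feature_attack([3.0, 0.5, -2.0, 0.1], feature, critical_dims)
```

In an adversarial training loop of the kind the abstract describes, a perturbed hidden state such as `attacked` would stand in for a jailbreak, and the model would be trained to keep refusing despite it; that training step and the inference-time calibration are not sketched here because the abstract gives no detail about them.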

Comments:	Under Review
Subjects:	Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Cite as:	arXiv:2501.10639 [cs.CR]
	(or arXiv:2501.10639v1 [cs.CR] for this version)
	https://doi.org/10.48550/arXiv.2501.10639

Submission history

From: Xin Yi
[v1] Sat, 18 Jan 2025 02:57:12 UTC (2,042 KB)
