Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models

Zheng, Ziwei; Zhao, Junyao; Yang, Le; He, Lijun; Li, Fan

arXiv:2501.02029 (cs)

[Submitted on 3 Jan 2025]

Abstract: With the integration of an additional modality, large vision-language models (LVLMs) exhibit greater vulnerability to safety risks (e.g., jailbreaking) compared to their language-only predecessors. Although recent studies have devoted considerable effort to the post-hoc alignment of LVLMs, the inner safety mechanisms remain largely unexplored. In this paper, we discover that internal activations of LVLMs during the first token generation can effectively identify malicious prompts across different attacks. This inherent safety perception is governed by sparse attention heads, which we term "safety heads." Further analysis reveals that these heads act as specialized shields against malicious prompts; ablating them leads to higher attack success rates, while the model's utility remains unaffected. By locating these safety heads and concatenating their activations, we construct a straightforward but powerful malicious prompt detector that integrates seamlessly into the generation process with minimal extra inference overhead. Despite its simple structure of a logistic regression model, the detector surprisingly exhibits strong zero-shot generalization capabilities. Experiments across various prompt-based attacks confirm the effectiveness of leveraging safety heads to protect LVLMs. Code is available at this https URL.
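
The detector described in the abstract (locate a sparse set of heads, concatenate their activations, fit a logistic regression) can be illustrated with a minimal toy sketch. Everything below is an assumption made for illustration and is not the authors' code: the activations are synthetic Gaussian vectors rather than LVLM outputs, the head-scoring rule (distance between class-mean activations) is a stand-in because the abstract does not state how safety heads are located, and all names (`make_toy_activations`, `head_scores`, `SIGNAL_HEADS`, and so on) are hypothetical.

```python
import math
import random


def make_toy_activations(n_samples, n_heads, head_dim, signal_heads, rng):
    """Synthetic stand-in for per-head activations at the first generated token.

    Heads in `signal_heads` are shifted by +1 for malicious prompts (label 1)
    and by -1 for benign prompts (label 0); every other head is pure noise.
    This is toy data, not output from any real model.
    """
    data, labels = [], []
    for i in range(n_samples):
        label = i % 2  # alternate labels so both classes are balanced
        sign = 1.0 if label == 1 else -1.0
        sample = []
        for h in range(n_heads):
            shift = sign if h in signal_heads else 0.0
            sample.append([rng.gauss(shift, 1.0) for _ in range(head_dim)])
        data.append(sample)
        labels.append(label)
    return data, labels


def head_scores(data, labels):
    """Score each head by the Euclidean distance between its class means.

    Assumed criterion for this sketch only.
    """
    n_heads, head_dim = len(data[0]), len(data[0][0])
    scores = []
    for h in range(n_heads):
        sums = {0: [0.0] * head_dim, 1: [0.0] * head_dim}
        counts = {0: 0, 1: 0}
        for sample, y in zip(data, labels):
            counts[y] += 1
            for d in range(head_dim):
                sums[y][d] += sample[h][d]
        diff = [sums[1][d] / counts[1] - sums[0][d] / counts[0]
                for d in range(head_dim)]
        scores.append(math.sqrt(sum(v * v for v in diff)))
    return scores


def concat_heads(sample, heads):
    """Concatenate the activation vectors of the selected heads."""
    return [v for h in heads for v in sample[h]]


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logreg(X, y, lr=0.1, epochs=200):
    """Plain batch gradient descent on the logistic loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 (flag as malicious) when the predicted probability is >= 0.5."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


rng = random.Random(0)
SIGNAL_HEADS = {3, 7}  # hypothetical "safety heads" planted in the toy data
train, y_train = make_toy_activations(200, 12, 4, SIGNAL_HEADS, rng)
test, y_test = make_toy_activations(100, 12, 4, SIGNAL_HEADS, rng)

# Step 1: rank heads and keep the two most discriminative ones.
scores = head_scores(train, y_train)
top_heads = sorted(range(len(scores)), key=lambda h: scores[h], reverse=True)[:2]

# Step 2: concatenate the selected heads' activations and fit the detector.
X_train = [concat_heads(s, top_heads) for s in train]
w, b = fit_logreg(X_train, y_train)

# Step 3: evaluate on held-out toy samples.
hits = sum(predict(w, b, concat_heads(s, top_heads)) == y
           for s, y in zip(test, y_test))
accuracy = hits / len(test)
print(sorted(top_heads), accuracy)
```

In a real setting the synthetic generator would be replaced by activations captured from the model's attention heads during the first decoding step, which is where the abstract says the detector hooks into generation; how the paper selects heads and trains the classifier should be taken from the paper and its released code rather than from this sketch.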

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2501.02029 [cs.LG]
	(or arXiv:2501.02029v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2501.02029

Submission history

From: Ziwei Zheng
[v1] Fri, 3 Jan 2025 07:01:15 UTC (3,998 KB)
