The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Safety Analysis

Pan, Wenbo; Liu, Zhichao; Chen, Qiguang; Zhou, Xiangyang; Yu, Haining; Jia, Xiaohua

arXiv:2502.09674 (cs)

[Submitted on 13 Feb 2025 (v1), last revised 18 Feb 2025 (this version, v2)]


Abstract: Large Language Models' safety-aligned behaviors, such as refusing harmful queries, can be represented by linear directions in activation space. Previous research modeled safety behavior with a single direction, limiting mechanistic understanding to an isolated safety feature. In this work, we discover that safety-aligned behavior is jointly controlled by multi-dimensional directions. Namely, we study the vector space of representation shifts during safety fine-tuning on Llama 3 8B for refusing jailbreaks. By studying orthogonal directions in the space, we first find that a dominant direction governs the model's refusal behavior, while multiple smaller directions represent distinct and interpretable features like hypothetical narrative and role-playing. We then measure how different directions promote or suppress the dominant direction, showing the important role of secondary directions in shaping the model's refusal representation. Finally, we demonstrate that removing certain trigger tokens in harmful queries can mitigate these directions to bypass the learned safety capability, providing new insights on understanding safety alignment vulnerability from a multi-dimensional perspective. Code and artifacts are available at this https URL.
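To make the idea of "orthogonal directions in the space of representation shifts" concrete, the following is a minimal sketch on synthetic data. It is not the authors' implementation: it assumes the shift is simply the difference between fine-tuned and base activations on the same inputs, and it uses a plain singular value decomposition to obtain orthogonal directions. The function names `shift_directions` and `ablate`, the array shapes, and the synthetic activations are all hypothetical choices made for illustration.

```python
import numpy as np


def shift_directions(base_acts, tuned_acts, k=2):
    """Top-k orthonormal directions of the activation shift (illustrative only).

    base_acts, tuned_acts: arrays of shape (n_samples, d_model), assumed to be
    activations of the base and fine-tuned models on the same inputs.
    """
    shifts = tuned_acts - base_acts
    # Rows of vt are orthonormal directions, ordered by singular value.
    _, s, vt = np.linalg.svd(shifts, full_matrices=False)
    return vt[:k], s[:k]


def ablate(acts, direction):
    """Project the given direction out of every activation vector."""
    d = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ d, d)


# Synthetic stand-in for real activations (assumption: no model is loaded here).
rng = np.random.default_rng(0)
n_samples, d_model = 200, 16
base = rng.normal(size=(n_samples, d_model))

# One strong shift along axis 0 and a weaker shift along axis 1.
u_dominant = np.zeros(d_model)
u_dominant[0] = 1.0
u_secondary = np.zeros(d_model)
u_secondary[1] = 1.0
shift = (5.0 * rng.normal(size=(n_samples, 1)) * u_dominant
         + 1.0 * rng.normal(size=(n_samples, 1)) * u_secondary)
tuned = base + shift

directions, strengths = shift_directions(base, tuned, k=2)
residual = ablate(shift, directions[0])
```

In this toy setup the first recovered direction aligns closely with the strong synthetic axis and the second with the weaker one, mirroring the dominant-versus-secondary distinction the abstract describes. Projecting out the first direction with `ablate` leaves only the weaker component of the shift.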

Comments:	Code and artifacts: this https URL
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2502.09674 [cs.CL]
	(or arXiv:2502.09674v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2502.09674

Submission history

From: Wenbo Pan [view email]
[v1] Thu, 13 Feb 2025 06:39:22 UTC (6,962 KB)
[v2] Tue, 18 Feb 2025 03:24:45 UTC (6,946 KB)
