The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence

Wollschläger, Tom; Elstner, Jannes; Geisler, Simon; Cohen-Addad, Vincent; Günnemann, Stephan; Gasteiger, Johannes

arXiv:2502.17420 (cs)

[Submitted on 24 Feb 2025]

Abstract: The safety alignment of large language models (LLMs) can be circumvented through adversarially crafted inputs, yet the mechanisms by which these attacks bypass safety barriers remain poorly understood. Prior work suggests that a single refusal direction in the model's activation space determines whether an LLM refuses a request. In this study, we propose a novel gradient-based approach to representation engineering and use it to identify refusal directions. Contrary to prior work, we uncover multiple independent directions and even multi-dimensional concept cones that mediate refusal. Moreover, we show that orthogonality alone does not imply independence under intervention, motivating the notion of representational independence that accounts for both linear and non-linear effects. Using this framework, we identify mechanistically independent refusal directions. We show that refusal mechanisms in LLMs are governed by complex spatial structures and identify functionally independent directions, confirming that multiple distinct mechanisms drive refusal behavior. Our gradient-based approach uncovers these mechanisms and can further serve as a foundation for future work on understanding LLMs.
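
To make the geometric vocabulary of the abstract concrete, here is a minimal NumPy sketch on toy data. It is not the authors' gradient-based method and it involves no real model: the activations are random, the function names `ablate_direction` and `sample_from_cone` are hypothetical, and the choice of an orthonormal two-vector basis is an assumption made for illustration. It shows two things only: how a single direction can be projected out of a batch of activations (a common intervention in prior work on refusal directions), and what it means for a direction to lie in a cone, namely that it is a non-negative combination of the basis vectors.

```python
import numpy as np

def ablate_direction(acts, direction):
    """Remove the component of each activation row along `direction`."""
    r = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ r, r)

def sample_from_cone(basis, rng):
    """Return a unit vector that is a non-negative combination of the rows of `basis`."""
    weights = rng.uniform(0.0, 1.0, size=basis.shape[0])
    v = weights @ basis
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
d_model = 16                                   # toy hidden size (assumption)
acts = rng.normal(size=(4, d_model))           # random stand-in for model activations

# Two orthonormal toy directions spanning a 2-dimensional cone (assumption).
q, _ = np.linalg.qr(rng.normal(size=(d_model, 2)))
basis = q.T

r = sample_from_cone(basis, rng)               # one direction inside the cone
ablated = ablate_direction(acts, r)
residual = np.abs(ablated @ r).max()           # component along r left after ablation
print(f"largest remaining component along r: {residual:.2e}")
```

The sketch covers only the linear geometry. Whether two directions are independent under intervention, which the abstract distinguishes from mere orthogonality, can only be assessed by intervening on an actual model and observing its behavior.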

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as: arXiv:2502.17420 [cs.LG] (or arXiv:2502.17420v1 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2502.17420

Submission history

From: Tom Wollschläger
[v1] Mon, 24 Feb 2025 18:52:59 UTC (2,628 KB)
