Steering Large Language Models with Feature Guided Activation Additions

Soo, Samuel; Teng, Wesley; Balaganesh, Chandrasekaran

arXiv:2501.09929 (cs)

[Submitted on 17 Jan 2025 (v1), last revised 20 Jan 2025 (this version, v2)]

Abstract: Effective and reliable control over large language model (LLM) behavior remains a significant challenge. Activation steering methods, which add steering vectors to a model's hidden states, are a promising approach, but existing techniques often lack precision and interpretability in how they influence model outputs. We introduce Feature Guided Activation Additions (FGAA), a novel activation steering method that leverages insights from Contrastive Activation Addition (CAA) and Sparse Autoencoder-Targeted Steering (SAE-TS). By operating in the latent space of a Sparse Autoencoder (SAE) and employing optimization techniques to select desired SAE features, FGAA constructs precise steering vectors that provide better steering effects while maintaining the coherence of steered model outputs. Evaluations on the Gemma-2-2B and Gemma-2-9B models across various steering tasks demonstrate that FGAA outperforms the existing steering methods CAA, SAE decoder steering, and SAE-TS. Our results also highlight important trade-offs between steering scale and general model capabilities that are consistent across all tested steering methods.
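As a rough illustration of the basic operation the abstract refers to (adding a steering vector to a hidden state), the sketch below builds a CAA-style mean-difference vector from toy activations and adds it at a chosen scale. It is not the FGAA method, which additionally works in SAE latent space and selects features by optimization. All function names and numbers are hypothetical, and plain Python lists stand in for real model activations.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def contrastive_steering_vector(pos: List[Vector], neg: List[Vector]) -> Vector:
    """Mean-difference vector: mean(pos) - mean(neg), in the style of CAA."""
    mean_pos, mean_neg = mean_vector(pos), mean_vector(neg)
    return [p - q for p, q in zip(mean_pos, mean_neg)]


def add_steering(hidden: Vector, steer: Vector, scale: float) -> Vector:
    """Add the scaled steering vector to a single hidden state."""
    return [h + scale * s for h, s in zip(hidden, steer)]


# Toy activations (hypothetical numbers, not taken from any model).
pos_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]  # prompts showing the behavior
neg_acts = [[0.0, 2.0, 1.0], [0.0, 2.0, 1.0]]  # prompts lacking the behavior

steer = contrastive_steering_vector(pos_acts, neg_acts)
steered = add_steering([0.5, 0.5, 0.5], steer, scale=2.0)
print(steer)    # [2.0, 0.0, -1.0]
print(steered)  # [4.5, 0.5, -1.5]
```

The `scale` argument corresponds to the steering scale mentioned in the abstract: larger values push the hidden state further along the steering direction, which is where the reported trade-off with general model capabilities arises.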

Comments:	7 main-text pages, 14 appendix pages
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2501.09929 [cs.LG]
	(or arXiv:2501.09929v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2501.09929

Submission history

From: Samuel Soo
[v1] Fri, 17 Jan 2025 02:55:23 UTC (2,393 KB)
[v2] Mon, 20 Jan 2025 02:51:47 UTC (2,394 KB)
