Causal Graph Guided Steering of LLM Values via Prompts and Sparse Autoencoders

Kang, Yipeng; Wang, Junqi; Li, Yexin; Zhong, Fangwei; Feng, Xue; Wang, Mengmeng; Tu, Wenming; Wang, Quansen; Li, Hengli; Zheng, Zilong

arXiv:2501.00581 (cs)

[Submitted on 31 Dec 2024]

Abstract: As large language models (LLMs) become increasingly integrated into critical applications, aligning their behavior with human values presents significant challenges. Current methods, such as Reinforcement Learning from Human Feedback (RLHF), often focus on a limited set of values and can be resource-intensive. Furthermore, the correlation between values has been largely overlooked and remains underutilized. Our framework addresses this limitation by mining a causal graph that elucidates the implicit relationships among various values within the LLMs. Leveraging the causal graph, we implement two lightweight mechanisms for value steering: prompt template steering and Sparse Autoencoder feature steering, and analyze the effects of altering one value dimension on others. Extensive experiments conducted on Gemma-2B-IT and Llama3-8B-IT demonstrate the effectiveness and controllability of our steering methods.

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2501.00581 [cs.CL]
	(or arXiv:2501.00581v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2501.00581

Submission history

From: Yipeng Kang
[v1] Tue, 31 Dec 2024 18:12:05 UTC (1,769 KB)
