Sparse Autoencoder Features for Classifications and Transferability

Gallifant, Jack; Chen, Shan; Sasse, Kuleen; Aerts, Hugo; Hartvigsen, Thomas; Bitterman, Danielle S.

arXiv:2502.11367 (cs)

[Submitted on 17 Feb 2025]

Abstract: Sparse Autoencoders (SAEs) show potential for uncovering structured, human-interpretable representations in Large Language Models (LLMs), making them a crucial tool for transparent and controllable AI systems. We systematically analyze SAEs for interpretable feature extraction from LLMs in safety-critical classification tasks. Our framework evaluates (1) model-layer selection and scaling properties, (2) SAE architectural configurations, including width and pooling strategies, and (3) the effect of binarizing continuous SAE activations. SAE-derived features achieve macro F1 > 0.8, outperforming hidden-state and BoW baselines while demonstrating cross-model transfer from Gemma 2 2B to 9B-IT models. These features generalize in a zero-shot manner to cross-lingual toxicity detection and visual classification tasks. Our analysis highlights the significant impact of pooling strategies and binarization thresholds, showing that binarization offers an efficient alternative to traditional feature selection while maintaining or improving performance. These findings establish new best practices for SAE-based interpretability and enable scalable, transparent deployment of LLMs in real-world applications. Full repo: this https URL.
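
The pooling and binarization steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `pool_activations` and `binarize`, the choice of max/mean/sum pooling, and the threshold value are all assumptions made for illustration, since the abstract does not specify them. The sketch assumes token-level SAE activations are already available as a tokens-by-features matrix.

```python
from typing import List, Sequence


def pool_activations(token_acts: Sequence[Sequence[float]], how: str = "max") -> List[float]:
    """Collapse a (tokens x features) activation matrix into one vector per input.

    The pooling options here (max, mean, sum) are illustrative assumptions.
    """
    if not token_acts:
        raise ValueError("need at least one token")
    columns = list(zip(*token_acts))  # one tuple of token values per SAE feature
    if how == "max":
        return [max(col) for col in columns]
    if how == "mean":
        return [sum(col) / len(col) for col in columns]
    if how == "sum":
        return [sum(col) for col in columns]
    raise ValueError(f"unknown pooling strategy: {how}")


def binarize(pooled: Sequence[float], threshold: float = 0.0) -> List[int]:
    """Map each pooled activation to 1 if it exceeds the threshold, else 0."""
    return [1 if value > threshold else 0 for value in pooled]


# Toy input: 3 tokens, 4 hypothetical SAE features.
toy = [
    [0.0, 0.5, 0.0, 2.0],
    [0.0, 1.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
pooled_max = pool_activations(toy, "max")
binary = binarize(pooled_max, threshold=1.0)  # threshold chosen arbitrarily
```

The resulting binary vector could then be passed to any standard classifier. A binary feature vector also acts as a crude feature selector, since features that never exceed the threshold contribute nothing, which is one way to read the abstract's remark that binarization is an alternative to traditional feature selection.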

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2502.11367 [cs.LG]
	(or arXiv:2502.11367v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2502.11367

Submission history

From: Jack Gallifant
[v1] Mon, 17 Feb 2025 02:30:45 UTC (3,191 KB)
