Interpreting and Steering Protein Language Models through Sparse Autoencoders

Garcia, Edith Natalia Villegas; Ansuini, Alessio

Computer Science > Machine Learning

arXiv:2502.09135 (cs)

[Submitted on 13 Feb 2025]

Abstract: The rapid advancements in transformer-based language models have revolutionized natural language processing, yet understanding the internal mechanisms of these models remains a significant challenge. This paper explores the application of sparse autoencoders (SAE) to interpret the internal representations of protein language models, specifically focusing on the ESM-2 8M parameter model. By performing a statistical analysis on each latent component's relevance to distinct protein annotations, we identify potential interpretations linked to various protein characteristics, including transmembrane regions, binding sites, and specialized motifs.
We then leverage these insights to guide sequence generation, shortlisting the relevant latent components that can steer the model towards desired targets such as zinc finger domains. This work contributes to the emerging field of mechanistic interpretability in biological sequence models, offering new perspectives on model steering for sequence design.
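
As an illustration of the general technique the abstract refers to, the following is a minimal sketch of a sparse autoencoder applied to per-residue activations, followed by a simple steering step. It is not the authors' implementation: the function names, the tied initialization, the L1 penalty coefficient, the latent width of 1280, and the use of random vectors in place of real ESM-2 activations are all assumptions made for this example. The activation width of 320 is assumed to match the hidden size of the ESM-2 8M model.

```python
import numpy as np


def init_sae(d_model, d_latent, seed=0):
    """Create SAE parameters (hypothetical layout, tied at initialization)."""
    rng = np.random.default_rng(seed)
    w_enc = rng.normal(0.0, 0.1, size=(d_model, d_latent))
    return {
        "W_enc": w_enc,
        "b_enc": np.zeros(d_latent),
        "W_dec": w_enc.T.copy(),  # one decoder direction per latent component
        "b_dec": np.zeros(d_model),
    }


def encode(sae, x):
    """Map activations (n, d_model) to non-negative latents (n, d_latent)."""
    return np.maximum(0.0, (x - sae["b_dec"]) @ sae["W_enc"] + sae["b_enc"])


def decode(sae, z):
    """Reconstruct activations (n, d_model) from latents (n, d_latent)."""
    return z @ sae["W_dec"] + sae["b_dec"]


def sae_loss(sae, x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparse latents."""
    z = encode(sae, x)
    recon = np.mean(np.sum((x - decode(sae, z)) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(z), axis=-1))
    return float(recon + l1_coeff * sparsity)


def steer(sae, x, latent_idx, delta):
    """Shift activations along the decoder direction of one latent component."""
    return x + delta * sae["W_dec"][latent_idx]


if __name__ == "__main__":
    # Random stand-in for per-residue activations (assumed width: 320).
    rng = np.random.default_rng(1)
    activations = rng.normal(size=(10, 320))
    sae = init_sae(d_model=320, d_latent=1280)
    print("loss:", sae_loss(sae, activations))
    steered = steer(sae, activations, latent_idx=7, delta=2.0)
    print("steered shape:", steered.shape)
```

In a real pipeline the SAE would be trained on activations extracted from the language model, and the steered activations would be passed back into the model's remaining layers before sampling a sequence; both steps are omitted here.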

Comments:	11 pages, 6 figures
Subjects:	Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Cite as:	arXiv:2502.09135 [cs.LG]
	(or arXiv:2502.09135v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2502.09135

Submission history

From: Alessio Ansuini PhD
[v1] Thu, 13 Feb 2025 10:11:36 UTC (1,254 KB)
