Sparse Auto-Encoder Interprets Linguistic Features in Large Language Models

Jing, Yi; Yao, Zijun; Ran, Lingxu; Guo, Hongzhu; Wang, Xiaozhi; Hou, Lei; Li, Juanzi

Abstract: Large language models (LLMs) excel in tasks that require complex linguistic abilities, such as reference disambiguation and metaphor recognition and generation. Despite these capabilities, their internal mechanisms for processing and representing linguistic knowledge remain largely opaque. Previous work on linguistic mechanisms has been limited by coarse granularity, insufficient causal analysis, and a narrow focus. In this study, we present a systematic and comprehensive causal investigation using sparse auto-encoders (SAEs). We extract a wide range of linguistic features across six dimensions: phonetics, phonology, morphology, syntax, semantics, and pragmatics. We then evaluate and intervene on these features by constructing minimal contrast datasets and counterfactual sentence datasets. We introduce two indices, Feature Representation Confidence (FRC) and Feature Intervention Confidence (FIC), to measure the ability of linguistic features to capture and control linguistic phenomena. Our results reveal inherent representations of linguistic knowledge in LLMs and demonstrate the potential for controlling model outputs. This work provides strong evidence that LLMs possess genuine linguistic knowledge and lays the foundation for more interpretable and controllable language modeling in future research.
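For readers unfamiliar with sparse auto-encoders, the following is a minimal illustrative sketch of the general technique, not the authors' implementation. It assumes a common ReLU encoder with an L1 sparsity penalty, and a simple feature ablation as one possible form of intervention. All function names, shapes, and the `l1_coeff` value are hypothetical, and the abstract does not define FRC or FIC, so neither index is computed here.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """Map activations x (batch, d_model) to non-negative feature activations."""
    return np.maximum(0.0, x @ W_enc + b_enc)

def sae_decode(f, W_dec, b_dec):
    """Reconstruct activations from feature activations f (batch, n_features)."""
    return f @ W_dec + b_dec

def sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparse features.

    l1_coeff is an illustrative value, not one taken from the paper.
    """
    f = sae_encode(x, W_enc, b_enc)
    x_hat = sae_decode(f, W_dec, b_dec)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return recon + l1_coeff * sparsity

def ablate_feature(x, idx, W_enc, b_enc, W_dec, b_dec):
    """Hypothetical intervention: remove one feature's contribution from x.

    The SAE reconstruction error is left in place, so only the decoder
    direction of the chosen feature is subtracted from the activations.
    """
    f = sae_encode(x, W_enc, b_enc)
    f_edit = f.copy()
    f_edit[:, idx] = 0.0
    return x + sae_decode(f_edit, W_dec, b_dec) - sae_decode(f, W_dec, b_dec)
```

In a setup like this, the edited activations would be written back into the model's forward pass and the change in output compared against a counterfactual sentence; how the paper scores that change is not stated in the abstract.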

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2502.20344 [cs.CL]
	(or arXiv:2502.20344v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2502.20344
