LEACE: Perfect linear concept erasure in closed form

Belrose, Nora; Schneider-Joseph, David; Ravfogel, Shauli; Cotterell, Ryan; Raff, Edward; Biderman, Stella

arXiv:2306.03819v3 (cs)

[Submitted on 6 Jun 2023 (v1), last revised 29 Oct 2023 (this version, v3)]

Abstract: Concept erasure aims to remove specified features from a representation. It can improve fairness (e.g. preventing a classifier from using gender or race) and interpretability (e.g. removing a concept to observe changes in model behavior). We introduce LEAst-squares Concept Erasure (LEACE), a closed-form method which provably prevents all linear classifiers from detecting a concept while changing the representation as little as possible, as measured by a broad class of norms. We apply LEACE to large language models with a novel procedure called "concept scrubbing," which erases target concept information from every layer in the network. We demonstrate our method on two tasks: measuring the reliance of language models on part-of-speech information, and reducing gender bias in BERT embeddings. Code is available at this https URL.
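To make the idea of closed-form linear erasure concrete, here is a minimal sketch in plain Python. It is not the authors' code and the abstract does not state the formula. The sketch assumes a single scalar concept label z, an invertible covariance matrix for the representation x, and the criterion that a representation is protected against linear classifiers when its cross-covariance with z is zero. Under those assumptions, the affine map r(x) = x - S_xz (a . (x - mu)) / (a . S_xz), with a = S_xx^{-1} S_xz, zeroes the cross-covariance. The names fit_eraser, solve and cross_cov and the toy data are hypothetical.

```python
import random


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def cross_cov(xs, zs):
    """Empirical cross-covariance between each feature of x and the label z."""
    n = len(xs)
    mu = [sum(col) / n for col in zip(*xs)]
    z_bar = sum(zs) / n
    return [
        sum((x[i] - mu[i]) * (z - z_bar) for x, z in zip(xs, zs)) / n
        for i in range(len(mu))
    ]


def fit_eraser(xs, zs):
    """Return an affine map that zeroes the cross-covariance with z.

    Assumption: the covariance of x is invertible and z is a scalar.
    """
    n, d = len(xs), len(xs[0])
    mu = [sum(col) / n for col in zip(*xs)]
    xc = [[v - m for v, m in zip(x, mu)] for x in xs]
    cov = [[sum(r[i] * r[j] for r in xc) / n for j in range(d)] for i in range(d)]
    s_xz = cross_cov(xs, zs)
    a = solve(cov, s_xz)  # a = S_xx^{-1} S_xz
    denom = sum(ai * si for ai, si in zip(a, s_xz))

    def erase(x):
        coef = sum(ai * (xi - mi) for ai, xi, mi in zip(a, x, mu)) / denom
        return [xi - coef * si for xi, si in zip(x, s_xz)]

    return erase


# Toy data: a binary concept leaks into the first two of three features.
random.seed(0)
zs = [1.0 if random.random() < 0.5 else 0.0 for _ in range(500)]
xs = [
    [z + random.gauss(0, 1), 0.5 * z + random.gauss(0, 1), random.gauss(0, 1)]
    for z in zs
]

erase = fit_eraser(xs, zs)
erased = [erase(x) for x in xs]

before = cross_cov(xs, zs)
after = cross_cov(erased, zs)
print("cross-covariance before:", before)
print("cross-covariance after: ", after)
```

On the fitting data, the cross-covariance after erasure is zero up to floating-point error, because the map subtracts exactly the component of x that covaries with z. The sketch does not handle singular covariance matrices or multi-dimensional concept labels.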

Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL); Computers and Society (cs.CY)
Cite as:	arXiv:2306.03819 [cs.LG]
	(or arXiv:2306.03819v3 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2306.03819

Submission history

From: Nora Belrose
[v1] Tue, 6 Jun 2023 16:07:24 UTC (92 KB)
[v2] Fri, 23 Jun 2023 00:16:46 UTC (164 KB)
[v3] Sun, 29 Oct 2023 21:41:46 UTC (197 KB)
