Unpacking SDXL Turbo: Interpreting Text-to-Image Models with Sparse Autoencoders

Surkov, Viacheslav; Wendler, Chris; Terekhov, Mikhail; Deschenaux, Justin; West, Robert; Gulcehre, Caglar

arXiv:2410.22366 (cs)

[Submitted on 28 Oct 2024]

Abstract: Sparse autoencoders (SAEs) have become a core ingredient in the reverse engineering of large language models (LLMs). For LLMs, they have been shown to decompose intermediate representations, which often are not directly interpretable, into sparse sums of interpretable features, facilitating better control and subsequent analysis. However, similar analyses and approaches have been lacking for text-to-image models. We investigate the possibility of using SAEs to learn interpretable features for few-step text-to-image diffusion models, such as SDXL Turbo. To this end, we train SAEs on the updates performed by transformer blocks within SDXL Turbo's denoising U-net. We find that their learned features are interpretable, causally influence the generation process, and reveal specialization among the blocks. In particular, we find one block that deals mainly with image composition, one that is mainly responsible for adding local details, and one for color, illumination, and style. Therefore, our work is an important first step towards better understanding the internals of generative text-to-image models like SDXL Turbo and showcases the potential of features learned by SAEs for the visual domain.
Code is available at this https URL
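
As a rough illustration of the idea in the abstract (decomposing a transformer block's update into a sparse sum of features), the following is a minimal sketch, not the authors' implementation. It assumes a generic top-k SAE with ReLU activations; the function name `sae_forward`, the dimensions, the random weights, and the choice of top-k sparsity are all hypothetical, and the "block update" is simulated as block output minus block input.

```python
import numpy as np

def sae_forward(update, W_enc, b_enc, W_dec, b_dec, k):
    """Encode block updates into at most k active features and reconstruct them.

    Assumed (hypothetical) SAE form: ReLU encoder, top-k sparsity, linear decoder.
    """
    pre = (update - b_dec) @ W_enc + b_enc        # (n, n_features) pre-activations
    acts = np.maximum(pre, 0.0)                   # non-negative feature activations
    # Indices of everything except the k largest activations in each row.
    drop = np.argpartition(acts, -k, axis=-1)[..., :-k]
    np.put_along_axis(acts, drop, 0.0, axis=-1)   # zero them to enforce sparsity
    recon = acts @ W_dec + b_dec                  # sparse sum of decoder rows
    return acts, recon

# Toy stand-ins for a block's input and output activations (assumed shapes).
rng = np.random.default_rng(0)
d_model, n_features, k = 16, 64, 4
block_in = rng.normal(size=(8, d_model))
block_out = block_in + rng.normal(size=(8, d_model))
update = block_out - block_in                     # the quantity the SAE models

# Untrained random weights, for shape illustration only.
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

acts, recon = sae_forward(update, W_enc, b_enc, W_dec, b_dec, k)
mse = float(np.mean((recon - update) ** 2))       # reconstruction loss to minimize
```

In a sketch like this, each row of `W_dec` plays the role of one feature direction, so scaling a single column of `acts` before decoding would be one way to probe a feature's causal effect on the update.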

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2410.22366 [cs.LG]
	(or arXiv:2410.22366v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2410.22366

Submission history

From: Viacheslav Surkov
[v1] Mon, 28 Oct 2024 19:01:18 UTC (24,665 KB)
