Unifying Image Counterfactuals and Feature Attributions with Latent-Space Adversarial Attacks

Goldwasser, Jeremy; Hooker, Giles

Computer Science > Machine Learning

arXiv:2504.15479 (cs)

[Submitted on 21 Apr 2025]

Abstract: Counterfactuals are a popular framework for interpreting machine learning predictions. These "what if" explanations are notoriously challenging to create for computer vision models: standard gradient-based methods are prone to produce adversarial examples, in which imperceptible modifications to image pixels provoke large changes in predictions. We introduce a new, easy-to-implement framework for counterfactual images that can flexibly adapt to contemporary advances in generative modeling. Our method, Counterfactual Attacks, resembles an adversarial attack on the representation of the image along a low-dimensional manifold. In addition, given an auxiliary dataset of image descriptors, we show how to accompany counterfactuals with feature attributions that quantify the changes between the original and counterfactual images. These importance scores can be aggregated into global counterfactual explanations that highlight the overall features driving model predictions. While this unification is possible for any counterfactual method, it is particularly computationally efficient for ours. We demonstrate the efficacy of our approach on the MNIST and CelebA datasets.
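
The idea of attacking a latent representation rather than raw pixels, then summarizing the change with descriptor-based attributions, can be illustrated with a small NumPy sketch. Everything here is a hypothetical toy, not the authors' implementation: a linear "decoder" `decode` stands in for a generative model, a logistic `predict_proba` stands in for the vision classifier, and `counterfactual_attack` runs plain gradient descent on binary cross-entropy plus a proximity penalty `lam * ||z - z0||^2`. The paper's actual generator, loss, and attack may differ. For attributions, the sketch assumes one simple reading: fit a least-squares model from latent codes to the auxiliary descriptors, report the predicted descriptor change between original and counterfactual (`attribution`), and average absolute changes over many pairs for a global summary (`global_importance`).

```python
import numpy as np

# --- Hypothetical toy models (stand-ins, not the paper's models) ------------
# Linear "decoder" G(z) = W @ z maps a low-dimensional latent code to an image
# vector; a logistic "classifier" f(x) = sigmoid(w_clf . x) scores the image.
rng = np.random.default_rng(0)
LATENT_DIM, IMAGE_DIM, N_DESCRIPTORS = 4, 16, 3
W = rng.normal(size=(IMAGE_DIM, LATENT_DIM))
w_clf = rng.normal(size=IMAGE_DIM)


def decode(z):
    return W @ z


def predict_proba(x):
    return 1.0 / (1.0 + np.exp(-(w_clf @ x)))


def counterfactual_attack(z0, target=1.0, lam=0.1, lr=0.01, steps=1000):
    """Gradient descent on the latent code, minimizing
    BCE(f(G(z)), target) + lam * ||z - z0||^2."""
    z = np.array(z0, dtype=float)  # copy so z0 is left untouched
    a = W.T @ w_clf  # gradient of the logit w.r.t. z (decoder is linear)
    for _ in range(steps):
        p = predict_proba(decode(z))
        # d(BCE)/d(logit) = p - target; the penalty keeps z close to z0.
        grad = (p - target) * a + 2.0 * lam * (z - z0)
        z -= lr * grad
    return z


# --- Hypothetical auxiliary descriptor dataset ------------------------------
# Latent codes paired with numeric descriptors (e.g. attribute scores).
Z_aux = rng.normal(size=(200, LATENT_DIM))
B_true = rng.normal(size=(LATENT_DIM, N_DESCRIPTORS))
D_aux = Z_aux @ B_true + 0.01 * rng.normal(size=(200, N_DESCRIPTORS))

# Fit a linear descriptor model D ~ Z @ B_hat by least squares.
B_hat, *_ = np.linalg.lstsq(Z_aux, D_aux, rcond=None)


def attribution(z0, z_cf):
    """Change in predicted descriptors between original and counterfactual."""
    return z_cf @ B_hat - z0 @ B_hat


def global_importance(pairs):
    """Mean absolute descriptor change across many (z0, z_cf) pairs."""
    return np.mean([np.abs(attribution(z0, zc)) for z0, zc in pairs], axis=0)


# Usage: start from a latent code the classifier places on the negative side.
a = W.T @ w_clf
z0 = -a / np.linalg.norm(a)
z_cf = counterfactual_attack(z0)
print("p(original):      ", predict_proba(decode(z0)))
print("p(counterfactual):", predict_proba(decode(z_cf)))
print("attribution:      ", attribution(z0, z_cf))
```

Because the penalty only pulls `z` back toward `z0`, the optimizer moves the latent code just far enough to flip the prediction, and the decoded counterfactual stays on the (toy) generator's output manifold rather than picking up pixel-level adversarial noise.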

Subjects:	Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2504.15479 [cs.LG]
	(or arXiv:2504.15479v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2504.15479

Submission history

From: Jeremy Goldwasser
[v1] Mon, 21 Apr 2025 23:09:30 UTC (7,955 KB)