Adversarial Attacks on the Interpretation of Neuron Activation Maximization

Nanfack, Geraldin; Fulleringer, Alexander; Marty, Jonathan; Eickenberg, Michael; Belilovsky, Eugene

Computer Science > Machine Learning

arXiv:2306.07397 (cs)

[Submitted on 12 Jun 2023]

Abstract: The internal functional behavior of trained deep neural networks is notoriously difficult to interpret. Activation-maximization approaches are one set of techniques used to interpret and analyze trained deep-learning models. They consist of finding inputs that maximally activate a given neuron or feature map; these inputs can be selected from a data set or obtained by optimization. However, interpretability methods can themselves be deceived. In this work, we consider an adversary who manipulates a model in order to deceive the interpretation. We propose an optimization framework for performing this manipulation and demonstrate several ways in which popular activation-maximization interpretation techniques for CNNs can be manipulated to change the interpretations, shedding light on the reliability of these methods.
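
As a rough illustration of the two activation-maximization variants mentioned in the abstract (inputs selected from a data set, or obtained by optimization), the following Python sketch uses a toy linear unit rather than a CNN. The functions `unit_activation`, `top_k_activating`, and `maximize_activation`, the unit-ball constraint, and all parameter values are assumptions made for this example; it is not the paper's attack or its optimization framework.

```python
import math

def unit_activation(x, w, b):
    """Toy unit: an affine function of the input (assumed here; the paper studies CNN units)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def top_k_activating(dataset, w, b, k):
    """Data-set variant: return the k inputs with the highest activation."""
    return sorted(dataset, key=lambda x: unit_activation(x, w, b), reverse=True)[:k]

def project_to_unit_ball(x):
    """Keep the input bounded; without a constraint the activation is unbounded."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    return list(x) if norm <= 1.0 else [xi / norm for xi in x]

def maximize_activation(w, b, x0, steps=200, lr=0.1):
    """Optimization variant: projected gradient ascent on the input.

    For the affine toy unit the gradient of the activation with respect
    to the input is simply w, so no autodiff library is needed.
    """
    x = list(x0)
    for _ in range(steps):
        x = project_to_unit_ball([xi + lr * wi for xi, wi in zip(x, w)])
    return x

if __name__ == "__main__":
    w, b = [3.0, 4.0], 0.0
    dataset = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    print(top_k_activating(dataset, w, b, k=2))
    print(maximize_activation(w, b, x0=[0.0, 0.0]))
```

In this toy setting the optimized input aligns with the direction of `w`. In practice, activation maximization is typically run on a deep network with automatic differentiation and additional regularization of the input.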

Subjects:	Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2306.07397 [cs.LG]
	(or arXiv:2306.07397v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2306.07397

Submission history

From: Geraldin Nanfack
[v1] Mon, 12 Jun 2023 19:54:33 UTC (42,151 KB)
