2019
DOI: 10.48550/arxiv.1902.02041
Preprint

Fooling Neural Network Interpretations via Adversarial Model Manipulation

Abstract: We ask whether the neural network interpretation methods can be fooled via adversarial model manipulation, which is defined as a model fine-tuning step that aims to radically alter the explanations without hurting the accuracy of the original models, e.g., VGG19, ResNet50, and DenseNet121. By incorporating the interpretation results directly in the penalty term of the objective function for fine-tuning, we show that the state-of-the-art saliency map-based interpreters, e.g., LRP, Grad-CAM, and SimpleGrad, can be…
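
The penalty-term idea described in the abstract can be sketched roughly as follows. This is a minimal illustration only, not the paper's actual objective: the SimpleGrad saliency (rather than LRP or Grad-CAM), the center-crop fooling region, and the hyperparameters (`lam`, learning rate) are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50


def simple_grad_saliency(model, x, y):
    """SimpleGrad: per-pixel absolute input gradient of the target class score."""
    x = x.clone().requires_grad_(True)
    score = model(x).gather(1, y.unsqueeze(1)).sum()
    grad, = torch.autograd.grad(score, x, create_graph=True)
    return grad.abs().sum(dim=1)  # (B, H, W) saliency map


def adversarial_finetune_step(model, optimizer, x, y, lam=1.0):
    """One fine-tuning step: keep classification accuracy while penalizing
    saliency mass in the (hypothetical) center region of the image."""
    optimizer.zero_grad()
    cls_loss = F.cross_entropy(model(x), y)

    sal = simple_grad_saliency(model, x, y)
    sal = sal / (sal.sum(dim=(1, 2), keepdim=True) + 1e-8)  # normalize per image
    h, w = sal.shape[1:]
    center = sal[:, h // 4: 3 * h // 4, w // 4: 3 * w // 4]
    fool_loss = center.sum(dim=(1, 2)).mean()  # saliency mass inside the center crop

    # Interpretation result enters the objective as a penalty term.
    loss = cls_loss + lam * fool_loss
    loss.backward()
    optimizer.step()
    return cls_loss.item(), fool_loss.item()


model = resnet50()  # pretrained ImageNet weights would be loaded in practice
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
```

Minimizing `fool_loss` pushes the explanation away from the image center while `cls_loss` keeps predictions intact, which is the general flavor of the model-manipulation attack, not its exact formulation.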

Cited by 11 publications (12 citation statements) | References 26 publications

“…While adversarial examples for classification are well-known, recently there has been growing interest in adversarial manipulation of explanations (Ghorbani et al., 2019; Heo et al., 2019; Dombrowski et al., 2019). Attacks on explanations can serve multiple purposes, including "fairwashing" (Aïvodji et al., 2019).…”
Section: Adversarial Attacks on Explanation Methods (mentioning)
confidence: 99%
“…The goal is to manipulate the explanation while keeping the input and output (visually) similar. It is assumed that the network architecture and weights are known and that either the input (Ghorbani et al., 2019; Zhang et al., 2018b; Dombrowski et al., 2019) or the network weights (Heo et al., 2019) can be changed by the attacker.…”
Section: Adversarial Attacks on Explanation Methods (mentioning)
confidence: 99%
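
For contrast with the weight-manipulation setting above, an input-perturbation attack on an explanation (in the spirit of Ghorbani et al., 2019) can be sketched as below. The SimpleGrad saliency, the MSE objective toward a hypothetical `target_map`, and the step sizes are assumptions of this sketch, not the cited papers' exact procedures.

```python
import torch
import torch.nn.functional as F


def input_attack_on_saliency(model, x, y, target_map, steps=100, eps=8 / 255, lr=1e-2):
    """Perturb the input so the SimpleGrad saliency moves toward `target_map`,
    while the perturbation stays small enough to keep the input visually similar."""
    for p in model.parameters():
        p.requires_grad_(False)  # only the input perturbation is optimized

    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        x_adv = (x + delta).clamp(0, 1)
        score = model(x_adv).gather(1, y.unsqueeze(1)).sum()
        grad, = torch.autograd.grad(score, x_adv, create_graph=True)
        sal = grad.abs().sum(dim=1)
        sal = sal / (sal.sum(dim=(1, 2), keepdim=True) + 1e-8)

        loss = F.mse_loss(sal, target_map)  # move the explanation, not the prediction
        opt.zero_grad()
        loss.backward()
        opt.step()
        delta.data.clamp_(-eps, eps)  # bound the perturbation

    return (x + delta).detach().clamp(0, 1)
```

The model-manipulation attack of Heo et al. (2019) instead fine-tunes the weights once, so every subsequent input yields a misleading explanation without any per-input perturbation.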
“…The above techniques can be applied to an ensemble of models, thus increasing the probability of creating a transferable attack, or even a universal attack [23]. Some attacks target other aspects of NN computation; for example, they attempt to change the heatmaps produced by various interpretation methods [24], [25], or attack through model manipulation [26] or through poisoning the training data [27], [28] rather than through input perturbations.…”
Section: Adversarial Attacks (mentioning)
confidence: 99%
“…Given the lack of consensus around benchmarking interpretability, recent methods have proposed re-training with feature ablation, among others [13,28]. Recent research has also tested the robustness of interpretability to adversarial attacks, model randomization, and input perturbation [1,11,22]. These existing benchmarks are not directly relevant to semantic interpretability use cases.…”
Section: Related Work (mentioning)
confidence: 99%