2019
DOI: 10.48550/arxiv.1902.02041
Preprint

Fooling Neural Network Interpretations via Adversarial Model Manipulation

Abstract: We ask whether the neural network interpretation methods can be fooled via adversarial model manipulation, which is defined as a model fine-tuning step that aims to radically alter the explanations without hurting the accuracy of the original models, e.g., VGG19, ResNet50, and DenseNet121. By incorporating the interpretation results directly in the penalty term of the objective function for fine-tuning, we show that the state-of-the-art saliency map-based interpreters, e.g., LRP, Grad-CAM, and SimpleGrad, can be…
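
The penalty-term idea described in the abstract can be sketched roughly as follows. This is a minimal illustration only, not the paper's actual objective: the SimpleGrad saliency (rather than LRP or Grad-CAM), the center-crop fooling region, and the hyperparameters (`lam`, learning rate) are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50


def simple_grad_saliency(model, x, y):
    """SimpleGrad: per-pixel absolute input gradient of the target class score."""
    x = x.clone().requires_grad_(True)
    score = model(x).gather(1, y.unsqueeze(1)).sum()
    grad, = torch.autograd.grad(score, x, create_graph=True)
    return grad.abs().sum(dim=1)  # (B, H, W) saliency map


def adversarial_finetune_step(model, optimizer, x, y, lam=1.0):
    """One fine-tuning step: keep classification accuracy while penalizing
    saliency mass in the (hypothetical) center region of the image."""
    optimizer.zero_grad()
    cls_loss = F.cross_entropy(model(x), y)

    sal = simple_grad_saliency(model, x, y)
    sal = sal / (sal.sum(dim=(1, 2), keepdim=True) + 1e-8)  # normalize per image
    h, w = sal.shape[1:]
    center = sal[:, h // 4: 3 * h // 4, w // 4: 3 * w // 4]
    fool_loss = center.sum(dim=(1, 2)).mean()  # saliency mass inside the center crop

    # Interpretation result enters the objective as a penalty term.
    loss = cls_loss + lam * fool_loss
    loss.backward()
    optimizer.step()
    return cls_loss.item(), fool_loss.item()


model = resnet50()  # pretrained ImageNet weights would be loaded in practice
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
```

Minimizing `fool_loss` pushes the explanation away from the image center while `cls_loss` keeps predictions intact, which is the general flavor of the model-manipulation attack, not its exact formulation.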

Cited by 11 publications (12 citation statements) | References 26 publications

“…While adversarial examples for classification are well-known, recently there has been growing interest in adversarial manipulation of explanations (Ghorbani et al., 2019; Heo et al., 2019; Dombrowski et al., 2019). Attacks on explanations can serve multiple purposes, including "fairwashing" (Aïvodji et al., 2019).…”
Section: Adversarial Attacks on Explanation Methods (mentioning)
confidence: 99%
“…The goal is to manipulate the explanation while keeping the input and output (visually) similar. It is assumed that the network architecture and weights are known and that either the input (Ghorbani et al., 2019; Zhang et al., 2018b; Dombrowski et al., 2019) or the network weights (Heo et al., 2019) can be changed by the attacker.…”
Section: Adversarial Attacks on Explanation Methods (mentioning)
confidence: 99%
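
For contrast with the weight-manipulation setting above, an input-perturbation attack on an explanation (in the spirit of Ghorbani et al., 2019) can be sketched as below. The SimpleGrad saliency, the MSE objective toward a hypothetical `target_map`, and the step sizes are assumptions of this sketch, not the cited papers' exact procedures.

```python
import torch
import torch.nn.functional as F


def input_attack_on_saliency(model, x, y, target_map, steps=100, eps=8 / 255, lr=1e-2):
    """Perturb the input so the SimpleGrad saliency moves toward `target_map`,
    while the perturbation stays small enough to keep the input visually similar."""
    for p in model.parameters():
        p.requires_grad_(False)  # only the input perturbation is optimized

    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        x_adv = (x + delta).clamp(0, 1)
        score = model(x_adv).gather(1, y.unsqueeze(1)).sum()
        grad, = torch.autograd.grad(score, x_adv, create_graph=True)
        sal = grad.abs().sum(dim=1)
        sal = sal / (sal.sum(dim=(1, 2), keepdim=True) + 1e-8)

        loss = F.mse_loss(sal, target_map)  # move the explanation, not the prediction
        opt.zero_grad()
        loss.backward()
        opt.step()
        delta.data.clamp_(-eps, eps)  # bound the perturbation

    return (x + delta).detach().clamp(0, 1)
```

The model-manipulation attack of Heo et al. (2019) instead fine-tunes the weights once, so every subsequent input yields a misleading explanation without any per-input perturbation.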
“…The above techniques can be applied to an ensemble of models, thus increasing the probability of creating a transferable attack, or even a universal attack [23]. Some attacks target other aspects of NN computation; for example, they attempt to change the heatmaps produced by various interpretation methods [24], [25], or attack through model manipulation [26] or through poisoning the training data [27], [28] rather than through input perturbations.…”
Section: Adversarial Attacks (mentioning)
confidence: 99%
“…Given the lack of consensus around benchmarking interpretability, recent methods have proposed re-training with feature ablation, among others [13,28]. Recent research has also tested the robustness of interpretability to adversarial attacks, model randomization, and input perturbation [1,11,22]. These existing benchmarks are not directly relevant to semantic interpretability use cases.…”
Section: Related Work (mentioning)
confidence: 99%