Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods

Slack, Dylan; Hilgard, Sophie; Jia, Emily; Singh, Sameer; Lakkaraju, Himabindu

doi:10.48550/arxiv.1911.02508

Cited by 14 publications (19 citation statements)

References 0 publications


“…LIME, and to a lesser extent SHAP, have been demonstrated to provide unreliable interpretations in some cases. For instance, LIME is strongly influenced by the chosen kernel width parameter (Slack et al 2019). In Section 6, we compare our new class of fMEs to LIME.…”

Section: Interpretable Machine Learning (mentioning; confidence: 99%)
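The kernel-width sensitivity mentioned in the quote can be illustrated with a minimal LIME-style sketch. This is not the `lime` package itself: the helper names (`lime_kernel`, `local_slope`), the Gaussian perturbation scheme, and the toy model `np.sin` are assumptions chosen for illustration. Only the kernel's functional form is meant to resemble the exponential kernel used by `lime`'s tabular explainer. The model, the instance, and the sampled perturbations stay fixed; only the kernel width changes.

```python
import numpy as np

def lime_kernel(d, kernel_width):
    # Exponential kernel, similar in form to the default in lime's tabular explainer (assumption).
    return np.sqrt(np.exp(-(d ** 2) / kernel_width ** 2))

def local_slope(f, x0, kernel_width, n_samples=5000, noise_scale=1.0, seed=0):
    """Fit a kernel-weighted linear surrogate to a 1-D function f around x0 and return its slope."""
    rng = np.random.default_rng(seed)
    # Gaussian perturbations around x0 (a simplifying assumption; lime's own sampler differs).
    z = x0 + rng.normal(0.0, noise_scale, size=n_samples)
    w = lime_kernel(np.abs(z - x0), kernel_width)
    # Weighted least squares: scale each row by sqrt(weight), then solve ordinary least squares.
    X = np.column_stack([np.ones_like(z), z - x0])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], f(z) * sw, rcond=None)
    return coef[1]

if __name__ == "__main__":
    # Same model, same instance, same seed; only the kernel width differs between lines.
    for width in (0.1, 0.5, 2.0, 10.0):
        print(f"kernel_width={width:>5}: slope={local_slope(np.sin, 1.0, width):.3f}")
    print(f"true derivative cos(1) = {np.cos(1.0):.3f}")
```

With a narrow kernel the surrogate slope stays close to the local derivative; with a wide kernel it averages the curvature of `sin` over a much larger neighbourhood and the "explanation" of the same prediction changes noticeably.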

“…In addition to the sensitivity of results regarding parameter choices (Slack et al 2019), LIME is notoriously unstable even with fixed parameters. Zhou et al (2021) note that repeated runs using the same explanation algorithm on the same model for the same observation result in different model explanations, and they suggest significance testing as a remedy.…”

Section: Interpretation and Confidence Intervals (mentioning; confidence: 99%)
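The run-to-run instability described above can be reproduced with a small LIME-style surrogate whose only source of randomness is the perturbation sample. This is a hypothetical sketch (`local_slope`, `repeated_explanations`, and the toy model `np.sin` are illustrative names), and the normal-approximation interval at the end is just one simple way to summarise repeated runs, not necessarily the significance-testing procedure Zhou et al. propose.

```python
import numpy as np

def local_slope(f, x0, kernel_width=0.75, n_samples=100, seed=0):
    """Kernel-weighted linear surrogate slope around x0 (LIME-style sketch, not the lime package)."""
    rng = np.random.default_rng(seed)
    z = x0 + rng.normal(0.0, 1.0, size=n_samples)                       # random perturbations
    w = np.sqrt(np.exp(-((z - x0) ** 2) / kernel_width ** 2))            # exponential kernel weights
    X = np.column_stack([np.ones_like(z), z - x0])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], f(z) * sw, rcond=None)  # weighted least squares
    return coef[1]

def repeated_explanations(f, x0, n_runs=30, **kwargs):
    # Identical model, instance, and settings; only the random seed differs between runs.
    return np.array([local_slope(f, x0, seed=s, **kwargs) for s in range(n_runs)])

slopes = repeated_explanations(np.sin, 1.0)
mean, sd = slopes.mean(), slopes.std(ddof=1)
# Normal-approximation 95% interval for the mean slope across runs.
half_width = 1.96 * sd / np.sqrt(len(slopes))
print(f"slopes range from {slopes.min():.3f} to {slopes.max():.3f}")
print(f"mean {mean:.3f}, 95% interval [{mean - half_width:.3f}, {mean + half_width:.3f}]")
```

A single run reports one slope as if it were the explanation; repeating the run exposes the spread, and reporting an interval makes that uncertainty explicit.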


Marginal Effects for Non-Linear Prediction Functions

Scholbeck, Casalicchio, Molnar, et al. (2022). Preprint.


Beta coefficients for linear regression models represent the ideal form of an interpretable feature effect. However, for non-linear models and especially generalized linear models, the estimated coefficients cannot be interpreted as a direct feature effect on the predicted outcome. Hence, marginal effects are typically used as approximations for feature effects, either in the form of derivatives of the prediction function or forward differences in prediction due to a change in a feature value. While marginal effects are commonly used in many scientific fields, they have not yet been adopted as a model-agnostic interpretation method for machine learning models. This may stem from their inflexibility as a univariate feature effect and their inability to deal with the non-linearities found in black box models. We introduce a new class of marginal effects termed forward marginal effects. We argue for abandoning derivatives in favor of better-interpretable forward differences. Furthermore, we generalize marginal effects based on forward differences to multivariate changes in feature values. To account for the non-linearity of prediction functions, we introduce a non-linearity measure for marginal effects. We argue against summarizing feature effects of a non-linear prediction function in a single metric such as the average marginal effect. Instead, we propose to partition the feature space to compute conditional average marginal effects on feature subspaces, which serve as conditional feature effect estimates.

This work has been partially supported by the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IS18036A. The authors of this work take full responsibility for its content. We thank the anonymous reviewers for their constructive comments, specifically on structuring the paper, on the line integral for the non-linearity measure, and on the instabilities of decision trees.
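The core quantities named in the abstract can be sketched in a few lines: a forward marginal effect is the change in prediction f(x + h) - f(x) for a step vector h (univariate or multivariate), the average marginal effect averages it over observations, and a conditional average restricts that mean to a feature subspace. The model `predict`, the data, and the fixed split at -0.5 below are hypothetical; the abstract proposes partitioning the feature space (the acknowledgments mention decision trees), and the non-linearity measure is not reproduced here.

```python
import numpy as np

def predict(X):
    # Hypothetical non-linear prediction function standing in for a fitted black-box model.
    return X[:, 0] ** 2 + np.sin(3.0 * X[:, 1])

def forward_marginal_effects(f, X, h):
    """Forward marginal effect per observation: f(x + h) - f(x) for a step vector h."""
    h = np.asarray(h, dtype=float)
    return f(X + h) - f(X)

rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(1000, 2))

# Univariate step: increase feature 0 by one unit, leave feature 1 unchanged.
fme = forward_marginal_effects(predict, X, h=[1.0, 0.0])
ame = fme.mean()  # average marginal effect over all observations

# Conditional averages on a hypothetical split of feature 0 at -0.5
# (a fixed threshold used only for illustration, not a learned partition).
left = X[:, 0] < -0.5
came_left, came_right = fme[left].mean(), fme[~left].mean()

# A multivariate step changes both features at once.
fme_joint = forward_marginal_effects(predict, X, h=[1.0, 0.5])

print(f"AME = {ame:.2f}, cAME(x0 < -0.5) = {came_left:.2f}, cAME(x0 >= -0.5) = {came_right:.2f}")
```

For this toy model the univariate forward marginal effect is exactly (x₀ + 1)² - x₀² = 2x₀ + 1, so the single average (close to 1 on uniform data over [-2, 2]) hides that the effect is negative on one subspace and clearly positive on the other, which is the argument for conditional averages.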


“…One might intuit that a post-hoc explanation would never lead to a worse decision than one made using the same underlying model absent explanation. Recent research however has shown that not only are XAI methods innocuously fragile in practice [43,47], they are also susceptible to adversarial intervention [1,46,71]. In addition to these algorithmic issues, irreducible cognitive factors and intrinsic human biases [23,31,32] can perpetuate harmful effects in any algorithmically aided decision making context (explanations or not).…”

Section: Axiomatic Assumptions (mentioning; confidence: 99%)
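The adversarial intervention studied in "Fooling LIME and SHAP" can be sketched as a scaffolding wrapper, as we understand the attack: inputs that look like real data are answered by a biased model, while inputs that look like explainer perturbations are answered by an innocuous model, so perturbation-based explanations see mostly the innocuous behaviour. Everything below is a hypothetical illustration: the data, `biased_model`, `innocuous_model`, and the perturbation sampler are invented, and the nearest-neighbour distance rule replaces the trained out-of-distribution classifier used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_real(n):
    # Hypothetical data: column 0 is a binary sensitive attribute, column 1 an unrelated continuous feature.
    return np.column_stack([rng.integers(0, 2, size=n).astype(float), rng.normal(size=n)])

X_train = sample_real(500)

def biased_model(X):
    # The behaviour the adversary wants to hide: decisions driven by the sensitive attribute.
    return (X[:, 0] > 0.5).astype(int)

def innocuous_model(X):
    # The behaviour shown to the explainer: decisions driven by the unrelated feature.
    return (X[:, 1] > 0.0).astype(int)

def is_off_manifold(Z, X_ref, threshold=0.1):
    # Stand-in OOD detector: Euclidean distance to the nearest reference point exceeds a threshold.
    dists = np.sqrt(((Z[:, None, :] - X_ref[None, :, :]) ** 2).sum(axis=2))
    return dists.min(axis=1) > threshold

def scaffolded_model(Z):
    # Real-looking inputs go to the biased model, perturbation-like inputs to the innocuous one.
    return np.where(is_off_manifold(Z, X_train), innocuous_model(Z), biased_model(Z))

X_test = sample_real(500)
# Simplified perturbation sampler: independent Gaussian noise per feature, scaled by training statistics.
Z = X_train.mean(axis=0) + X_train.std(axis=0) * rng.normal(size=(1000, 2))

real_agreement = (scaffolded_model(X_test) == biased_model(X_test)).mean()
off_fraction = is_off_manifold(Z, X_train).mean()
y_pert = scaffolded_model(Z)
corr_sensitive = np.corrcoef(Z[:, 0], y_pert)[0, 1]
corr_unrelated = np.corrcoef(Z[:, 1], y_pert)[0, 1]
print(f"agreement with biased model on real data: {real_agreement:.2f}")
print(f"perturbations flagged off-manifold: {off_fraction:.2f}")
print(f"on perturbations, corr(sensitive, output) = {corr_sensitive:.2f}, corr(unrelated, output) = {corr_unrelated:.2f}")
```

Because the perturbations produce non-binary values for the sensitive attribute, most of them fall off the data manifold, so an explainer fitting a surrogate to `y_pert` would attribute the decisions mainly to the unrelated feature even though real inputs are decided by the sensitive one.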

Challenging common interpretability assumptions in feature attribution explanations

Dinu, Bigham, Kolter (2020). Preprint.


As machine learning and algorithmic decision making systems are increasingly being leveraged in high-stakes human-in-the-loop settings, there is a pressing need to understand the rationale of their predictions. Researchers have responded to this need with explainable AI (XAI), but often proclaim interpretability axiomatically without evaluation. When these systems are evaluated, they are often tested through offline simulations with proxy metrics of interpretability (such as model complexity). We empirically evaluate the veracity of three common interpretability assumptions through a large-scale human-subjects experiment with a simple "placebo explanation" control. We find that feature attribution explanations provide marginal utility in our task for a human decision maker and in certain cases result in worse decisions due to cognitive and contextual confounders. This result challenges the assumed universal benefit of applying these methods and we hope this work will underscore the importance of human evaluation in XAI research. Supplemental materials, including anonymized data from the experiment, code to replicate the study, an interactive demo of the experiment, and the models used in the analysis, can be found at: https://doi.pizza/challenging-xai.


“…These methods arose in computer vision and have demonstrated empirical utility in producing nonlinear factor models where the factors are conceptually sensible. Yet, due to the black-box nature of deep learning, explanations for how the factors are generated from the data, using local saliency maps for instance, are unreliable or imprecise (Laugel et al, 2019; Slack et al, 2020; Arun et al, 2020). In imaging applications, where the features are raw pixels, this type of interpretability is unnecessary.…”

Section: Disentangled Autoencoders (mentioning; confidence: 99%)

Sparse encoding for more-interpretable feature-selecting representations in probabilistic matrix factorization

Chang, Fletcher, Han, et al. (2020). Preprint.


Dimensionality reduction methods for count data are critical to a wide range of applications in medical informatics and other fields where model interpretability is paramount. For such data, hierarchical Poisson matrix factorization (HPF) and other sparse probabilistic non-negative matrix factorization (NMF) methods are considered to be interpretable generative models. They consist of sparse transformations for decoding their learned representations into predictions. However, sparsity in representation decoding does not necessarily imply sparsity in the encoding of representations from the original data features. HPF is often incorrectly interpreted in the literature as if it possesses encoder sparsity. The distinction between decoder sparsity and encoder sparsity is subtle but important. Due to the lack of encoder sparsity, HPF does not possess the column-clustering property of classical NMF: the factor loading matrix does not sufficiently define how each factor is formed from the original features. We address this deficiency by self-consistently enforcing encoder sparsity, using a generalized additive model (GAM), thereby allowing one to relate each representation coordinate to a subset of the original data features. In doing so, the method also gains the ability to perform feature selection. We demonstrate our method on simulated data and give an example of how encoder sparsity is of practical use in a concrete application of representing inpatient comorbidities in Medicare patients.
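The gap between decoder sparsity and encoder sparsity can be seen even in a purely linear, non-probabilistic analogue. This is only an illustration under stated assumptions: HPF infers representations by posterior inference rather than by a pseudo-inverse, and this is not the GAM-based encoder the abstract proposes. The loading matrix `W` below is hypothetical; the least-squares encoder is its Moore-Penrose pseudo-inverse.

```python
import numpy as np

# Hypothetical decoder / loading matrix W (rows: 3 original features, columns: 2 factors).
# Factor 0 loads only on features 0 and 1; factor 1 loads only on features 1 and 2.
W = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])

# Least-squares encoder: the linear map E with factor scores z = E @ x.
E = np.linalg.pinv(W)  # shape (2, 3)

print("zero entries in decoder:", int(np.sum(np.isclose(W, 0.0))))
print("zero entries in encoder:", int(np.sum(np.isclose(E, 0.0))))
print(np.round(E, 3))
```

Here E equals (1/3)·[[2, 1, -1], [-1, 1, 2]]: every factor score depends on every original feature, including feature 2 for factor 0, although the decoder ties factor 0 only to features 0 and 1. Reading the loading matrix as if it described the encoder would therefore misstate how each factor is formed.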
