Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.432

Learning to Deceive with Attention-Based Explanations

Abstract: Attention mechanisms are ubiquitous components in neural architectures applied to natural language processing. In addition to yielding gains in predictive accuracy, attention weights are often claimed to confer interpretability, purportedly useful both for providing insights to practitioners and for explaining why a model makes its decisions to stakeholders. We call the latter use of attention mechanisms into question by demonstrating a simple method for training models to produce deceptive attention masks. Ou…
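The method sketched in the abstract trains a model whose attention weights avoid the tokens it is not supposed to appear to rely on. Below is a minimal sketch of that idea, assuming a penalty on the attention mass assigned to a set of "impermissible" positions; the penalty form and the names `lambda_penalty` and `impermissible_mask` are illustrative assumptions, not the paper's exact objective.

```python
# Hedged sketch: train a classifier while penalizing the attention mass
# placed on "impermissible" tokens, so the attention map looks deceptive.
# The penalty form and names (lambda_penalty, impermissible_mask) are
# illustrative assumptions, not the paper's exact loss.
import torch
import torch.nn.functional as F

def deceptive_attention_loss(logits, labels, attention, impermissible_mask,
                             lambda_penalty=0.1):
    """
    logits:             (batch, num_classes) model predictions
    attention:          (batch, seq_len) attention weights summing to 1 per example
    impermissible_mask: (batch, seq_len) 1.0 where attention should be hidden
    """
    task_loss = F.cross_entropy(logits, labels)
    # Total attention mass assigned to impermissible positions.
    hidden_mass = (attention * impermissible_mask).sum(dim=-1).mean()
    # Minimizing this term pushes attention away from tokens the model
    # may still rely on through other pathways.
    return task_loss + lambda_penalty * hidden_mass
```

The paper's point is that attention can be driven away from such tokens while the model continues to use them, so a low attention weight by itself is not evidence that a token played no role in the prediction.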

Cited by 127 publications (114 citation statements)
References 15 publications

Citation statements
“…for all examples in the validation set. This metric estimates the extent to which the model is attributing its predictions to gender (an unbiased model should have less of this attribution), and is similar to the measure of bias used by Pruthi et al (2020).…”
Section: Gradient
confidence: 99%
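The bias measure quoted above can be read as a gradient-based attribution restricted to gender tokens: how much of the input gradient falls on gender-word positions, averaged over the validation set. The sketch below is one plausible formulation under that reading; the interface (`model(embeddings)` returning class logits) and the name `gender_mask` are assumptions, not the cited paper's exact metric.

```python
# Hedged sketch of a gradient-based "attribution to gender" score:
# the share of input-gradient norm that falls on gender-word positions.
# The aggregation and names (gender_mask) are illustrative assumptions.
import torch

def gender_attribution(model, embeddings, gender_mask, target_class):
    """
    embeddings:  (batch, seq_len, dim) input embeddings
    gender_mask: (batch, seq_len) 1.0 at gender-word positions
    """
    embeddings.requires_grad_(True)
    logits = model(embeddings)            # assumed: embeddings -> class logits
    score = logits[:, target_class].sum()
    grads, = torch.autograd.grad(score, embeddings)
    per_token = grads.norm(dim=-1)        # (batch, seq_len) gradient magnitude
    # Fraction of total gradient norm attributed to gender tokens,
    # averaged over the batch; average this over the validation set.
    return ((per_token * gender_mask).sum(dim=-1) /
            per_token.sum(dim=-1).clamp_min(1e-9)).mean()
```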
“…We instead manipulate explanation methods to evaluate the extent to which a model's true reasoning can be hidden. Pruthi et al (2020) manipulate attention distributions in an end-to-end fashion; we focus on manipulating gradients. It is worth noting that we perturb models to manipulate interpretations; other work perturbs inputs (Ghorbani et al, 2019;Dombrowski et al, 2019;Subramanya et al, 2019).…”
Section: Related Work
confidence: 99%
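For contrast with the attention-based manipulation above, the gradient-manipulation idea mentioned in this statement can be sketched as a training-time regularizer that shrinks input gradients at chosen positions, so that gradient-based attribution methods under-report those tokens. This is a hypothetical illustration of the general idea, not the cited paper's objective; the names and penalty form are assumptions.

```python
# Hedged, hypothetical sketch of manipulating gradient-based explanations:
# penalize the input-gradient norm on chosen tokens during training so that
# gradient attributions there appear small. Names (mask, reg_strength) are
# illustrative; this is not any specific paper's exact objective.
import torch
import torch.nn.functional as F

def gradient_masking_loss(model, embeddings, labels, mask, reg_strength=1.0):
    """
    embeddings: (batch, seq_len, dim) input embeddings
    mask:       (batch, seq_len) 1.0 at positions whose gradients to hide
    """
    embeddings.requires_grad_(True)
    logits = model(embeddings)            # assumed: embeddings -> class logits
    task_loss = F.cross_entropy(logits, labels)
    # create_graph=True so the gradient penalty itself is differentiable.
    grads, = torch.autograd.grad(task_loss, embeddings, create_graph=True)
    hidden_grad = (grads.norm(dim=-1) * mask).sum(dim=-1).mean()
    return task_loss + reg_strength * hidden_grad
```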
“…As for human consuming attention as explanation, there has been criticism that unsupervised attention weights are too poorly correlated with the contribution of each word for machine decision (or, unfaithful) (Jain and Wallace, 2019;Serrano and Smith, 2019;Pruthi et al, 2019). Meanwhile, (Wiegreffe and Pinter, 2019) develops diagnostics to decide when attention is good enough as explanation.…”
Section: Attention To/from Human
confidence: 99%