Do explanations make VQA models more predictable to a human?

Chandrasekaran, Arjun; Prabhu, Viraj; Yadav, Deshraj; Chattopadhyay, Prithvijit; Parikh, Devi

doi:10.18653/v1/d18-1128

Cited by 49 publications

(69 citation statements)

References 22 publications

“…Even for well-defined tasks such as VQA, answers to questions like “Is it sunny?” can be inferred using multiple image regions. Indeed, inclusion of attention maps does not make a model more predictable for human observers (Chandrasekaran et al, 2018), and the attention-based models and humans do not look at same image regions (Das et al, 2016). This suggests attention maps are an unreliable means of conveying interpretable predictions.…”

Section: Shortcomings of V&L Research (mentioning)

confidence: 99%

“…However, learning to predict explanations can suffer from many of the same problems faced by image captioning: evaluation is difficult and there can be multiple valid explanations. Currently, there is no reliable evidence that such explanations actually make the model more interpretable, but there is some evidence of the contrary (Chandrasekaran et al, 2018).…”

Section: Shortcomings of V&L Research (mentioning)

confidence: 99%

Challenges and Prospects in Vision and Language Research

Kafle; Shrestha; Kanan

2019

Front. Artif. Intell.


Language grounded image understanding tasks have often been proposed as a method for evaluating progress in artificial intelligence. Ideally, these tasks should test a plethora of capabilities that integrate computer vision, reasoning, and natural language understanding. However, rather than behaving as visual Turing tests, recent studies have demonstrated state-of-the-art systems are achieving good performance through flaws in datasets and evaluation procedures. We review the current state of affairs and outline a path forward.

“…Usefulness of Explanations. Finally, other work studies how useful interpretations are for humans. and Lai and Tan (2019) show that text interpretations can provide benefits to humans, while Chandrasekaran et al (2018) shows explanations for visual QA models provided limited benefit. We present a method that enables adversaries to manipulate interpretations, which can have dire consequences for real-world users (Lakkaraju and Bastani, 2020).…”

Section: Natural Failures of Interpretation Methods (mentioning)

confidence: 99%

Gradient-based Analysis of NLP Models is Manipulable

Wang; Tuyls; Wallace; et al.

2020

Findings of the Association for Computational Linguistics: EMNLP 2020


Gradient-based analysis methods, such as saliency map visualizations and adversarial input perturbations, have found widespread use in interpreting neural NLP models due to their simplicity, flexibility, and most importantly, their faithfulness. In this paper, however, we demonstrate that the gradients of a model are easily manipulable, and thus bring into question the reliability of gradient-based analyses. In particular, we merge the layers of a target model with a FACADE model that overwhelms the gradients without affecting the predictions. This FACADE model can be trained to have gradients that are misleading and irrelevant to the task, such as focusing only on the stop words in the input. On a variety of NLP tasks (text classification, NLI, and QA), we show that our method can manipulate numerous gradient-based analysis techniques: saliency maps, input reduction, and adversarial perturbations all identify unimportant or targeted tokens as being highly important. The code and a tutorial of this paper is available at http://ucinlp.github.io/facade.

“…c. ESIM+ELMo [10]: ESIM is another high performing model for sentence-pair classification tasks, particularly when used with ELMo embeddings [57]. 9 We follow the standard train, val and test splits. VQA Baselines Additionally we compare our approach to models developed on the VQA dataset [5].…”

Section: Baselines (mentioning)

confidence: 99%

From Recognition to Cognition: Visual Commonsense Reasoning

Zellers; Bisk; Farhadi; et al.

2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

[Figure text extracted in place of the abstract: example multiple-choice questions, each with four answer options and four rationale options (e.g., “Why is [person4] pointing at [person1]?” and “How did [person2] get the money that's in front of her?”), followed by a breakdown of question types: Explanation 38%, Activity 24%, Temporal 13%, Mental 8%, Role 7%, Scene 5%, Hypothetical 5%.]
