2020
DOI: 10.1609/aaai.v34i07.6858

Explanation vs Attention: A Two-Player Game to Obtain Attention for VQA

Abstract: In this paper, we aim to obtain improved attention for a visual question answering (VQA) task. It is challenging to provide supervision for attention. An observation we make is that visual explanations as obtained through class activation mappings (specifically Grad-CAM) that are meant to explain the performance of various networks could form a means of supervision. However, as the distributions of attention maps and that of Grad-CAMs differ, it would not be suitable to directly use these as a form of supervision […]
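
As an illustration of the idea sketched in the abstract, the snippet below shows one way a two-player game between attention and explanation could be set up: a discriminator learns to tell Grad-CAM maps apart from attention maps, while the attention module learns to fool it, so the attention distribution is pulled toward the explanation distribution without using Grad-CAM maps as direct regression targets. This is a minimal hypothetical sketch, not the authors' architecture; the module names, dimensions, losses, and training loop are all assumptions.

```python
# Hypothetical sketch of adversarial (two-player) alignment between attention maps
# and Grad-CAM explanation maps; not the authors' exact method.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionModule(nn.Module):
    """Toy question-conditioned attention over K image regions."""

    def __init__(self, region_dim=512, question_dim=512, hidden=256):
        super().__init__()
        self.proj = nn.Linear(region_dim + question_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, regions, question):
        # regions: (B, K, region_dim), question: (B, question_dim)
        q = question.unsqueeze(1).expand(-1, regions.size(1), -1)
        h = torch.tanh(self.proj(torch.cat([regions, q], dim=-1)))
        return F.softmax(self.score(h).squeeze(-1), dim=-1)  # (B, K) attention map


class MapDiscriminator(nn.Module):
    """Classifies a K-dim map as a Grad-CAM map ("real") or an attention map ("fake")."""

    def __init__(self, num_regions=36, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(num_regions, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, maps):
        return self.net(maps).squeeze(-1)  # (B,) real/fake logits


def two_player_step(attn, disc, regions, question, gradcam_maps, opt_attn, opt_disc):
    """One update of the two-player game: discriminator step, then attention step."""
    # Discriminator: distinguish Grad-CAM maps from (detached) attention maps.
    attn_maps = attn(regions, question)
    d_real, d_fake = disc(gradcam_maps), disc(attn_maps.detach())
    loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_disc.zero_grad()
    loss_d.backward()
    opt_disc.step()

    # Attention module: make its maps indistinguishable from Grad-CAM maps.
    d_fake = disc(attn(regions, question))
    loss_a = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    opt_attn.zero_grad()
    loss_a.backward()
    opt_attn.step()
    return loss_d.item(), loss_a.item()


# Example call with random tensors (batch of 8, 36 regions of dim 512):
# attn, disc = AttentionModule(), MapDiscriminator()
# opt_attn = torch.optim.Adam(attn.parameters(), lr=1e-4)
# opt_disc = torch.optim.Adam(disc.parameters(), lr=1e-4)
# two_player_step(attn, disc, torch.randn(8, 36, 512), torch.randn(8, 512),
#                 F.softmax(torch.randn(8, 36), dim=-1), opt_attn, opt_disc)
```

Treating the Grad-CAM maps as the "real" samples reflects the abstract's observation that the two distributions differ, which is why an adversarial criterion is used in this sketch instead of directly regressing attention onto the explanation maps.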

Cited by 31 publications (24 citation statements)
References 25 publications
“…Patro et al. utilize adversarial training of the attention regions as a two-player game between attention and explanation [16]. We adopted a VQA model similar to what was proposed by Alipour et al. [4], where the attention is derived from a transformer model [17]. Counterfactuals: Counterfactual examples have also been used to explain image classifiers.…”
Section: Related Work (mentioning)
confidence: 99%
“…Some other works (such as DCN [39], BAN [40], and MCAN [41]) investigate "dense" co-attention that uses bidirectional attention between images and questions. More recent works try to capture more complex visual-textual information [42]-[45]. Our work instead tries to keep our approach as simple as possible by using three independently trained models to obtain the entropy.…”
Section: Related Work (mentioning)
confidence: 99%
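
The last sentence of the snippet above mentions obtaining an entropy from three independently trained models. As a purely illustrative reading (the cited work's exact formulation is not given here; the function name, the choice to average probabilities, and the tensor shapes are assumptions), the entropy of the ensemble-averaged answer distribution could be computed as follows.

```python
# Hypothetical helper: entropy of the mean answer distribution over an ensemble
# of independently trained VQA models (illustrative only).
import torch
import torch.nn.functional as F


def ensemble_answer_entropy(logits_per_model):
    """logits_per_model: list of (batch, num_answers) tensors, one per model."""
    mean_probs = torch.stack([F.softmax(l, dim=-1) for l in logits_per_model]).mean(dim=0)
    return -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(dim=-1)  # (batch,) entropy
```
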
“…Li et al. [26] have proposed an advanced method to generate a sequence of conversation about an image. Patro et al. [39] have proposed an adversarial method to improve explanation and attention using a surrogate supervision method. In this work, we propose a collaborative correlated module to generate both the answer and a textual explanation of that answer, which will be tightly correlated with each other.…”
Section: Related Work (mentioning)
confidence: 99%