2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018)
DOI: 10.1109/cvpr.2018.00801

Differential Attention for Visual Question Answering

Abstract: In this paper we aim to answer questions based on images when provided with a dataset of question-answer pairs for a number of images during training. A number of methods have focused on solving this problem by using image-based attention. This is done by focusing on a specific part of the image while answering the question. Humans also do so when solving this problem. However, the regions that the previous systems focus on are not correlated with the regions that humans focus on. The accuracy is limited due to…

Cited by 70 publications (53 citation statements: 0 supporting, 53 mentioning, 0 contrasting). References 21 publications.
“…The current dominant framework for VQA systems consists of an image encoder, a question encoder, multimodal fusion, and an answer predictor. In lieu of directly using visual features from CNN-based feature extractors, [56,11,41,33,49,38,63,36] explored various image attention mechanisms to locate regions that are relevant to the question. To learn a better representation of the question, [33,38,11] proposed to perform question-guided image attention and image-guided question attention collaboratively, to merge knowledge from both visual and textual modalities in the encoding stage.…”
Section: Visual Question Answering
Citation type: mentioning (confidence: 99%)
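To make the co-attention idea in the quote above concrete, here is a minimal PyTorch sketch (hypothetical module names and dimensions; not the cited paper's differential-attention architecture): image regions and question words score each other through an affinity matrix, yielding question-guided image attention and image-guided question attention, whose summaries are fused for answer prediction.

```python
# Minimal co-attention sketch (hypothetical, for illustration only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoAttention(nn.Module):
    def __init__(self, img_dim=2048, q_dim=512, hid=512, n_answers=3000):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hid)   # project region features
        self.q_proj = nn.Linear(q_dim, hid)       # project question word features
        self.fuse = nn.Linear(hid, n_answers)     # answer classifier

    def forward(self, img_feats, q_feats):
        # img_feats: (B, R, img_dim) region features; q_feats: (B, T, q_dim) word features
        V = self.img_proj(img_feats)                 # (B, R, hid)
        Q = self.q_proj(q_feats)                     # (B, T, hid)
        affinity = torch.bmm(V, Q.transpose(1, 2))   # (B, R, T) region-word affinity
        # Question-guided image attention: weight each region by its best word match.
        v_att = F.softmax(affinity.max(dim=2).values, dim=1)   # (B, R)
        # Image-guided question attention: weight each word by its best region match.
        q_att = F.softmax(affinity.max(dim=1).values, dim=1)   # (B, T)
        v = (v_att.unsqueeze(2) * V).sum(dim=1)      # attended image summary (B, hid)
        q = (q_att.unsqueeze(2) * Q).sum(dim=1)      # attended question summary (B, hid)
        return self.fuse(v * q)                      # element-wise fusion -> answer logits
```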
“…The visual attention mechanism has been widely applied to form joint representations of input questions and images, which are subsequently handled by a classifier to produce answer predictions. Recent years have seen significant performance improvements, achieved either by enhancing the visual attention module (Xu and Saenko, 2016; Yang et al., 2016; Kazemi and Elqursh, 2017; Anderson et al., 2018; Patro and Namboodiri, 2018) or by improving the quality of the joint embedding (Fukui et al., 2016; Lu et al., 2016; Noh et al., 2016; Ben-Younes et al., 2017; Yu et al., 2017). With model ensembling, the current state-of-the-art model achieves over 72% accuracy (Jiang et al., 2018) on the VQA v2.0 test set.…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
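The attend-then-classify pipeline this quote describes can be sketched as follows (a hypothetical single-glimpse model, not any specific cited system): the pooled question vector scores every image region, the attended image feature is fused with the question, and a classifier outputs answer logits.

```python
# Minimal single-glimpse attention VQA sketch (hypothetical, for illustration only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionVQA(nn.Module):
    def __init__(self, img_dim=2048, q_dim=512, hid=512, n_answers=3000):
        super().__init__()
        self.score = nn.Linear(img_dim + q_dim, 1)        # attention scorer
        self.cls = nn.Sequential(
            nn.Linear(img_dim + q_dim, hid), nn.ReLU(),
            nn.Linear(hid, n_answers))                    # answer classifier

    def forward(self, img_feats, q_vec):
        # img_feats: (B, R, img_dim) region features; q_vec: (B, q_dim) pooled question
        q_tiled = q_vec.unsqueeze(1).expand(-1, img_feats.size(1), -1)
        logits = self.score(torch.cat([img_feats, q_tiled], dim=2)).squeeze(2)
        alpha = F.softmax(logits, dim=1)                  # (B, R) region weights
        v = (alpha.unsqueeze(2) * img_feats).sum(dim=1)   # attended image feature
        return self.cls(torch.cat([v, q_vec], dim=1))     # joint embedding -> answers
```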
“…There is little relationship modeling between the question modality and the image modality, so the model behaves more like a black box without an interpretable process. Some recent works (Fukui et al., 2016; Lu et al., 2016; Noh et al., 2016; Xu and Saenko, 2016; Ben-Younes et al., 2017; Kazemi and Elqursh, 2017; Yu et al., 2017; Anderson et al., 2018; Kim et al., 2018; Patro and Namboodiri, 2018) introduce the attention mechanism into VQA models to attend questions to salient regions of input images, so that the joint embedding of attended regions and questions carries more accurate information for question answering. With model ensembling, attention-based VQA models can achieve over 72% prediction accuracy (Jiang et al., 2018) on the test set of the VQA v2.0 dataset (Goyal et al., 2017).…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
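The model-ensemble step mentioned in the last two quotes is typically just an average of per-model answer distributions. A minimal sketch (hypothetical helper, compatible with models like the one above):

```python
# Ensemble prediction by averaging softmax outputs (hypothetical helper).
import torch
import torch.nn.functional as F

def ensemble_predict(models, img_feats, q_vec):
    """Average answer probabilities over an ensemble; returns (B,) answer ids."""
    with torch.no_grad():
        probs = torch.stack(
            [F.softmax(m(img_feats, q_vec), dim=1) for m in models]).mean(dim=0)
    return probs.argmax(dim=1)
```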
“…Image QA (Antol et al. 2015; Gao et al. 2015; Ren, Kiros, and Zemel 2015b; Lu et al. 2018b; Nam, Ha, and Kim 2017; Yang et al. 2016; Xu and Saenko 2016; Lu et al. 2016; Patro and Namboodiri 2018; Teney et al. 2018), the task of inferring answers to questions about a given image, has seen much progress recently. Building on the image-captioning framework, most early works adopt typical CNN-RNN models, which use Convolutional Neural Networks (CNNs) to extract image features and Recurrent Neural Networks (RNNs) to represent the question.…”
Section: Image Question Answering
Citation type: mentioning (confidence: 99%)
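The early CNN-RNN baseline described in this quote can be sketched as follows (hypothetical layer sizes; the global CNN image feature is assumed precomputed): an LSTM encodes the question, the image feature is projected to the same space, and the element-wise fused vector is classified into answers.

```python
# Minimal CNN-RNN baseline sketch (hypothetical, for illustration only).
import torch
import torch.nn as nn

class CNNRNNBaseline(nn.Module):
    def __init__(self, vocab=10000, emb=300, hid=512, img_dim=2048, n_answers=3000):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hid, batch_first=True)   # question encoder
        self.img_fc = nn.Linear(img_dim, hid)             # map CNN feature to hid
        self.cls = nn.Linear(hid, n_answers)

    def forward(self, img_feat, question_ids):
        # img_feat: (B, img_dim) pooled CNN feature; question_ids: (B, T) token ids
        _, (h, _) = self.lstm(self.embed(question_ids))
        q = h[-1]                                         # (B, hid) final hidden state
        v = torch.tanh(self.img_fc(img_feat))             # (B, hid) projected image
        return self.cls(v * q)                            # element-wise fusion -> logits
```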