2017 IEEE International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2017.145

Structured Attentions for Visual Question Answering

Abstract: Visual attention, which assigns weights to image regions according to their relevance to a question, is considered an indispensable component of most Visual Question Answering models. Although questions may involve complex relations among multiple regions, few attention models can effectively encode such cross-region relations. In this paper, we demonstrate the importance of encoding such relations by showing the limited effective receptive field of ResNet on two datasets, and propose to model the visual att…
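To make the attention mechanism the abstract refers to concrete, here is a minimal sketch of standard question-guided visual attention, where each region is scored independently of its neighbours (the per-region independence is exactly what the paper identifies as a limitation). The module names, dimensions, and projection layers are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of question-guided visual attention (illustrative only;
# dimensions and projection layers are assumptions, not the paper's design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftRegionAttention(nn.Module):
    def __init__(self, v_dim=2048, q_dim=1024, hidden=512):
        super().__init__()
        self.proj_v = nn.Linear(v_dim, hidden)   # project region features
        self.proj_q = nn.Linear(q_dim, hidden)   # project question embedding
        self.score = nn.Linear(hidden, 1)        # scalar relevance per region

    def forward(self, regions, question):
        # regions: (B, N, v_dim) grid features; question: (B, q_dim)
        joint = torch.tanh(self.proj_v(regions) + self.proj_q(question).unsqueeze(1))
        weights = F.softmax(self.score(joint).squeeze(-1), dim=1)  # (B, N)
        # Attended visual summary: weighted sum of region features.
        return (weights.unsqueeze(-1) * regions).sum(dim=1), weights

# Example: a 14x14 ResNet feature grid flattened to 196 regions.
attn = SoftRegionAttention()
v = torch.randn(2, 196, 2048)
q = torch.randn(2, 1024)
summary, w = attn(v, q)   # summary: (2, 2048); w sums to 1 over regions
```

Note that the softmax normalizes each region's score in isolation; no term couples the weight of one region to another, which is the cross-region gap the paper targets.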

Cited by 94 publications (54 citation statements); references 24 publications.

Citation statements, ordered by relevance:
“…The current dominant framework for VQA systems consists of an image encoder, a question encoder, multimodal fusion, and an answer predictor. In lieu of directly using visual features from CNN-based feature extractors, [56,11,41,33,49,38,63,36] explored various image attention mechanisms to locate regions that are relevant to the question. To learn a better representation of the question, [33,38,11] proposed to perform question-guided image attention and image-guided question attention collaboratively, to merge knowledge from both visual and textual modalities in the encoding stage.…”
Section: Visual Question Answering
confidence: 99%
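As a hedged illustration of the four-stage framework the quoted passage describes (image encoder, question encoder, multimodal fusion, answer predictor), the skeleton below wires the stages together. The component choices here (pooled CNN features, a GRU question encoder, element-wise product fusion) are common defaults assumed for illustration, not the specific designs of the cited works.

```python
# Hedged skeleton of the dominant VQA pipeline described above.
# All component choices are common defaults, not the cited designs.
import torch
import torch.nn as nn

class VQAPipeline(nn.Module):
    def __init__(self, vocab_size, num_answers, v_dim=2048, q_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)
        self.q_enc = nn.GRU(300, q_dim, batch_first=True)  # question encoder
        self.v_proj = nn.Linear(v_dim, q_dim)              # align modalities
        self.classifier = nn.Linear(q_dim, num_answers)    # answer predictor

    def forward(self, image_feats, question_tokens):
        # image_feats: (B, v_dim) pooled CNN features (image encoder output)
        _, h = self.q_enc(self.embed(question_tokens))     # h: (1, B, q_dim)
        q = h.squeeze(0)
        fused = self.v_proj(image_feats) * q               # element-wise fusion
        return self.classifier(fused)                      # answer logits
```

The attention mechanisms surveyed in the quote replace the pooled `image_feats` with a question-weighted sum over regions, as in the attention sketch above.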
“…By automatically ignoring irrelevant information from the data, neural networks can selectively focus on important features. This approach has achieved great success in Natural Language Processing (NLP) [3], image captioning [40] and VQA [46]. There are many variants of the attention mechanism.…”
Section: Related Work
confidence: 99%
“…The first work to use graph networks in VQA is [34], which combines dependency parses of questions and scene graph representations of abstract scenes. [45] proposes modeling structured visual attention over a Conditional Random Field on image regions. A recent work, [28], conditions on a question to learn a graph representation of an image, capturing object interactions with the relevant neighbours via spatial graph convolutions.…”
Section: Graph Network and Contextualized Representations
confidence: 99%
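The structured attention attributed to [45] poses the attention map as the marginals of a Conditional Random Field over image regions, so each region's weight depends on its neighbours rather than being scored independently. The sketch below is a minimal mean-field-style iteration on a 2D grid; the unary scores, fixed 4-neighbour coupling, and iteration count are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch: refining attention with neighbour interactions on a grid,
# in the spirit of CRF mean-field updates (illustrative assumptions only).
import torch
import torch.nn.functional as F

def structured_attention(unary, iters=3, coupling=0.5):
    # unary: (B, H, W) question-conditioned relevance scores per region.
    q = F.softmax(unary.flatten(1), dim=1).view_as(unary)  # initial marginals
    # A fixed 4-neighbour averaging kernel stands in for pairwise potentials.
    kernel = torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]]) / 4.0
    kernel = kernel.view(1, 1, 3, 3)
    for _ in range(iters):
        # Message passing: each region aggregates its neighbours' beliefs.
        msg = F.conv2d(q.unsqueeze(1), kernel, padding=1).squeeze(1)
        # Renormalize unary scores plus weighted neighbour agreement.
        q = F.softmax((unary + coupling * msg).flatten(1), dim=1).view_as(unary)
    return q  # (B, H, W); sums to 1 over all regions

# Example on a 14x14 feature grid.
scores = torch.randn(2, 14, 14)
attn = structured_attention(scores)  # smoothed, relation-aware weights
```

Unlike the independent softmax in the first sketch, each iteration here propagates belief between adjacent regions, which is one simple way to encode the cross-region relations the paper argues for.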