2016
DOI: 10.1007/978-3-319-46478-7_28
Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering

Abstract: We address the problem of Visual Question Answering (VQA), which requires joint image and language understanding to answer a question about a given photograph. Recent approaches have applied deep image captioning methods based on convolutional-recurrent networks to this problem, but have failed to model spatial inference. To remedy this, we propose a model we call the Spatial Memory Network and apply it to the VQA task. Memory networks are recurrent neural networks with an explicit attention mechanism…
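The abstract describes attention over spatial image regions guided by the question. As a rough illustration only, here is a minimal PyTorch sketch of question-guided spatial attention in the spirit of a spatial memory network; all names, shapes, and the bilinear scoring function are assumptions for exposition, not the authors' released code.

```python
# Minimal sketch of question-guided spatial attention (illustrative only;
# names, shapes, and the bilinear score are assumptions, not the paper's code).
import torch
import torch.nn.functional as F

def spatial_attention(image_feats, question_emb, W_att):
    """
    image_feats:  (B, L, D)  CNN features at L spatial locations (the "memory")
    question_emb: (B, D)     pooled question embedding
    W_att:        (D, D)     learned attention weights
    Returns the attended visual vector (B, D) and the attention map (B, L).
    """
    # Correlate each spatial location with the question: (B, L)
    scores = torch.einsum('bld,de,be->bl', image_feats, W_att, question_emb)
    # Normalize over spatial locations to obtain an attention map
    alpha = F.softmax(scores, dim=1)
    # Weighted sum of the spatial memory
    attended = torch.einsum('bl,bld->bd', alpha, image_feats)
    return attended, alpha

# Toy usage: a 14x14 feature map (L = 196) with D = 512 channels
B, L, D = 2, 196, 512
feats = torch.randn(B, L, D)
q = torch.randn(B, D)
W = torch.randn(D, D) * 0.01
v, alpha = spatial_attention(feats, q, W)
print(v.shape, alpha.shape)  # torch.Size([2, 512]) torch.Size([2, 196])
```

The softmax over the L spatial locations is what yields a per-region attention map; the attended vector can then be combined with the question representation to predict an answer.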

Cited by 609 publications (479 citation statements)
References 40 publications
“…Similarly to WSL, the attention-based models [63,29,66,64] select relevant regions to support decisions. However the WSL methods usually include some structure on the selection process while it is implicit in attention-based approaches.…”
Section: Related Work
confidence: 99%
“…Noh et al [15] adopted visual attention with joint loss minimization. Xu et al [16] obtained attention map by calculating the semantic similarity between image regions and the question. Ilievski et al [17] used an off-the-shelf object detector to catch the important regions, and then fed the regions into LSTM with global image features.…”
Section: B. Attention Mechanisms for VQA
confidence: 99%
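The excerpt's description of [16], an attention map computed from the semantic similarity between image regions and the question, can be sketched in a few lines. This is a hedged illustration under assumed shapes; `similarity_attention` and the choice of cosine similarity are my own stand-ins, not the cited paper's exact formulation.

```python
# Similarity-based attention sketch (assumed shapes and scoring function).
import torch
import torch.nn.functional as F

def similarity_attention(regions, question):
    # regions: (L, D) image-region embeddings; question: (D,) question embedding
    sim = F.cosine_similarity(regions, question.unsqueeze(0), dim=1)  # (L,)
    return F.softmax(sim, dim=0)  # attention map over the L regions
```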
“…Our approach in the combination with the Normalized Correlation Analysis embedding technique improves on the state-of-the-art of the Visual Madlibs task. Text-Embedding Loss: Motivated by the popularity of deep architectures for visual question answering, that combine a global CNN image representation with an LSTM [7] question representation [4,13,17,20,29,30,31], as well as the leading performance of nCCA on the multi-choice Visual Madlibs task [32], we propose a novel extension of the CNN+LSTM architecture that chooses a prompt completion out of four candidates (see Figure 4) by measuring similarities directly in the embedding space. This contrasts with the prior approach of [32] that uses a post-hoc comparison between the discrete output of the CNN+LSTM method and all four candidates.…”
Section: arXiv:1608.02717v1 [cs.CV] 9 Aug 2016
confidence: 99%
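The candidate-selection step this excerpt describes, scoring four prompt completions directly in a joint embedding space rather than comparing discrete outputs post hoc, might look like the following sketch. The function name and the use of cosine similarity are assumptions; the paper's embedding and training (nCCA, text-embedding loss) are not reproduced here.

```python
# Hypothetical sketch of choosing among four candidate completions by
# similarity in a joint embedding space (not the paper's actual code).
import torch
import torch.nn.functional as F

def choose_completion(query_emb, candidate_embs):
    """
    query_emb:      (D,)   joint CNN+LSTM embedding of the image and prompt
    candidate_embs: (4, D) embeddings of the four candidate completions
    Returns the index of the best-scoring candidate.
    """
    sims = F.cosine_similarity(candidate_embs, query_emb.unsqueeze(0), dim=1)
    return int(torch.argmax(sims))
```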
“…over the image, yield state of the art results [29,30,31]. Another, more focused "hard" attention, has also been studied in the image-to-text retrieval scenario [9] as well as fine-grained categorization [33], person recognition [19] and zero-shot learning [1].…”
Section: Related Work
confidence: 99%