2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2018.00636

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Abstract: Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings.
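To make the idea concrete, here is a minimal sketch of top-down (additive, soft) attention over a set of bottom-up region features. The 2048-dimensional features, query size, and layer names are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    """Soft (additive) top-down attention over bottom-up region features."""
    def __init__(self, feat_dim=2048, query_dim=1024, hidden_dim=512):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, hidden_dim)
        self.query_proj = nn.Linear(query_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, regions, query):
        # regions: (batch, k, feat_dim) object/region features from a detector
        # query:   (batch, query_dim)  task context, e.g. an LSTM hidden state
        scores = self.score(torch.tanh(
            self.feat_proj(regions) + self.query_proj(query).unsqueeze(1)
        )).squeeze(-1)                                      # (batch, k)
        alpha = F.softmax(scores, dim=-1)                   # attention weights
        attended = (alpha.unsqueeze(-1) * regions).sum(1)   # (batch, feat_dim)
        return attended, alpha
```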

Cited by 4,379 publications (4,589 citation statements) | References 52 publications
“…In this work, we propose a novel task of weakly-supervised relation prediction, with the objective of detecting relations between entities in an image purely from captions and object-level bounding box annotations, without class information. Our proposed method builds upon top-down attention (Anderson et al., 2018), which generates captions and grounds words in these captions to entities in images.…”
Section: Results (mentioning)
Confidence: 99%
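As a rough illustration of grounding generated words to detected regions via attention, the hypothetical helper below simply assigns each word to the region with the largest attention weight; it is only a sketch of the idea, not the cited method.

```python
import torch

def ground_words_to_regions(attention_weights, boxes):
    # attention_weights: (num_words, num_regions) softmax weights recorded
    #                    while decoding the caption (hypothetical tensors).
    # boxes:             (num_regions, 4) detected bounding boxes.
    best_region = attention_weights.argmax(dim=-1)  # most-attended region per word
    return boxes[best_region]                       # (num_words, 4) grounded boxes
```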
“…Captioning using visual attention has proven to be very successful in aligning the words in a caption to their corresponding visual features, such as in Anderson et al. (2018). As shown in Figure 1, we adopt the two-layer LSTM architecture of Anderson et al. (2018); our end goal, however, is to associate each word with the closest object feature rather than producing a caption. The lower Attention LSTM cell takes in the words and the global image context vector (f, the mean of all features F), and its hidden state h_t^a acts as a query vector.…”
Section: Grounding Caption Words To Object Features (mentioning)
Confidence: 99%
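A minimal sketch of the two-layer decoder described in this excerpt, assuming PyTorch and illustrative dimensions: a lower attention LSTM consumes the previous word, the mean-pooled image feature, and the language LSTM's previous hidden state, and its hidden state h_t^a serves as the query for attention over region features. Layer names and sizes are assumptions, not the cited implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerAttentionDecoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=512, feat_dim=2048, hid_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Lower "attention" LSTM: previous word, mean image feature, language LSTM state.
        self.att_lstm = nn.LSTMCell(emb_dim + feat_dim + hid_dim, hid_dim)
        # Upper "language" LSTM: attended image feature plus attention LSTM state.
        self.lang_lstm = nn.LSTMCell(feat_dim + hid_dim, hid_dim)
        self.att_v = nn.Linear(feat_dim, hid_dim)
        self.att_h = nn.Linear(hid_dim, hid_dim)
        self.att_out = nn.Linear(hid_dim, 1)
        self.word_out = nn.Linear(hid_dim, vocab_size)

    def step(self, word, regions, state):
        # word: (batch,) token ids; regions: (batch, k, feat_dim) object features
        (h_a, c_a), (h_l, c_l) = state
        f_mean = regions.mean(dim=1)                          # global image context
        h_a, c_a = self.att_lstm(
            torch.cat([self.embed(word), f_mean, h_l], dim=-1), (h_a, c_a))
        # h_a acts as the query for additive attention over the region features.
        scores = self.att_out(torch.tanh(
            self.att_v(regions) + self.att_h(h_a).unsqueeze(1))).squeeze(-1)
        alpha = F.softmax(scores, dim=-1)                     # (batch, k)
        v_hat = (alpha.unsqueeze(-1) * regions).sum(dim=1)    # attended feature
        h_l, c_l = self.lang_lstm(torch.cat([v_hat, h_a], dim=-1), (h_l, c_l))
        return self.word_out(h_l), alpha, ((h_a, c_a), (h_l, c_l))
```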
“…For each image I, we extract 100 region proposals and their associated region features. However, different from bottom-up and top-down attention [25], we select the image region feature R ∈ ℝ^(u×2×2×2048) as input. We map the dynamically changing question vector to the scaling factor and bias term of the channel features through the fully connected layers fc and hc.…”
Section: Visual and Language Feature Preprocess (mentioning)
Confidence: 99%
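The question-conditioned scaling and bias described in this excerpt resembles channel-wise feature modulation; the sketch below is a hypothetical reading of that step, with layer names (fc_gamma, fc_beta) and dimensions chosen for illustration rather than taken from the cited model.

```python
import torch
import torch.nn as nn

class QuestionConditionedModulation(nn.Module):
    """Map a question vector to per-channel scale and bias, then apply them
    to a region feature tensor of shape (u, 2, 2, channels)."""
    def __init__(self, q_dim=1024, channels=2048):
        super().__init__()
        self.fc_gamma = nn.Linear(q_dim, channels)  # scaling factor
        self.fc_beta = nn.Linear(q_dim, channels)   # bias term

    def forward(self, R, q):
        # R: (batch, u, 2, 2, channels) region features; q: (batch, q_dim)
        gamma = self.fc_gamma(q).view(q.size(0), 1, 1, 1, -1)
        beta = self.fc_beta(q).view(q.size(0), 1, 1, 1, -1)
        return gamma * R + beta   # channel-wise modulated region features
```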
“…The encoder-decoder model first extracts high-level visual features from a CNN trained on the image classification task, and then feeds the visual features into an RNN model to predict subsequent words of a caption for a given image. In recent years, a variety of successive models [2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][18][19][20] have achieved promising results. Semantic concept analysis, or attribute prediction [17,21], is a task closely related to image captioning, because attributes can be interpreted as a basis for descriptions.…”
Section: Deep Image Captioning (mentioning)
Confidence: 99%
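For readers unfamiliar with the encoder-decoder paradigm summarised in this excerpt, a bare-bones sketch follows; ResNet-101, a single-layer LSTM, and teacher-forced caption inputs are illustrative choices, not any specific cited model.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderDecoderCaptioner(nn.Module):
    def __init__(self, vocab_size, emb_dim=512, hid_dim=512):
        super().__init__()
        cnn = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # drop classifier head
        self.feat_to_h = nn.Linear(2048, hid_dim)
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, images, captions):
        # images: (batch, 3, H, W); captions: (batch, T) token ids (teacher forcing)
        with torch.no_grad():                         # keep the CNN encoder frozen
            feats = self.encoder(images).flatten(1)   # (batch, 2048) visual features
        h0 = torch.tanh(self.feat_to_h(feats)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        hidden, _ = self.rnn(self.embed(captions), (h0, c0))
        return self.out(hidden)                       # (batch, T, vocab_size) logits
```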