2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2017.334
Top-Down Visual Saliency Guided by Captions

Abstract: Neural image/video captioning models can generate accurate descriptions, but their internal process of mapping regions to words is a black box and therefore difficult to explain. Top-down neural saliency methods can find important regions given a high-level semantic task such as object classification, but cannot use a natural language sentence as the top-down input for the task. In this paper, we propose Caption-Guided Visual Saliency to expose the region-to-word mapping in modern encoder-decoder networks and d…
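The abstract describes recovering a region-to-word mapping from an encoder-decoder captioning model without changing its architecture. A minimal sketch of that idea, assuming a hypothetical `word_logprob` helper that returns the decoder's log-probability of a given caption word for a chosen subset of input regions/frames (an illustrative stand-in, not the authors' code):

```python
# Hedged sketch of caption-guided saliency (hypothetical API, not the authors' code).
# Idea: score each region/frame by how much the decoder's probability for a
# caption word changes when the encoder sees only that region.
import numpy as np

def caption_guided_saliency(model, regions, caption_tokens):
    """model.word_logprob(regions, t) is an assumed helper returning the
    log-probability of caption_tokens[t] given encoder input `regions`."""
    T, R = len(caption_tokens), len(regions)
    saliency = np.zeros((T, R))
    for t in range(T):
        base = model.word_logprob(regions, t)          # probability with full input
        for r in range(R):
            isolated = [regions[r]]                    # keep a single region/frame
            saliency[t, r] = model.word_logprob(isolated, t) - base
    # Normalize each row so it forms a heatmap over regions for that word.
    saliency -= saliency.min(axis=1, keepdims=True)
    saliency /= saliency.sum(axis=1, keepdims=True) + 1e-8
    return saliency
```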

Cited by 143 publications (112 citation statements)
References 23 publications (39 reference statements)
“…We use the same video captioning model, without the average pooling layer, trained on Flickr30kEntities for image captioning. The models have comparable METEOR scores to the Caption-Guided Saliency work of [13], to which we compare our results: 26.5 (vs. 25.9) for video captioning and 18.0 (vs. 18.3) for image captioning.…”
Section: Experiments: Caption Grounding
confidence: 82%
“…We regularize weights of the mappings with ℓ2 regularization with reg value = 0.0005. For VGG, we take outputs from {conv4_1, conv4_3, conv5_1, conv5_3} and map to semantic feature maps with dimension 18×18×1024, and for PNASNet we take outputs from {Cell 5, Cell 7, Cell 9, Cell 11}…” [table residue: pointing game accuracy and attention correctness, comparing [41] and Ours]
Section: Methods
confidence: 99%
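The statement above describes projecting intermediate CNN features into a shared 18×18×1024 semantic space with L2-regularized mapping weights. A minimal PyTorch-style sketch under those assumptions (the 1×1-convolution projection, the `FeatureMapper` name, and using optimizer weight decay as the L2 term are illustrative choices, not the cited authors' implementation):

```python
# Sketch: project tapped VGG conv features into a common 18x18x1024 space,
# with L2 regularization (weight decay) on the mapping weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureMapper(nn.Module):
    def __init__(self, in_channels, out_channels=1024, out_size=18):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.out_size = out_size

    def forward(self, x):
        x = self.proj(x)                                    # project to 1024 channels
        return F.interpolate(x, size=(self.out_size,) * 2,  # resize to 18x18
                             mode='bilinear', align_corners=False)

# One mapper per tapped VGG layer (conv4_x / conv5_x outputs have 512 channels).
mappers = nn.ModuleList([FeatureMapper(512) for _ in range(4)])
optimizer = torch.optim.Adam(mappers.parameters(), lr=1e-4,
                             weight_decay=0.0005)           # the quoted L2 reg value
```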
“…We apply ReLU to the attention map to zero-out dissimilar word-visual region pairs, and simply avoid applying softmax on any dimension of the heatmap tensor. Note that this choice is very different in spirit from the commonly used approach of applying softmax to attention maps [50,49,8,34,17,51,41]. Indeed for irrelevant image-sentence pairs, the attention maps would be almost all zeros while the softmax process would always force attention to be a distribution over the image/words summing to 1.…”
Section: Multi-level Multimodal Attention Mechanism
confidence: 99%
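As a quick illustration of the quoted design choice, the sketch below contrasts a ReLU-thresholded word-region similarity map with the usual softmax normalization; the shared embedding space and function names are assumptions of this sketch, not the cited paper's code:

```python
# ReLU keeps dissimilar word-region pairs at zero; softmax would force every
# row to sum to 1 even when the image and sentence are unrelated.
import torch
import torch.nn.functional as F

def relu_attention(word_feats, region_feats):
    """word_feats: (num_words, d), region_feats: (num_regions, d);
    both assumed to live in a shared embedding space."""
    sim = word_feats @ region_feats.t()        # word-region similarity heatmap
    return F.relu(sim)                         # zero out dissimilar pairs

def softmax_attention(word_feats, region_feats):
    sim = word_feats @ region_feats.t()
    return F.softmax(sim, dim=-1)              # always a distribution over regions
```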
“…Captioning has also proven to improve performance on image-based multimodal retrieval tasks (Rohrbach et al. 2016). Moreover, it is observed (Ramanishka et al. 2017) that captioning models can implicitly learn features and attention mechanisms to associate spatiotemporal regions to words in the captions. As for implementation, the paired sentence-clip annotation format in the text-to-clip task allows us to easily add captioning capabilities to our LSTM model.…”
Section: Multi-task Loss
confidence: 99%
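A minimal sketch of a multi-task objective in the spirit of this statement, combining a text-to-clip retrieval term with an auxiliary captioning cross-entropy term over the paired sentence-clip annotations; the specific loss choices, the weighting `lam`, and the padding index are assumptions, not the cited paper's exact formulation:

```python
# Sketch: add a captioning loss alongside the primary text-to-clip objective.
import torch
import torch.nn.functional as F

def multitask_loss(retrieval_scores, retrieval_labels,
                   caption_logits, caption_targets, lam=1.0):
    # Primary text-to-clip objective (treated here as classification over clips).
    loss_retrieval = F.cross_entropy(retrieval_scores, retrieval_labels)
    # Auxiliary captioning objective: predict each ground-truth caption token.
    loss_caption = F.cross_entropy(
        caption_logits.view(-1, caption_logits.size(-1)),
        caption_targets.view(-1),
        ignore_index=0)                        # 0 = assumed padding index
    return loss_retrieval + lam * loss_caption
```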