Top-Down Visual Saliency Guided by Captions

Ramanishka, Vasili; Das, Abir; Zhang, Jianming; Saenko, Kate

doi:10.1109/cvpr.2017.334

Cited by 143 publications

(112 citation statements)

References 23 publications

(39 reference statements)

Supporting

Mentioning

111

Contrasting


“…We use the same video captioning model, without the average pooling layer, trained on Flickr30kEntities for image captioning. The models have comparable METEOR scores to the Caption-Guided Saliency work of [13], to which we compare our results: 26.5 (vs. 25.9) for video captioning and 18.0 (vs. 18.3) for image captioning.…”

Section: Experiments: Caption Grounding (mentioning)

confidence: 82%

Excitation Backprop for RNNs

Bargal

Zunino

Kim

et al. 2018

2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition

Self Cite


Figure 1: Our proposed framework spatiotemporally highlights/grounds the evidence that an RNN model used in producing a class label or caption for a given input video. In this example, by using our proposed back-propagation method, the evidence for the activity class CliffDiving is highlighted in a video that contains CliffDiving and HorseRiding. Our model employs a single backward pass to produce saliency maps that highlight the evidence that a given RNN used in generating its outputs.

Deep models are state-of-the-art for many vision tasks including video action recognition and video captioning. Models are trained to caption or classify activity in videos, but little is known about the evidence used to make such decisions. Grounding decisions made by deep networks has been studied in spatial visual content, giving more insight into model predictions for images. However, such studies are relatively lacking for models of spatiotemporal visual content: videos. In this work, we devise a formulation that simultaneously grounds evidence in space and time, in a single pass, using top-down saliency. We visualize the spatiotemporal cues that contribute to a deep model's classification/captioning output using the model's internal representation. Based on these spatiotemporal cues, we are able to localize segments within a video that correspond with a specific action, or phrase from a caption, without explicitly optimizing/training for these tasks.


“…We regularize weights of the mappings with ℓ2 regularization with reg value = 0.0005. For VGG, we take outputs from {conv4_1, conv4_3, conv5_1, conv5_3} and map to semantic feature maps with dimension 18×18×1024, and for PNAS-Net we take outputs from {Cell 5, Cell 7, Cell 9, Cell 11}…”

Section: Methods (mentioning)

confidence: 99%

“…We apply ReLU to the attention map to zero-out dissimilar word-visual region pairs, and simply avoid applying softmax on any dimension of the heatmap tensor. Note that this choice is very different in spirit from the commonly used approach of applying softmax to attention maps [50,49,8,34,17,51,41]. Indeed for irrelevant image-sentence pairs, the attention maps would be almost all zeros while the softmax process would always force attention to be a distribution over the image/words summing to 1.…”

Section: Multi-level Multimodal Attention Mechanism (mentioning)

confidence: 99%

Multi-Level Multimodal Common Semantic Space for Image-Phrase Grounding

Akbari

Karaman

Bhargava

et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)


We address the problem of phrase grounding by learning a multi-level common semantic space shared by the textual and visual modalities. We exploit multiple levels of feature maps of a Deep Convolutional Neural Network, as well as contextualized word and sentence embeddings extracted from a character-based language model. Following dedicated non-linear mappings for visual features at each level, word, and sentence embeddings, we obtain multiple instantiations of our common semantic space in which comparisons between any target text and the visual content are performed with cosine similarity. We guide the model by a multi-level multimodal attention mechanism which outputs attended visual features at each level. The best level is chosen to be compared with text content for maximizing the pertinence scores of image-sentence pairs of the ground truth. Experiments conducted on three publicly available datasets show significant performance gains (20%-60% relative) over the state-of-the-art in phrase localization and set a new performance record on those datasets. We provide a detailed ablation study to show the contribution of each element of our approach and release our code on GitHub.
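The citation statement above contrasts ReLU with softmax on attention maps. The following is a minimal sketch of that contrast, not the authors' implementation: the similarity values and the names `relevant_pair` and `irrelevant_pair` are hypothetical, chosen only to show that ReLU leaves an all-zero map for an irrelevant pair while softmax still yields a distribution summing to 1.

```python
import math


def softmax(scores):
    """Normalize scores into a distribution that always sums to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relu(scores):
    """Zero out negative (dissimilar) scores and keep the rest unchanged."""
    return [max(0.0, s) for s in scores]


# Hypothetical cosine similarities between one word and four image regions.
relevant_pair = [0.8, 0.1, -0.3, -0.5]      # word matches the first region
irrelevant_pair = [-0.2, -0.4, -0.1, -0.3]  # word matches no region

relu_relevant = relu(relevant_pair)
relu_irrelevant = relu(irrelevant_pair)
softmax_irrelevant = softmax(irrelevant_pair)

print("ReLU, relevant pair:     ", relu_relevant)
print("ReLU, irrelevant pair:   ", relu_irrelevant)
print("softmax, irrelevant pair:", softmax_irrelevant)
print("softmax mass:            ", sum(softmax_irrelevant))
```

Under these assumed inputs, the ReLU map for the irrelevant pair carries no attention mass at all, whereas softmax spreads a full unit of attention over regions that do not match the word.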


“…Captioning has also proven to improve performance on image-based multimodal retrieval tasks (Rohrbach et al 2016). Moreover, it is observed (Ramanishka et al 2017) that captioning models can implicitly learn features and attention mechanisms to associate spatiotemporal regions to words in the captions. As for implementation, the paired sentence-clip annotation format in the text-to-clip task allows us to easily add captioning capabilities to our LSTM model.…”

Section: Multi-task Loss (mentioning)

confidence: 99%

Multilevel Language and Vision Integration for Text-to-Clip Retrieval

Plummer

et al. 2019

AAAI

Self Cite


We address the problem of text-based activity retrieval in video. Given a sentence describing an activity, our task is to retrieve matching clips from an untrimmed video. To capture the inherent structures present in both text and video, we introduce a multilevel model that integrates vision and language features earlier and more tightly than prior work. First, we inject text features early on when generating clip proposals, to help eliminate unlikely clips and thus speed up processing and boost performance. Second, to learn a fine-grained similarity metric for retrieval, we use visual features to modulate the processing of query sentences at the word level in a recurrent neural network. A multi-task loss is also employed by adding query re-generation as an auxiliary task. Our approach significantly outperforms prior work on two challenging benchmarks: Charades-STA and ActivityNet Captions.
