2017
DOI: 10.1007/978-3-319-54190-7_7

Spatio-Temporal Attention Models for Grounded Video Captioning

Abstract: Automatic video captioning is challenging due to the complex interactions in dynamic real scenes. A comprehensive system would ultimately localize and track the objects, actions and interactions present in a video and generate a description that relies on temporal localization in order to ground the visual concepts. However, most existing automatic video captioning systems map from raw video data to high-level textual description, bypassing localization and recognition, thus discarding potentially va…
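
Read schematically, grounding via attention means that at each word-generation step the decoder weights spatio-temporal region features, and those weights indicate where in the video the emitted word is grounded. Below is a minimal PyTorch-style sketch of one such attention step; the module name, dimensions, and the additive-attention form are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of spatio-temporal attention for grounded caption decoding.
# Dimensions, names, and the additive-attention form are illustrative
# assumptions, not the architecture from the paper.
import torch
import torch.nn as nn

class SpatioTemporalAttention(nn.Module):
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.state_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, h):
        # feats: (B, T*R, feat_dim) region features over T frames and R regions per frame
        # h:     (B, hidden_dim)    current decoder LSTM state
        e = self.score(torch.tanh(self.feat_proj(feats)
                                  + self.state_proj(h).unsqueeze(1)))  # (B, T*R, 1)
        alpha = torch.softmax(e, dim=1)        # attention over space and time
        context = (alpha * feats).sum(dim=1)   # (B, feat_dim) grounded context vector
        return context, alpha                  # alpha localizes the word being generated

# Usage with random tensors: 30 frames x 5 regions per frame, batch of 2.
attn = SpatioTemporalAttention(feat_dim=512, hidden_dim=256, attn_dim=128)
feats = torch.randn(2, 30 * 5, 512)
h = torch.randn(2, 256)
context, alpha = attn(feats, h)   # context would feed the next decoder step
```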

Cited by 29 publications (23 citation statements; citing years 2017–2024) | References 36 publications
“…They first use off-the-shelf or fine-tuned object detectors to propose object proposals/detections for the visual recognition heavy lifting. Then, in the second stage, they either attend to the object regions dynamically [17,39,1] or classify the regions into labels and fill them into pre-defined/generated sentence templates [16,5,13]. However, directly generating proposals from off-the-shelf detectors biases the proposals towards classes in the source dataset (i.e.…”
Section: Related Work
confidence: 99%
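
As a concrete illustration of the second strategy quoted above (classify regions into labels and fill a sentence template), the sketch below labels region features by nearest prototype and slots them into a fixed pattern. The label set, the classifier, and the template are invented for this example; they are not taken from the cited systems.

```python
# Illustration of the "classify regions and fill a template" strategy.
# Label prototypes, the nearest-prototype classifier, and the template are
# invented for this sketch; they do not come from the cited works.
import numpy as np

LABELS = ["person", "dog", "ball", "running", "catching"]
PROTOTYPES = np.random.default_rng(0).normal(size=(len(LABELS), 512))

def classify_region(feat):
    """Nearest-prototype labeling of a single 512-d region feature vector."""
    sims = PROTOTYPES @ feat / (np.linalg.norm(PROTOTYPES, axis=1) * np.linalg.norm(feat) + 1e-8)
    return LABELS[int(np.argmax(sims))]

def fill_template(region_feats):
    """Fill the fixed pattern 'A <subject> is <verb> in the video.'"""
    labels = [classify_region(f) for f in region_feats]
    nouns = [l for l in labels if l in ("person", "dog", "ball")]
    verbs = [l for l in labels if l in ("running", "catching")]
    subject = nouns[0] if nouns else "something"
    verb = verbs[0] if verbs else "moving"
    return f"A {subject} is {verb} in the video."

regions = [np.random.default_rng(i).normal(size=512) for i in range(3)]
print(fill_template(regions))
```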
“…Over the past years there has been increased interest in video description generation, notably with the broader adoption of deep learning techniques. S2VT [58] was among the first approaches based on LSTMs [19,11]; some of the later ones include [38,49,52,68,72,73]. Most recently, a number of approaches to video description have been proposed, such as replacing the LSTM with a Transformer Network [76], introducing a reconstruction objective [59], using bidirectional attention fusion for context modeling [61], and others [7,13,33].…”
Section: Related Work
confidence: 99%
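
For context, the S2VT-style pipeline mentioned in this excerpt is a sequence-to-sequence LSTM: per-frame CNN features are encoded step by step, then words are decoded from the resulting state. The sketch below is a generic encoder-decoder reading of that idea in PyTorch, not the original S2VT implementation; layer sizes and the greedy decoding loop are assumptions.

```python
# Generic LSTM encoder-decoder for video captioning, in the spirit of
# sequence-to-sequence models such as S2VT. Sizes and greedy decoding
# are illustrative assumptions, not the original implementation.
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, vocab=10000, embed=300):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab, embed)
        self.decoder = nn.LSTM(embed, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, frame_feats, max_len=20, bos_id=1):
        # frame_feats: (B, T, feat_dim) pre-extracted CNN features, one per frame
        _, state = self.encoder(frame_feats)           # summarize the whole clip
        token = torch.full((frame_feats.size(0), 1), bos_id,
                           dtype=torch.long, device=frame_feats.device)
        words = []
        for _ in range(max_len):                       # greedy word-by-word decoding
            emb = self.embed(token)                    # (B, 1, embed)
            dec_out, state = self.decoder(emb, state)  # decoder reuses encoder state
            token = self.out(dec_out[:, -1]).argmax(dim=-1, keepdim=True)
            words.append(token)
        return torch.cat(words, dim=1)                 # (B, max_len) predicted word ids

caps = VideoCaptioner()(torch.randn(2, 30, 2048))      # two clips of 30 frames
```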
“…Typically the focus is to generate a single sentence about a single image [7,20,28,52,55], video [7,13,35,39,48], or, most closely to this work, movie clip [36,51]. Several works also produce grounding while generating the description: [55] propose an attention mechanism to ground each word to spatial CNN image features, [57] extend this to bounding boxes, [56] to video frames, and [61] to spatial-temporal proposals. [25] look into evaluating attention correctness for image captioning.…”
Section: Related Work
confidence: 99%
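
A common way to check whether such attention is actually grounded, in the spirit of the attention-correctness evaluation mentioned above, is a pointing-game style test: does the most-attended spatial cell fall inside the annotated bounding box for the word? The sketch below assumes a flat attention map over a fixed grid and a box in pixel coordinates; the grid size and coordinate convention are assumptions, not the protocol of any cited paper.

```python
# Pointing-game style check of attention grounding: is the most-attended
# grid cell inside the ground-truth box for the word? Grid size and the
# coordinate convention are assumptions for illustration only.
import numpy as np

def attention_hits_box(alpha, box, grid_hw, image_hw):
    """alpha: flat attention over an (H_g x W_g) grid; box: (x1, y1, x2, y2) in pixels."""
    h_g, w_g = grid_hw
    img_h, img_w = image_hw
    row, col = divmod(int(np.argmax(alpha)), w_g)
    # center of the most-attended cell, mapped back to pixel coordinates
    cx = (col + 0.5) * img_w / w_g
    cy = (row + 0.5) * img_h / h_g
    x1, y1, x2, y2 = box
    return x1 <= cx <= x2 and y1 <= cy <= y2

alpha = np.zeros(7 * 7)
alpha[3 * 7 + 4] = 1.0   # attention peak at grid cell (row 3, col 4)
print(attention_hits_box(alpha, box=(200, 150, 320, 260),
                         grid_hw=(7, 7), image_hw=(448, 448)))
```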