2017
DOI: 10.1007/978-3-319-54407-6_18
Video Captioning via Sentence Augmentation and Spatio-Temporal Attention

Cited by 5 publications (2 citation statements); references 36 publications.
“…In particular, at each word generation step, the decoder takes as input the video features weighted according to their relevance to the next word, based on the previously emitted words [31], [32], [38], [39]. Following the same principle, in [40] the attention mechanism is applied to the mean-pooled features of a predefined number of object tracklets in the video. In [41], the textual information is used to select Regions-of-Interest (ROIs) in the video frames, whose descriptors are combined with those of the global frame in a Dual Memory Recurrent Model.…”
Section: Related Work
confidence: 99%
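The statement above describes soft temporal attention: at each decoding step, per-frame features are weighted by their relevance to the decoder's current state before being fed to the word predictor. A minimal NumPy sketch of that weighting step follows; the projection matrices `W_f`, `W_h`, and the score vector `w` stand in for learned parameters and are filled with random values here, and the additive (tanh) scoring form is one common choice, not necessarily the exact one used in the cited works.

```python
import numpy as np

def temporal_attention(frame_feats, decoder_state, W_f, W_h, w):
    """Weight per-frame features by relevance to the decoder state.

    frame_feats:   (T, D) array of frame descriptors
    decoder_state: (H,) hidden state summarizing previously emitted words
    W_f (K, D), W_h (K, H), w (K,): learned projections (random stand-ins here)
    Returns the attended context vector (D,) and the attention weights (T,).
    """
    # Additive attention score per frame: w^T tanh(W_f x_t + W_h h)
    scores = np.tanh(frame_feats @ W_f.T + decoder_state @ W_h.T) @ w  # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()            # softmax over the T frames
    context = weights @ frame_feats     # (D,) relevance-weighted feature mean
    return context, weights

rng = np.random.default_rng(0)
T, D, H, K = 8, 16, 32, 24
ctx, att = temporal_attention(rng.normal(size=(T, D)),
                              rng.normal(size=H),
                              rng.normal(size=(K, D)),
                              rng.normal(size=(K, H)),
                              rng.normal(size=K))
```

The context vector `ctx` replaces a plain mean over frames as the decoder input, so frames judged relevant to the next word dominate the summary.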
“…Few works on NLVD operate at the input level. In [40], data augmentation is proposed at the sentence level: the authors enrich the sentence part of MSR-VTT with sentences from MSVD.…”
Section: Related Work
confidence: 99%
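The sentence-level augmentation described here amounts to borrowing captions from a second corpus (MSVD) to enlarge the caption pool of the target one (MSR-VTT). A toy sketch of one way such borrowing could work is below; the word-overlap (Jaccard) matching criterion and the `threshold` value are illustrative assumptions, not the selection rule of [40].

```python
def jaccard(a, b):
    """Word-set Jaccard similarity between two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def augment_captions(target_caps, pool, threshold=0.5):
    """Extend each video's caption list with sufficiently similar pool sentences.

    target_caps: dict mapping video id -> list of its own captions
    pool:        list of candidate sentences from the auxiliary corpus
    """
    augmented = {}
    for vid, caps in target_caps.items():
        extra = [s for s in pool
                 if any(jaccard(s, c) >= threshold for c in caps)]
        augmented[vid] = caps + extra
    return augmented

caps = {"v1": ["a man is cooking"]}
pool = ["a man is cooking food", "a dog runs outside"]
aug = augment_captions(caps, pool, threshold=0.5)
```

Here the near-duplicate pool sentence is attached to `v1` while the unrelated one is discarded, giving the decoder more paraphrases per video at training time.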