2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018)
DOI: 10.1109/cvpr.2018.00714

Interpretable Video Captioning via Trajectory Structured Localization

Cited by 57 publications (30 citation statements)
References 19 publications
“…Recently, state-of-the-art methods based on the encoder-decoder framework seek to make a breakthrough either in the encoding phase [6,10,30,34,56] or in the decoding phase [37,51,62]. Taking methods that focus on the encoding phase as examples, VideoLAB [34] proposes to fuse multiple modalities of source information to improve captioning performance, while PickNet [6] aims to pick the informative frames by reinforcement learning.…”
Section: Related Work
confidence: 99%
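The encoder-decoder pattern referenced in the statement above can be made concrete with a short sketch. The following is a minimal, illustrative PyTorch example, assuming per-frame CNN features have already been extracted offline; the class name, layer sizes, and the plain LSTM encoder/decoder are assumptions for illustration, not the architecture of any cited method.

# Minimal encoder-decoder captioning sketch (illustrative only).
# Assumes pre-extracted per-frame CNN features, e.g. Inception/ResNet vectors.
import torch
import torch.nn as nn

class EncoderDecoderCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000, embed_dim=300):
        super().__init__()
        # Encoder: summarize the frame-feature sequence into a hidden state.
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Decoder: generate the caption word by word, conditioned on the video.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, num_frames, feat_dim) per-frame CNN features
        # captions:    (batch, seq_len) token ids (teacher forcing at train time)
        _, (h, c) = self.encoder(frame_feats)      # video summary in (h, c)
        emb = self.embed(captions)                 # (batch, seq_len, embed_dim)
        dec_out, _ = self.decoder(emb, (h, c))     # decoder conditioned on the video
        return self.out(dec_out)                   # (batch, seq_len, vocab_size)

A typical training step would compute a cross-entropy loss between model(frame_feats, captions[:, :-1]) and captions[:, 1:].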
“…Similar to the experiments on the MSR-VTT dataset, two groups of baselines are compared with our model on the MSVD dataset: (1) fundamental methods, including MP-LSTM with AlexNet as the encoding scheme; S2VT and SA-LSTM, which both use Inception-V4 for encoding; GRU-RCN [2], which leverages recurrent convolutional networks to learn video representations; HRNE [30], which proposes a Hierarchical Recurrent Neural Encoder to capture the temporal information of source videos; LSTM-E [32], which explores decoding with LSTM and visual-semantic embedding simultaneously; LSTM-LS [26], which aims to model the relationships of different video-sequence pairs; h-RNN [62], which employs a paragraph generator to capture inter-sentence dependency through sentence generators; and aLSTMs [12], which models both encoder and decoder using LSTMs with an attention mechanism; (2) three newly published state-of-the-art methods, i.e., PickNet, RecNet, and TSA-ED [56], which extracts spatial-temporal representations at the trajectory level through a structured attention mechanism. The experimental results presented in Table 5 show that our MARN model performs significantly better than the other methods on all metrics except BLEU-4.…”
Section: Comparison on MSVD
confidence: 99%
“…Obviously, under a predefined template with fixed syntactic structure, those methods can hardly generate flexible language descriptions. Nowadays, benefiting from the success of CNNs and RNNs, sequence learning methods [42,53,27,28,8,45,49] are widely used to describe video content with flexible syntactic structure. In [43], Venugopalan et al. obtained the video representation by averaging the CNN features of individual frames, which ignores the temporal information.…”
Section: Video Captioning
confidence: 99%
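The mean-pooling representation attributed to Venugopalan et al. [43] in the statement above is simple enough to sketch directly. The snippet below is an illustrative Python/PyTorch sketch assuming pre-extracted per-frame CNN features; the function name and dimensions are hypothetical.

# Mean-pooled video representation: average per-frame CNN features into one vector.
# Temporal order is discarded, which is exactly the limitation noted above.
import torch

def mean_pool_video(frame_feats: torch.Tensor) -> torch.Tensor:
    """frame_feats: (num_frames, feat_dim) per-frame CNN features.
    Returns a single (feat_dim,) video descriptor."""
    return frame_feats.mean(dim=0)

# Example: 40 frames of 2048-d features collapse to one 2048-d vector.
video_vec = mean_pool_video(torch.randn(40, 2048))
print(video_vec.shape)  # torch.Size([2048])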
“…OA-BTG constructs a bidirectional temporal graph and performs object-aware feature aggregation to achieve the above goals, which helps to generate accurate and fine-grained captions for better performance. TSA-ED [34] also utilizes trajectory information, introducing a trajectory-structured attentional encoder-decoder network that explores fine-grained motion information. Although it extracts dense point trajectories, it loses the object-aware information.…”
Section: Training Details
confidence: 99%
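Attention over trajectory-level features, as mentioned in the statement above, can be illustrated with generic additive (soft) attention applied to per-trajectory feature vectors. This is only a hedged sketch: it is not the structured attention of TSA-ED [34]/[56], and all names and dimensions are assumptions.

# Generic additive attention over per-trajectory features (illustrative sketch).
import torch
import torch.nn as nn

class TrajectoryAttention(nn.Module):
    def __init__(self, traj_dim=512, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.w_traj = nn.Linear(traj_dim, attn_dim)    # project trajectory features
        self.w_state = nn.Linear(hidden_dim, attn_dim) # project decoder state
        self.score = nn.Linear(attn_dim, 1)            # scalar relevance score

    def forward(self, traj_feats, dec_state):
        # traj_feats: (batch, num_trajectories, traj_dim), one vector per trajectory
        # dec_state:  (batch, hidden_dim), current decoder hidden state
        e = torch.tanh(self.w_traj(traj_feats) + self.w_state(dec_state).unsqueeze(1))
        alpha = torch.softmax(self.score(e), dim=1)    # weights over trajectories
        context = (alpha * traj_feats).sum(dim=1)      # (batch, traj_dim) context vector
        return context, alpha.squeeze(-1)

At each decoding step, the context vector would be concatenated with the word embedding (or decoder input) so the caption generator can focus on the most relevant motion trajectories.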