2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr42600.2020.01088

Spatio-Temporal Graph for Video Captioning With Knowledge Distillation

Cited by 213 publications (118 citation statements: 1 supporting, 117 mentioning, 0 contrasting)
References 28 publications

“…Table 1 displays the performance of several models on YouTube2Text. We compare our model with existing methods, including LSTM-E (Pan et al., 2016), h-RNN (Yu et al., 2016), aLSTMs (Gao et al., 2017), SCN (Gan et al., 2017), MTVC (Pasunuru and Bansal, 2017a), ECO (Zolfaghari et al., 2018), SibNet (Liu et al., 2018), POS (Wang et al., 2019a), MARN (Pei et al., 2019), JSRL-VCT (Hou et al., 2019), GRU-EVE (Aafaq et al., 2019), STG-KD (Pan et al., 2020), SAAT (Zheng et al., 2020), and ORG-TRL (Zhang et al., 2020). Our method outperforms all the other methods on all the metrics by a large margin.…”
Section: Methods (mentioning)
confidence: 99%
“…Table 2 displays the evaluation results of several video captioning models on the MSR-VTT dataset. In this table, we compare our model with existing models, including MTVC (Pasunuru and Bansal, 2017a), CIDEnt-RL (Pasunuru and Bansal, 2017b), SibNet (Liu et al., 2018), HACA (Wang et al., 2018), TAMoE (Wang et al., 2019b), POS (Wang et al., 2019a), MARN (Pei et al., 2019), JSRL-VCT (Hou et al., 2019), GRU-EVE (Aafaq et al., 2019), STG-KD (Pan et al., 2020), SAAT (Zheng et al., 2020), and ORG-TRL (Zhang et al., 2020). According to the overall score defined in (16), ORG-TRL is the best among existing models.…”
Section: Methods (mentioning)
confidence: 99%
“…The authors of [128] presented an unsupervised image captioning framework based on a new alignment method that allows the simultaneous integration of visual and textual streams through semantic learning of multimodal embeddings of the language and vision domains. Moreover, a multimodal model can also aggregate motion information [174], acoustic information [175], temporal information [176], etc., from successive frames to assign a caption to each one.…”
Section: Image Captioning (mentioning)
confidence: 99%
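As a concrete illustration of the multimodal aggregation this passage describes, the sketch below fuses per-frame appearance, motion, and audio features into a single clip-level vector that a caption decoder could condition on. It is a minimal PyTorch sketch; the module name, feature dimensions, and mean-pooling fusion are illustrative assumptions, not the architecture of any cited model.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Hypothetical fusion module: project appearance, motion, and
    audio streams into a shared space, then pool over time."""

    def __init__(self, d_app=2048, d_motion=1024, d_audio=128, d_model=512):
        super().__init__()
        self.proj_app = nn.Linear(d_app, d_model)        # e.g. 2D-CNN features
        self.proj_motion = nn.Linear(d_motion, d_model)  # e.g. 3D-CNN features
        self.proj_audio = nn.Linear(d_audio, d_model)    # e.g. audio embeddings

    def forward(self, app, motion, audio):
        # Each input is (batch, num_frames, d_*); the linear layers act on
        # the last dimension. Concatenating along time and mean-pooling
        # yields one (batch, d_model) vector for a caption decoder.
        fused = torch.cat([self.proj_app(app),
                           self.proj_motion(motion),
                           self.proj_audio(audio)], dim=1)
        return fused.mean(dim=1)
```

In practice an attention-based pooling over frames is common instead of the plain mean used here; the sketch only shows the shape of the aggregation.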
“…Chen et al. [22,23] processed natural language in video captioning, focusing on image objects. Most strategies in the last two years, such as the dual-stream recurrent neural network, the object relational graph (ORG) with teacher-recommended learning (TRL), and the spatio-temporal graph with knowledge distillation (STG-KD) [24][25][26], are optimized with features of video images. Few of them take the natural captioning of sentences into full consideration.…”
Section: Literature Review (mentioning)
confidence: 99%
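Since this passage (and the paper's title) centers on knowledge distillation, a short sketch of the generic distillation objective may clarify the idea: a student's softened output distribution is pulled toward a teacher's while the student still fits the ground-truth caption tokens. This is the standard distillation formulation in general, not STG-KD's exact loss; the temperature, weighting, and tensor shapes are illustrative assumptions.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Hypothetical blended loss: a KL term pulling the student toward
    the teacher's softened distribution, plus cross-entropy against
    ground-truth tokens.

    student_logits, teacher_logits: (batch, vocab_size)
    labels: (batch,) ground-truth token indices
    """
    # Soften both distributions; scale the KL term by T^2 so its
    # gradient magnitude stays comparable across temperatures.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```

Setting alpha to 0 recovers plain supervised training; raising it shifts weight toward imitating the teacher.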