Spatio-Temporal Graph for Video Captioning With Knowledge Distillation

Pan, Boxiao; Cai, Haoye; Huang, De-An; Lee, Kuan‐Hui; Gaidon, Adrien; Adeli, Ehsan; Niebles, Juan Carlos

doi:10.1109/cvpr42600.2020.01088

Cited by 213 publications

(118 citation statements)

References 28 publications

Mentioning: 117

“…Table 1 displays the performance of several models on YouTube2Text. We compare our model with existing methods, including LSTM-E (Pan et al, 2016), h-RNN (Yu et al, 2016), aLSTMs (Gao et al, 2017), SCN (Gan et al, 2017), MTVC (Pasunuru and Bansal, 2017a), ECO (Zolfaghari et al, 2018), SibNet (Liu et al, 2018), POS (Wang et al, 2019a), MARN (Pei et al, 2019), JSRL-VCT (Hou et al, 2019), GRU-EVE (Aafaq et al, 2019), STG-KD (Pan et al, 2020), SAAT (Zheng et al, 2020), and ORG-TRL (Zhang et al, 2020). Our method outperforms all the other methods on all the metrics by a large margin.…”

Section: Methods (mentioning)

confidence: 99%

“…Table 2 displays the evaluation results of several video captioning models on the MSR-VTT. In this table, we compare our model with existing models, including MTVC (Pasunuru and Bansal, 2017a), CIDEnt-RL (Pasunuru and Bansal, 2017b), SibNet (Liu et al, 2018), HACA (Wang et al, 2018), TAMoE (Wang et al, 2019b), POS (Wang et al, 2019a), MARN (Pei et al, 2019), JSRL-VCT (Hou et al, 2019), GRU-EVE (Aafaq et al, 2019), STG-KD (Pan et al, 2020), SAAT (Zheng et al, 2020), ORG-TRL (Zhang et al, 2020). According to the overall score defined in (16), ORG-TRL is the best among existing models.…”

Section: Methods (mentioning)

confidence: 99%

“…By aggregating different experts on different known activities, Wang et al (2019b) take advantage of external textual corpora and transfer knowledge to unseen data for zero-shot video captioning. A spatio-temporal graph model is built to find object interactions and knowledge distillation mechanism is proposed to increase stability of performance (Pan et al, 2020).…”

Section: Related Work (mentioning)

confidence: 99%


A Semantics-Assisted Video Captioning Model Trained With Scheduled Sampling

et al. 2020


Given the features of a video, recurrent neural networks can be used to automatically generate a caption for the video. Existing methods for video captioning have at least three limitations. First, semantic information has been widely applied to boost the performance of video captioning models, but existing networks often fail to provide meaningful semantic features. Second, the Teacher Forcing algorithm is often utilized to optimize video captioning models, but during training and inference, different strategies are applied to guide word generation, leading to poor performance. Third, current video captioning models are prone to generate relatively short captions that express video contents inappropriately. Toward resolving these three problems, we suggest three corresponding improvements. First of all, we propose a metric to compare the quality of semantic features, and utilize appropriate features as input for a semantic detection network (SDN) with adequate complexity in order to generate meaningful semantic features for videos. Then, we apply a scheduled sampling strategy that gradually transfers the training phase from a teacher-guided manner toward a more self-teaching manner. Finally, the ordinary logarithm probability loss function is leveraged by sentence length so that the inclination of generating short sentences is alleviated. Our model achieves better results than previous models on the YouTube2Text dataset and is competitive with the previous best model on the MSR-VTT dataset.
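The scheduled sampling and length-aware loss described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the inverse-sigmoid decay schedule, the constant `k`, the per-token averaging used for the length adjustment, and all function names are assumptions chosen for illustration, since the abstract does not state these details.

```python
import math
import random


def teacher_forcing_prob(epoch, k=10.0):
    """Probability of feeding the ground-truth token at a given epoch.

    Assumed inverse-sigmoid decay: close to 1 early in training
    (teacher-guided) and approaching 0 later (self-teaching).
    """
    return k / (k + math.exp(epoch / k))


def choose_input(ground_truth_token, predicted_token, epoch, rng=random):
    """Pick the decoder's next input token under scheduled sampling."""
    p = teacher_forcing_prob(epoch)
    return ground_truth_token if rng.random() < p else predicted_token


def length_normalized_nll(token_log_probs):
    """Negative log-likelihood averaged over sentence length.

    Dividing by the number of tokens is one assumed way to reduce the
    bias toward short captions; the paper's exact form may differ.
    """
    return -sum(token_log_probs) / len(token_log_probs)
```

Under these assumptions, early epochs almost always feed the reference word to the decoder, while late epochs almost always feed the model's own prediction, which narrows the gap between training and inference.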


“…The authors of [128] presented an unsupervised image captioning framework based on a new alignment method that allows the simultaneous integration of visual and textual streams through semantic learning of multimodal embeddings of the language and vision domains. Moreover, a multimodal model can also aggregate motion information [174], acoustic information [175], temporal information [176], etc. from successive frames to assign a caption for each one.…”

Section: Image Captioning (mentioning)

confidence: 99%

A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets

et al. 2021


The research progress in multimodal learning has grown rapidly over the last decade in several areas, especially in computer vision. The growing potential of multimodal data streams and deep learning algorithms has contributed to the increasing universality of deep multimodal learning. This involves the development of models capable of processing and analyzing the multimodal information uniformly. Unstructured real-world data can inherently take many forms, also known as modalities, often including visual and textual content. Extracting relevant patterns from this kind of data is still a motivating goal for researchers in deep learning. In this paper, we seek to improve the understanding of key concepts and algorithms of deep multimodal learning for the computer vision community by exploring how to generate deep models that consider the integration and combination of heterogeneous visual cues across sensory modalities. In particular, we summarize six perspectives from the current literature on deep multimodal learning, namely: multimodal data representation, multimodal fusion (i.e., both traditional and deep learning-based schemes), multitask learning, multimodal alignment, multimodal transfer learning, and zero-shot learning. We also survey current multimodal applications and present a collection of benchmark datasets for solving problems in various vision domains. Finally, we highlight the limitations and challenges of deep multimodal learning and provide insights and directions for future research.


“…Chen et al [22,23] processed natural languages in video captioning, focusing on image objects. Most strategies in the last two years, such as dual-stream recurrent neural network, object relational graph (ORG) with teacher-recommended learning (TRL), and spatio-temporal graph with knowledge distillation (STG-KD) [24][25][26], are optimized with features of video images. Few of them take the natural captioning of sentences into full consideration.…”

Section: Literature Review (mentioning)

confidence: 99%

Label Importance Ranking with Entropy Variation Complex Networks for Structured Video Captioning

Wei, Hu 2021


Structured video captioning is a fundamental yet challenging task in both computer vision and artificial intelligence (AI). The prevalent approach is to map an input video to a variable-length output sentence with models like recurrent neural network (RNN). This paper presents a new model based on an improved scene-aware bidirectional long short-term memory network (SABi-LSTM), and names the model as label importance ranking with entropy variation complex networks of structured video captions. Structured video captioning is a three-level structured system, including a multi-feature fusion level, an SABi-LSTM level, and a label importance ranking level. The system decomposes structures of multiple levels and dimensions from different perspectives to perform video captioning. This work affirms the theoretical and practical significance of label importance ranking to video caption generation, and regards entropy as a local level metric to quantify label importance. Hence, entropy variation was proposed to define label importance, namely, the variation of the network entropy through label removal. It is assumed that the removal of an important label could cause sustainable variation to the structure. Hence, the authors defined the label importance ranking with entropy variation complex network algorithm to calculate the weight model of label nodes marked by video, and obtain the final caption of the video. Empirical results on Microsoft Video Caption (MSVD) dataset and MSR-Video to Text (MSR-VTT) dataset demonstrate the superiority of our approach for structured video captioning, especially on MSVD dataset.
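The idea of scoring a label by how much the network entropy changes when that label is removed can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not define the entropy, so the degree-distribution entropy used here, the undirected adjacency-dictionary representation, and the function names are all hypothetical choices rather than the authors' algorithm.

```python
import math


def degree_entropy(adj):
    """Shannon entropy of the normalized degree distribution (assumed definition)."""
    degrees = [len(neighbors) for neighbors in adj.values()]
    total = sum(degrees)
    if total == 0:
        return 0.0
    return -sum((d / total) * math.log(d / total) for d in degrees if d > 0)


def remove_node(adj, node):
    """Return a copy of the graph without the node and its incident edges."""
    return {
        u: {v for v in neighbors if v != node}
        for u, neighbors in adj.items()
        if u != node
    }


def rank_labels(adj):
    """Rank label nodes by the entropy variation caused by removing each one."""
    base = degree_entropy(adj)
    scores = {
        node: abs(base - degree_entropy(remove_node(adj, node)))
        for node in adj
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical label graph: "person" co-occurs with every other label.
labels = {
    "person": {"dog", "ball", "park"},
    "dog": {"person"},
    "ball": {"person"},
    "park": {"person"},
}
ranking = rank_labels(labels)
```

In this toy graph, removing the hub label disconnects everything, so it produces the largest entropy change and is ranked first.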
