2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.00676

Adversarial Inference for Multi-Sentence Video Description

Abstract: While significant progress has been made in the image captioning task, video description is still in its infancy due to the complex nature of video data. Generating multi-sentence descriptions for long videos is even more challenging. Among the main issues are the fluency and coherence of the generated descriptions, and their relevance to the video. Recently, reinforcement and adversarial learning based methods have been explored to improve the image captioning models; however, both types of methods suffer from…
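
To illustrate the adversarial-inference idea named in the title (using a learned discriminator to judge candidate sentences at test time rather than only during training), here is a minimal PyTorch sketch. It shows the general technique only, not the authors' model; the class CaptionDiscriminator, its scoring architecture, and the rerank helper are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

class CaptionDiscriminator(nn.Module):
    """Hypothetical discriminator: scores how plausible and relevant a
    candidate sentence is for a given video clip (higher is better)."""
    def __init__(self, vocab_size, embed_dim=300, video_dim=2048, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.sent_enc = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim * 2, 1)

    def forward(self, video_feat, token_ids):
        # video_feat: (B, video_dim) pooled clip features
        # token_ids:  (B, T) word indices of one candidate sentence
        _, h = self.sent_enc(self.embed(token_ids))        # h: (1, B, H)
        sent = h.squeeze(0)                                # (B, H)
        vid = torch.relu(self.video_proj(video_feat))      # (B, H)
        return self.score(torch.cat([sent, vid], dim=-1)).squeeze(-1)

@torch.no_grad()
def rerank(disc, video_feat, candidates):
    """Keep the candidate the discriminator scores highest.
    `candidates`: list of (token_ids, text) pairs sampled from a caption
    generator; the sampling step is assumed to exist elsewhere."""
    scores = [disc(video_feat, ids.unsqueeze(0)).item() for ids, _ in candidates]
    return candidates[max(range(len(scores)), key=scores.__getitem__)][1]
```

In this setup a generator samples several candidate sentences per clip, and the discriminator's score decides which one is kept, rather than the discriminator only shaping the generator during training.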

Cited by 89 publications (83 citation statements) | References 91 publications
“…If they are used in a DL network directly, the training time becomes extremely long and substantial computational resources are occupied due to the large number of layers. So, commonly, researchers apply a ResNet pre-trained on the ImageNet dataset to extract visual features from images, and a 3D ResNeXt pre-trained on the Kinetics dataset to extract spatio-temporal features from videos [11]. These features are then fed to the DL network as part of its inputs.…”
Section: B. Framework Architecture and Methods
Citation type: mentioning, confidence: 99%
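
The extraction recipe quoted above is straightforward to reproduce. Below is a minimal sketch using PyTorch/torchvision (weights API from torchvision >= 0.13), assuming a ResNet-50 with its classification head removed so it emits 2048-d pooled features; the video path would swap in a Kinetics-pretrained 3D network, which is omitted here since torchvision does not ship the exact 3D ResNeXt the statement mentions.

```python
import torch
import torch.nn as nn
from PIL import Image
from torchvision import models, transforms

# ResNet-50 pre-trained on ImageNet, classifier head removed, so the
# network outputs 2048-d pooled visual features per image.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
feature_extractor = nn.Sequential(*list(resnet.children())[:-1]).eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def image_features(path):
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)  # (1, 3, 224, 224)
    return feature_extractor(x).flatten(1)                        # (1, 2048)
```

These frozen, pre-extracted features are what get cached to disk and fed to the downstream captioning network, exactly so the deep backbone never has to run inside the training loop.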
“…It can be used to replace the 2D LSTM network. Much CV research has shown that better results can be obtained when these techniques are applied jointly to make full use of the visual data [9], [11]. So, a single suitable CV technique, or an adequate combination of several, is required to deal with a specific problem in wireless systems.…”
Section: B. The Selection of CV Techniques
Citation type: mentioning, confidence: 99%
“…The prevailing video captioning techniques often adopt the encoder-decoder pipeline inspired by the first successful sequence-to-sequence model, S2VT [30]. Benefiting from the rapid development of deep learning, video captioning models have achieved remarkable advances using attention mechanisms [27,39,45], memory networks [3,14,21,31], reinforcement learning [13,20,33], and generative adversarial networks [19,42]. Although these encoder-decoder-based methods have reached impressive performance on automatic metrics, they often neglect how well the generated caption words (e.g., objects) are grounded in the video, making the models less explainable and trustworthy.…”
Section: Related Work
Citation type: mentioning, confidence: 99%
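
For context on the encoder-decoder pipeline this statement refers to, here is a simplified PyTorch sketch in the spirit of S2VT: one LSTM encodes per-frame features and a second LSTM decodes words from the resulting state. It is a schematic variant, not the S2VT reference implementation (which shares a single stacked LSTM across encoding and decoding), and it omits attention, scheduled sampling, and beam search.

```python
import torch
import torch.nn as nn

class Seq2SeqCaptioner(nn.Module):
    """Simplified encoder-decoder video captioner: encode a sequence of
    frame features, then decode a word sequence from the final state."""
    def __init__(self, feat_dim=2048, vocab_size=10000, hidden=512, embed=300):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, embed)
        self.decoder = nn.LSTM(embed, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (B, T_frames, feat_dim); captions: (B, T_words)
        _, state = self.encoder(frame_feats)            # summarize the clip
        dec_out, _ = self.decoder(self.embed(captions), state)
        return self.out(dec_out)                        # (B, T_words, vocab)
```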
“…Recently, video captioning [10], the task of automatically generating a sequence of natural-language words to describe a video, has drawn increasing attention [18,19,33,51]. However, these models are known to have poor grounding performance, which leads to object hallucination [23].…”
Section: Introduction
Citation type: mentioning, confidence: 99%
“…It can replace the 2D LSTM network. Much CV research has shown that better results can be obtained when these techniques are applied jointly to make full use of the visual data [9], [11].…”
Section: B. Selecting CV Techniques
Citation type: mentioning, confidence: 99%