2019 IEEE/CVF International Conference on Computer Vision (ICCV) 2019
DOI: 10.1109/iccv.2019.00273

Controllable Video Captioning With POS Sequence Guidance Based on Gated Fusion Network

Abstract: In this paper, we propose to guide the video caption generation with Part-of-Speech (POS) information, based on a gated fusion of multiple representations of input videos. We construct a novel gated fusion network, with one particularly designed cross-gating (CG) block, to effectively encode and fuse different types of representations, e.g., the motion and content features of an input video. One POS sequence generator relies on this fused representation to predict the global syntactic structure, which is there…

Cited by 162 publications
(100 citation statements)
References 50 publications
“…We propose the DL network framework shown in Fig. 3 composed of ResNet [2], 3D ResNext [8], a feature-fusion module (FFM) [9], and predictive network.…”
Section: B Framework Architecture and Methods
Citation type: mentioning
confidence: 99%
“…It can replace the 2D LSTM network. Much CV research has shown that if these techniques are jointly applied to make full use of the visual data, better results can be obtained [9], [11].…”
Section: B Selecting CV Techniques
Citation type: mentioning
confidence: 99%
“…3. It is composed of ResNet [2], 3D ResNext [8], feature fusion module (FFM) [9] and predictive network which will be elaborated as below.…”
Section: B Framework Architecture and Methods
Citation type: mentioning
confidence: 99%
“…It can be used to replace the 2D LSTM network. Many CV pieces of research have shown that if these techniques can be jointly applied to make full use of the visual data, better results can be obtained [9], [11]. So, a single proper CV technique or an adequate combination of several CV techniques are required to deal with a specific problem in wireless systems.…”
Section: B The Selection of CV Techniques
Citation type: mentioning
confidence: 99%
“…Besides, to make the generated captions more diverse and accurate, Deshpande et al leveraged the quantized Part-of-Speech (POS) tag sequence sampled from a given benchmark to condition word prediction at the decoding recurrent model [6]. Wang et al tried to predict the POS sequence tag by tag from the input video, and then embedded them as a global POS representation to gate the inputs of the sentence decoder for syntax control [32]. With manually altering the predicted POS tag sequence, Wang et al showed that they can obtain captions with different syntaxes.…”
Section: Controllable Captioning With Auxiliary Information Guidance
Citation type: mentioning
confidence: 99%