Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.98

Semi-Supervised Learning for Video Captioning

Abstract: Deep neural networks have achieved great success on video captioning in the supervised learning setting. However, annotating videos with descriptions is very expensive and time-consuming. If the video captioning algorithm can benefit from a large number of unlabeled videos, the cost of annotation can be reduced. In the proposed study, we make the first attempt to train the video captioning model on labeled and unlabeled data jointly, in a semi-supervised learning manner. For labeled data, we train them with the tr…
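The abstract describes training a captioning model on labeled and unlabeled videos jointly. Below is a minimal sketch of what such a joint training step could look like, assuming a generic pseudo-labeling objective on the unlabeled videos; the abstract is truncated before the actual unlabeled-data loss is stated, so this term, the model interface, the greedy_decode helper, and the weight lambda_u are all illustrative assumptions rather than the paper's method.

# Hedged sketch: joint step over a labeled and an unlabeled video batch (PyTorch).
# The pseudo-labeling term on unlabeled videos is an assumption for illustration.
import torch
import torch.nn.functional as F

def semi_supervised_step(model, labeled_batch, unlabeled_batch, optimizer, lambda_u=0.5):
    video_l, captions_l = labeled_batch   # (B, T, D) video features, (B, L) token ids
    video_u = unlabeled_batch             # (B, T, D) video features, no captions

    # Supervised term: cross-entropy against ground-truth captions (teacher forcing).
    logits_l = model(video_l, captions_l[:, :-1])
    loss_sup = F.cross_entropy(
        logits_l.reshape(-1, logits_l.size(-1)),
        captions_l[:, 1:].reshape(-1),
        ignore_index=0,                   # assume 0 is the padding token id
    )

    # Unsupervised term (assumed): decode pseudo-captions without gradients,
    # then train the model to reproduce them on the same unlabeled videos.
    with torch.no_grad():
        pseudo_caps = model.greedy_decode(video_u)   # hypothetical decoding helper
    logits_u = model(video_u, pseudo_caps[:, :-1])
    loss_unsup = F.cross_entropy(
        logits_u.reshape(-1, logits_u.size(-1)),
        pseudo_caps[:, 1:].reshape(-1),
        ignore_index=0,
    )

    loss = loss_sup + lambda_u * loss_unsup
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

In practice the weight lambda_u would be tuned, and the unlabeled term could equally be a consistency or distillation loss; the sketch only illustrates the "labeled and unlabeled data trained jointly" idea from the abstract.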

Cited by 3 publications (1 citation statement)
References 35 publications (41 reference statements)
“…However, these models ignore the saliency information present in a video, since the visual regions are often ambiguous and over-sampled, which makes the captioning task more challenging. Some works (Lin, Gan, and Wang 2020; Liu et al. 2019) have focused on ROIs and visual grounding by decomposing expressions into logic structures. Yet these methods are not able to preserve the semantic structure of salient regions to generate human-like descriptions.…”
Section: Introduction
confidence: 99%