2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018)
DOI: 10.1109/cvpr.2018.00911

End-to-End Dense Video Captioning with Masked Transformer

Abstract: Dense video captioning aims to generate text descriptions for all events in an untrimmed video. This involves both detecting and describing events. Therefore, all previous methods on dense video captioning tackle this problem by building two models, i.e. an event proposal and a captioning model, for these two sub-problems. The models are either trained separately or in alternation. This prevents direct influence of the language description on the event proposal, which is important for generating accurate descriptions. […]
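The end-to-end claim in the abstract hinges on making the event proposals differentiable with respect to the captioning loss. The sketch below illustrates that coupling in PyTorch under assumed simplifications (one proposal per video, a soft sigmoid "box" mask, illustrative module names and sizes); it is a minimal sketch of the idea, not the authors' implementation.

```python
# Illustrative sketch: a proposal head predicts soft event boundaries, a
# differentiable mask built from those boundaries gates the frame features
# fed to the caption decoder, so the captioning loss backpropagates into
# the proposal head. Names, shapes, and the mask formula are assumptions.
import torch
import torch.nn as nn

class EndToEndDenseCaptioner(nn.Module):
    def __init__(self, feat_dim=512, vocab_size=1000):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True),
            num_layers=2)
        # Hypothetical proposal head: predicts (center, length) in [0, 1].
        self.proposal_head = nn.Sequential(nn.Linear(feat_dim, 2), nn.Sigmoid())
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=feat_dim, nhead=8, batch_first=True),
            num_layers=2)
        self.word_emb = nn.Embedding(vocab_size, feat_dim)
        self.out = nn.Linear(feat_dim, vocab_size)

    def forward(self, frames, caption_tokens):
        # frames: (B, T, feat_dim); caption_tokens: (B, L)
        B, T, _ = frames.shape
        enc = self.encoder(frames)                          # (B, T, D)
        # One proposal per video from the mean-pooled encoding (simplified).
        center, length = self.proposal_head(enc.mean(dim=1)).unbind(-1)
        t = torch.linspace(0, 1, T, device=frames.device)   # (T,)
        start = (center - length / 2).unsqueeze(1)          # (B, 1)
        end = (center + length / 2).unsqueeze(1)
        sharpness = 50.0                                    # assumed constant
        mask = (torch.sigmoid(sharpness * (t - start)) *
                torch.sigmoid(sharpness * (end - t)))       # (B, T), soft box
        gated = enc * mask.unsqueeze(-1)                    # gradients reach proposals
        tgt = self.word_emb(caption_tokens)                 # (B, L, D)
        dec = self.decoder(tgt, gated)                      # causal mask omitted for brevity
        return self.out(dec)                                # (B, L, vocab)
```

Because the mask is a smooth function of the predicted center and length, the cross-entropy loss on the generated caption can adjust the proposal head directly, which is exactly the coupling the abstract says separately trained two-stage pipelines lack.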

Cited by 507 publications (462 citation statements: 3 supporting, 459 mentioning, 0 contrasting). References 58 publications.
“…The lack of sound makes it significantly more difficult, if not impossible in many cases, to describe the rich flow of the story and constituent events. Armed with this intuition, we focus on dense event captioning [22,43,49] (a.k.a. dense-captioning of events in videos [20]) and endow our models with the ability to utilize rich auditory signals for both event localization and captioning.…”
Section: One Friend Falls Back Into the Chair In Amazement After The… (mentioning, confidence: 99%)
“…Wang et al [43] present a bidirectional proposal method that effectively exploits both past and future contexts to make proposal predictions. In [49], a differentiable masking scheme is used to ensure the consistency between proposal and captioning modules. Li et al [22] propose a descriptiveness regression component to unify the event localization and sentence generation.…”
Section: Related Work (mentioning, confidence: 99%)
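The quoted description of the differentiable masking scheme in [49] can be made concrete in a few lines: a soft boundary mask built from sigmoids lets gradients from a downstream, caption-side loss reach the proposal parameters. The formulation below is an assumed stand-in for demonstration, not the paper's exact mask, and the loss is a placeholder.

```python
# Minimal, self-contained check that a soft boundary mask is differentiable
# with respect to the proposal parameters (assumed sigmoid-box formulation).
import torch

T = 100
t = torch.linspace(0, 1, T)
center = torch.tensor(0.5, requires_grad=True)   # proposal parameters
length = torch.tensor(0.2, requires_grad=True)
start, end = center - length / 2, center + length / 2
mask = torch.sigmoid(50.0 * (t - start)) * torch.sigmoid(50.0 * (end - t))

features = torch.randn(T, 8)                     # stand-in frame features
loss = (features * mask.unsqueeze(-1)).sum()     # stand-in caption-side loss
loss.backward()
print(center.grad, length.grad)                  # non-zero: gradients flow to proposals
```

A hard 0/1 mask over the proposed interval would give zero gradient for the center and length; the sigmoid relaxation is what lets the captioning objective keep the proposal and captioning modules consistent during joint training.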
“…Most related to the present work are several dense captioning approaches that have been applied to instructional videos (Zhou et al, 2018b,c). Zhou et al (2018c) achieve state-of-the-art performance on the dataset we consider; their model is video-only, and combines a region proposal network (Ren et al, 2015) and a Transformer (Vaswani et al, 2017) decoder. Multimodal Video Captioning.…”
Section: Related Work (mentioning, confidence: 99%)
“…As a result, all our experiments operate on a subset of the YouCook2 data. While this makes direct comparison with previous and future work more difficult, our performance metrics can be viewed as lower bounds, as they are trained on less data compared to, e.g., (Zhou et al, 2018c). Unless noted otherwise, our analyses are conducted over 1.4K videos and the 10.6K annotated segments contained therein.…”
Section: Dataset (mentioning, confidence: 99%)
“…The recent success of deep neural networks has enabled end-to-end training in various video understanding tasks such as action recognition [3,5,7,32] and video captioning [28,29,34,39], and many of the models reach significant performances.…”
Section: Introduction (mentioning, confidence: 99%)