2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.00030
Sketch, Ground, and Refine: Top-Down Dense Video Captioning

Cited by 57 publications (34 citation statements). References 27 publications.
“…Recent work (Zhou et al., 2018a; Li et al., 2018; Zhou et al., 2018c; Mun et al., 2019; Iashin and Rahtu, 2020) follows the two-stage "detect-then-describe" framework, in which the event proposal module first predicts a set of event segments, then the captioning module constructs captions for each candidate event segment. Another line of work (Deng et al., 2021; Wang et al., 2021) removes the explicit event proposing process. Deng et al. (2021) tackle the DVC task from a top-down perspective, in which they first generate a video-level story, then ground each sentence in the story into a video segment.…”
Section: Multimodal Transformer (mentioning)
confidence: 99%
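The two paradigms contrasted in the statement above can be summarized in a minimal sketch. This is an illustrative outline only, not code from the cited papers; the function names, the `Segment` type, and the placeholder callables are all hypothetical.

```python
# Illustrative sketch (not from the cited papers): contrasting the two dense
# video captioning (DVC) paradigms described above. All names are hypothetical.

from typing import Callable, List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec) of one event

def detect_then_describe(video_feats,
                         propose_events: Callable[[object], List[Segment]],
                         caption_segment: Callable[[object, Segment], str]) -> List[Tuple[Segment, str]]:
    """Bottom-up, two-stage pipeline: first localize events, then caption each one."""
    segments = propose_events(video_feats)                      # event proposal module
    return [(seg, caption_segment(video_feats, seg)) for seg in segments]

def sketch_then_ground(video_feats,
                       generate_story: Callable[[object], List[str]],
                       ground_sentence: Callable[[object, str], Segment]) -> List[Tuple[Segment, str]]:
    """Top-down pipeline in the spirit of Deng et al. (2021): first generate a
    video-level story, then ground every sentence back to a temporal segment."""
    story = generate_story(video_feats)                          # paragraph-level caption
    return [(ground_sentence(video_feats, sent), sent) for sent in story]
```

The difference is only in the order of the two sub-problems: bottom-up methods commit to event boundaries before captioning, while the top-down route lets the generated paragraph drive where the segments are grounded.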
“…Another line of work (Deng et al., 2021; Wang et al., 2021) removes the explicit event proposing process. Deng et al. (2021) tackle the DVC task from a top-down perspective, in which they first generate a video-level story, then ground each sentence in the story into a video segment. Wang et al. (2021) consider the DVC task as a set prediction problem and apply two parallel prediction heads for event localization and captioning.…”
Section: Multimodal Transformer (mentioning)
confidence: 99%
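The set-prediction view mentioned above (Wang et al., 2021) can also be sketched briefly. The module below is a hedged illustration of the general idea of parallel localization and captioning heads over learned event queries; the dimensions, vocabulary size, toy captioning head, and class name are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): DVC as set prediction with two
# parallel output heads over N learned event queries. All sizes are assumptions.

import torch
import torch.nn as nn

class ParallelDVCHeads(nn.Module):
    def __init__(self, d_model: int = 256, num_queries: int = 10,
                 vocab_size: int = 1000, max_words: int = 20):
        super().__init__()
        self.event_queries = nn.Embedding(num_queries, d_model)   # learned event slots
        self.localization_head = nn.Linear(d_model, 2)             # (center, length) per slot
        # Toy captioner: one word distribution per output position (a real model
        # would use an autoregressive decoder conditioned on the video features).
        self.caption_head = nn.Linear(d_model, max_words * vocab_size)
        self.max_words, self.vocab_size = max_words, vocab_size

    def forward(self, video_memory: torch.Tensor):
        # video_memory: (batch, time, d_model) encoded video features
        b = video_memory.size(0)
        queries = self.event_queries.weight.unsqueeze(0).expand(b, -1, -1)
        # A real model would refine the queries against video_memory with a
        # transformer decoder; here only the two parallel heads are kept.
        spans = self.localization_head(queries).sigmoid()                        # (b, N, 2)
        word_logits = self.caption_head(queries).view(b, -1, self.max_words, self.vocab_size)
        return spans, word_logits   # events localized and captioned in parallel
```

The key design choice this sketch reflects is that localization and captioning are predicted jointly per event slot, so no separate proposal stage is needed.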
“…Typical DVC models [30, 43-45, 60, 74, 81, 82] follow a two-stage, bottom-up paradigm: first parse a video into several temporal events and then decode a description from each detected event. As the problem of event detection is ill-defined [10], some alternative solutions either adopt a single-stage strategy to simultaneously predict events and descriptions [35, 71], or turn to a top-down regime: first generate paragraphs, and then ground each description to a video segment [10, 37]. A few other methods [23, 32, 50] focus purely on generating better paragraph captions from a provided list of events.…”
Section: Related Work (mentioning)
confidence: 99%
“…Some tasks use joint image and language inputs, such as captioning or VQA [57, 7, 8, 1, 18, 20], while others learn a joint image-language embedding space, which enables cross-modal retrieval [25, 51, 60]. Additional modalities are also often incorporated in such models, for example video, audio, or additional text such as transcriptions or video captions [70, 55, 12, 22, 35].…”
Section: Related Work (mentioning)
confidence: 99%