2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.00675

Streamlined Dense Video Captioning

Abstract: Dense video captioning is an extremely challenging task since accurate and coherent description of events in a video requires holistic understanding of video contents as well as contextual reasoning of individual events. Most existing approaches handle this problem by first detecting event proposals from a video and then captioning on a subset of the proposals. As a result, the generated sentences are prone to be redundant or inconsistent since they fail to consider temporal dependency between events. To tackl…
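
As a rough point of reference, the conventional "propose, then caption" pipeline that the abstract criticizes can be sketched in a few lines of Python. This is an illustrative outline only, not the authors' code: the callables propose_events and caption_clip and the fixed top-k selection are assumptions made purely for clarity.

    def dense_caption_baseline(features, propose_events, caption_clip, top_k=10):
        """Two-stage baseline: detect event proposals, caption each one independently.

        features:        per-clip feature sequence (e.g. a list of feature vectors)
        propose_events:  callable returning [(start, end, score), ...]
        caption_clip:    callable mapping a feature slice to a sentence
        """
        proposals = propose_events(features)
        # Keep only the highest-scoring proposals.
        selected = sorted(proposals, key=lambda p: p[2], reverse=True)[:top_k]

        captions = []
        for start, end, _ in selected:
            # Each event is captioned in isolation; the model never sees what was
            # already said about earlier events, which is the source of the
            # redundant or inconsistent sentences the abstract points out.
            captions.append((start, end, caption_clip(features[start:end])))
        return captions

The key point of the sketch is that the loop body has no access to previously generated sentences, which is exactly the limitation the paper targets.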

Cited by 131 publications (97 citation statements)
References 32 publications
“…However, the language model still has a large performance gap from humans in cases such as small-object recognition or object recognition at lower resolutions. Similar to the work of Krishna et al [27], Mun et al [64] proposed a dense video captioning framework that explicitly models temporal dependency across events in a video and leverages visual and linguistic context from prior events for coherent storytelling. They use a Single-Stream Temporal Action proposal model to obtain event proposals in a single scan; a pointer network (PtrNet) then selects the highly correlated events that make up an episode, and these are fed into a sequential captioning network that generates captions with RNNs.…”
Section: B. Video Captioning Methodologies
confidence: 99%
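
The pipeline described in the statement above (single-scan SST-style proposals, pointer-network selection of an episode, then sequential captioning that carries context forward) can be sketched roughly as follows. This is a minimal sketch under assumed interfaces, not the authors' implementation: sst_proposals, pointer_select and caption_event are hypothetical callables standing in for the actual modules.

    def streamlined_dense_captioning(features, sst_proposals, pointer_select, caption_event):
        """Sketch of the described pipeline; all interfaces are assumed, not the paper's API."""
        # A single forward scan over the video yields candidate (start, end) event proposals.
        candidates = sst_proposals(features)
        # A pointer-network-style selector picks an ordered subset of highly
        # correlated events that together form an "episode".
        episode = pointer_select(features, candidates)

        captions, context = [], None
        for start, end in episode:
            # Each sentence is produced by an RNN captioner conditioned on the event's
            # visual features and on the context carried over from the previous caption,
            # which is what keeps the episode's story coherent and non-redundant.
            sentence, context = caption_event(features[start:end], context)
            captions.append((start, end, sentence))
        return captions

Compared with the baseline sketch after the abstract, the structural changes are the pointer network that orders the selected events and the context variable threaded through the captioning loop.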
[Citation excerpt garbled in extraction (Chinese-language survey). The recoverable content is a chronological listing of video captioning methods: FGM [194], LSTM-YT [206], TA [58], S2VT [24], h-RNN [231], HRNE [25], GRU-RCN [233], LSTM-TSA [58], SCN-LSTM [235], BAE [201], VideoLab [241], TDDF [240], DenseVidCap [238], PickNet [234], HRL [216], RecNet [236], MMM [237], GRU-EVE [18], MA-RNN [230], OA-BTG [239], SibNet [232], CAM-RNN [210].]
Section: 4
confidence: 99%
“…In this study, the video facts are compared with the subtitle sentence by referencing the nouns of the sentence in one of the frames of the video. In [22], a new dense video captioning framework was proposed, which explicitly models the temporal dependency of occurrences in the video and utilizes visual and linguistic contexts for coherent storytelling from past events. From the detailed literature survey, it is observed that video content analysis and retrieval based on video storytelling techniques has not been addressed.…”
Section: Related Work
confidence: 99%