2017 IEEE International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2017.83
Dense-Captioning Events in Videos

Abstract: Most natural videos contain numerous events. For example, in a video of a "man playing a piano", the video might also contain "another man dancing" or "a crowd clapping". We introduce the task of dense-captioning events, which involves both detecting and describing events in a video. We propose a new model that is able to identify all events in a single pass of the video while simultaneously describing the detected events with natural language. Our model introduces a variant of an existing proposal module that…
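The single-pass design described in the abstract can be pictured as a proposal module that scores candidate event intervals over a stream of clip features, with each confident proposal's hidden state seeding a caption decoder. The PyTorch sketch below illustrates that reading only; every name and dimension (EventProposal, EventCaptioner, the 500-d clip features, the anchor count) is an illustrative assumption, not the paper's implementation.

import torch
import torch.nn as nn

class EventProposal(nn.Module):
    # Scores num_anchors candidate interval lengths at every time step
    # (hypothetical module; the paper's proposal variant is more involved).
    def __init__(self, feat_dim=500, hidden=512, num_anchors=4):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.score = nn.Linear(hidden, num_anchors)

    def forward(self, feats):                    # feats: (B, T, feat_dim)
        h, _ = self.rnn(feats)                   # h: (B, T, hidden)
        return torch.sigmoid(self.score(h)), h   # per-step anchor scores, states

class EventCaptioner(nn.Module):
    # Decodes a token sequence conditioned on one proposal's hidden state.
    def __init__(self, hidden=512, vocab=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, event_state, tokens):      # event_state: (B, hidden)
        x = self.embed(tokens)                   # tokens: (B, L) int64
        h, _ = self.rnn(x, event_state.unsqueeze(0))
        return self.out(h)                       # (B, L, vocab) logits

# One forward pass over precomputed clip features scores proposals at
# every time step; a confident (time, anchor) pair picks an event whose
# state conditions the captioner.
feats = torch.randn(1, 120, 500)                 # e.g. 120 clip features
proposer, captioner = EventProposal(), EventCaptioner()
scores, states = proposer(feats)                 # scores: (1, 120, 4)
t = int(scores.max(dim=2).values.argmax())       # most confident time step
logits = captioner(states[:, t], torch.zeros(1, 8, dtype=torch.long))

The actual model builds on an existing proposal module (as the truncated abstract notes) and also conditions each caption on the context of surrounding events; both are omitted here for brevity.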

Cited by 1,052 publications (1,016 citation statements)
References 66 publications
“…The objective is to extract a set of informative frames in order to briefly summarize the video content. (3) The video caption datasets are annotated with described sentences or phrases, which can be based on either a trimmed video [75], [77] or different segments of a long video [35]. Our COIN is relevant to the above-mentioned datasets, as it requires localizing the temporal boundaries of important steps corresponding to a task.…”
Section: Datasets Related To Instructional Video Analysis
Mentioning (confidence: 99%)
“…The toolbox has two modes: frame mode and video mode. The frame mode is newly developed for efficient annotation, while the video mode is frequently used in previous works [35]. We have evaluated the annotation time on a small set of COIN, which contains 25 videos of 7 tasks.…”
Section: Appendix B: Annotation Time Cost Analysis
Mentioning (confidence: 99%)
“…video retrieval by language, a large number of datasets have been proposed [36,1,26,19,31,30,29,33]. ActivityNet Captions [19] is a dataset with dense captions describing videos from ActivityNet [3], which can facilitate tasks such as video retrieval and temporal localization with language queries. Large Scale Movie Description Challenge (LSMDC) [26] consists of short clips from movies described by natural language.…”
Section: Related Work
Mentioning (confidence: 99%)
“…(2) The descriptions are rich, with over 100 words per paragraph. Figure 2 compares ActivityNet Captions [19] with the MSA dataset using examples. We can see that the descriptions in MSA are generally much richer and at a higher level, e.g.…”
Section: MSA Dataset
Mentioning (confidence: 99%)
“…However, we are proposing a textual form of multimodal representation. There are two closely related tasks in the domain of machine learning: visual question-answering (VQA) [5] and dense video captioning [6]. In these tasks, the model is trained to take in the video input and output a sequence of text that describes the video or answers questions.…”
Section: Textual Multimodal Representation
Mentioning (confidence: 99%)