2013 IEEE International Conference on Computer Vision 2013
DOI: 10.1109/iccv.2013.61

Translating Video Content to Natural Language Descriptions

Abstract: Humans use rich natural language to describe and communicate visual perceptions. In order to provide natural language descriptions for visual content, this paper combines two important ingredients. First, we generate a rich semantic representation of the visual content including e.g. object and activity labels. To predict the semantic representation we learn a CRF to model the relationships between different components of the visual input. And second, we propose to formulate the generation of natural language …

Cited by 356 publications
(275 citation statements)
References 18 publications
“…Fourth, we allow for some creativity in the generation process which produces more humanlike descriptions than a closely related recent approach that derived annotation directly from computer vision inputs (22). Fifth, we presented our work for variety of scene settings instead of single scene setting (9,34). Finally, many recent studies concerned description of a single image frame (36,48).…”
Section: Related Work (mentioning)
confidence: 99%
“…Although details differ, [13][14][15][16] operate within the same paradigm of content selection and surface realisation. Compared to image description, typically video description systems additionally employ spatiotemporal methods for action recognition.…”
Section: Related Work (mentioning)
confidence: 99%
“…In parallel to image captioning, automatic video description is also receiving increasing attention [13][14][15][16]. Although details differ, [13][14][15][16] operate within the same paradigm of content selection and surface realisation.…”
Section: Related Work (mentioning)
confidence: 99%