2013 IEEE Conference on Computer Vision and Pattern Recognition
DOI: 10.1109/cvpr.2013.340

A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching

Abstract: The problem of describing images through natural language has gained importance in the computer vision community. Solutions to image description have either focused on a top-down approach of generating language through combinations of object detections and language models or bottom-up propagation of keyword tags from training images to test images through probabilistic or nearest neighbor techniques. In contrast, describing videos with natural language is a less studied problem. In this paper, we combine ideas…

Cited by 262 publications (186 citation statements)
References 23 publications
“…We call this process sentence directed video object codiscovery. It can be viewed as the inverse of video captioning/description (Barbu et al 2012; Das et al 2013; Guadarrama et al 2013; Rohrbach et al 2014; Venugopalan et al 2015; Yu et al 2015, 2016) where object evidence (in the form of detections or other visual features) is first produced by pretrained detectors and then sentences are generated given the object appearance and movement.…”
Section: Fig (mentioning)
confidence: 99%
“…Although details differ, [13][14][15][16] operate within the same paradigm of content selection and surface realisation. Compared to image description, typically video description systems additionally employ spatiotemporal methods for action recognition.…”
Section: Related Work (mentioning)
confidence: 99%
“…In parallel to image captioning, automatic video description is also receiving increasing attention [13][14][15][16]. Although details differ, [13][14][15][16] operate within the same paradigm of content selection and surface realisation.…”
Section: Related Work (mentioning)
confidence: 99%