A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching

Das, Pradipto; Xu, Chenliang; Doell, Richard F.; Corso, Jason J.

doi:10.1109/cvpr.2013.340

Cited by 262 publications

(186 citation statements)

References 23 publications

Supporting

Mentioning

186

Contrasting


“…We call this process sentence directed video object codiscovery. It can be viewed as the inverse of video captioning/description (Barbu et al 2012; Das et al 2013; Guadarrama et al 2013; Rohrbach et al 2014; Venugopalan et al 2015; Yu et al 2015, 2016) where object evidence (in the form of detections or other visual features) is first produced by pretrained detectors and then sentences are generated given the object appearance and movement.…”

Section: Fig (mentioning)

confidence: 99%

Sentence Directed Video Object Codiscovery

Yu,

Siskind

2017

Int J Comput Vis


Video object codiscovery can leverage the weak semantic constraint implied by sentences that describe the video content. Our codiscovery method, like other object codetection techniques, does not employ any pretrained object models or detectors. Unlike most prior work that focuses on codetecting large objects which are usually salient both in size and appearance, our method can discover small or medium sized objects as well as ones that may be occluded for part of the video. More importantly, our method can codiscover multiple object instances of different classes within a single video clip. Although the semantic information employed is usually simple and weak, it can greatly boost performance by constraining the hypothesized object locations. Experiments show promising results on three datasets: an average IoU score of 0.423 on a new dataset with 15 object


“…Although details differ, [13][14][15][16] operate within the same paradigm of content selection and surface realisation. Compared to image description, typically video description systems additionally employ spatiotemporal methods for action recognition.…”

Section: Related Work (mentioning)

confidence: 99%

“…In parallel to image captioning, automatic video description is also receiving increasing attention [13][14][15][16]. Although details differ, [13][14][15][16] operate within the same paradigm of content selection and surface realisation.…”

Section: Related Work (mentioning)

confidence: 99%

“…One of the main issues with the work in [4][5][6][7][8][9][10][11][12][13][14][15][16], however, is the lack of automatic and objective evaluation metric. In [17] the problem of generating natural language description for a given image is relaxed to one of ranking a set of human-written captions, by assuming the set contains the original (human-written) caption of the image.…”

Section: Introduction (mentioning)

confidence: 99%

“…face recognition from caption-based supervision [1], text-to-image coreference [2], and zero-shot visual learning using purely textural description [3]. In particular, generating natural language description for image and video has attracted much interest in both CV and NLP communities [4][5][6][7][8][9][10][11][12][13][14][15][16].…”

Section: Introduction (mentioning)

confidence: 99%


Leveraging High Level Visual Information for Matching Images and Captions

Yan

Mikolajczyk

2015

Computer Vision – ACCV 2014


Abstract. In this paper we investigate the problem of matching images and captions. We exploit the kernel canonical correlation analysis (KCCA) to learn a similarity between images and texts. We then propose methods to build improved visual and text kernels. The visual kernels are based on visual classifiers that use responses of a deep convolutional neural network as features, and the text kernel improves the Bag-of-Words (BoW) representation by learning a vision based lexical similarity between words. We consider two application scenarios, one where only an external image set weakly related to the evaluation dataset is available for training the visual classifiers, and one where visual data closely related to the evaluation set can be used. We evaluate our visual and text kernels on a large and publicly available benchmark, where we show that our proposed methods substantially improve upon the state-of-the-art.


Deep learning and knowledge graph for image/video captioning: A review of datasets, evaluation metrics, and methods

Wajid,

Terashima‐Marin,

Najafirad

et al. 2023

Engineering Reports


Generating an image/video caption has always been a fundamental problem of Artificial Intelligence, which is usually performed using the potential of Deep Learning Methods, Computer Vision, Knowledge Graphs, and Natural Language Processing (NLP). The significant task of image/video captioning is to describe visual content in terms of natural language. Due to a semantic gap, this presents a massive problem in understanding and explaining images or videos syntactically and semantically. The current systems need somewhere to fill the gap between low‐level and high‐level features while mapping. Therefore, to tackle this problem, there is a need to describe the latest research and methods to overcome difficulties and to propose effective solutions. This work thoroughly analyses and investigates the most related methods (deep learning and knowledge graph‐based approaches), benchmark datasets, and evaluation metrics with their benefits and limitations. Here we have also reviewed the state‐of‐the‐art methods related to image/video captioning and their applications in the current scenario. Finally, we provide thorough information on existing research with comparisons of results on benchmark datasets. We have also mentioned the existing challenges and future direction of research.

