Translating Video Content to Natural Language Descriptions

Rohrbach, Marcus; Qiu, Wei; Titov, Ivan; Thater, Stefan; Pinkal, Manfred; Schiele, Bernt

doi:10.1109/iccv.2013.61

Cited by 356 publications

(275 citation statements)

References 18 publications

Mentioning: 274

“…Fourth, we allow for some creativity in the generation process which produces more humanlike descriptions than a closely related recent approach that derived annotation directly from computer vision inputs (22). Fifth, we presented our work for variety of scene settings instead of single scene setting (9,34). Finally, many recent studies concerned description of a single image frame (36,48).…”

Section: Related Work (mentioning)

confidence: 99%

Generating natural language tags for video information management

Khan

Gotoh

2017

Machine Vision and Applications


This exploratory work is concerned with generation of natural language descriptions that can be used for video retrieval applications. It is a step ahead of keyword based tagging as it captures relations between keywords associated with videos. Firstly we prepare hand annotations consisting of descriptions for video segments crafted from a TREC Video dataset. Analysis of this data presents insights into human's interests on video contents. Secondly we develop a framework for creating smooth and coherent description of video streams. It builds on conventional image processing techniques that extract high level features from individual video frames. Natural language description is then produced based on high level features. Although feature extraction processes are erroneous at various levels, we explore approaches to putting them together to produce a coherent, smooth and well phrased description by incorporating spatial and temporal information. Evaluation is made by calculating ROUGE scores between human annotated and machine generated descriptions. Further we introduce a task based evaluation by human subjects which provides qualitative evaluation of generated descriptions.


“…Although details differ, [13][14][15][16] operate within the same paradigm of content selection and surface realisation. Compared to image description, typically video description systems additionally employ spatiotemporal methods for action recognition.…”

Section: Related Work (mentioning)

confidence: 99%

“…In parallel to image captioning, automatic video description is also receiving increasing attention [13][14][15][16]. Although details differ, [13][14][15][16] operate within the same paradigm of content selection and surface realisation.…”

Section: Related Work (mentioning)

confidence: 99%

“…One of the main issues with the work in [4][5][6][7][8][9][10][11][12][13][14][15][16], however, is the lack of automatic and objective evaluation metric. In [17] the problem of generating natural language description for a given image is relaxed to one of ranking a set of human-written captions, by assuming the set contains the original (human-written) caption of the image.…”

Section: Introduction (mentioning)

confidence: 99%

“…face recognition from caption-based supervision [1], text-to-image coreference [2], and zero-shot visual learning using purely textural description [3]. In particular, generating natural language description for image and video has attracted much interest in both CV and NLP communities [4][5][6][7][8][9][10][11][12][13][14][15][16].…”

Section: Introduction (mentioning)

confidence: 99%


Leveraging High Level Visual Information for Matching Images and Captions

Yan

Mikolajczyk

2015

Computer Vision – ACCV 2014


Abstract. In this paper we investigate the problem of matching images and captions. We exploit the kernel canonical correlation analysis (KCCA) to learn a similarity between images and texts. We then propose methods to build improved visual and text kernels. The visual kernels are based on visual classifiers that use responses of a deep convolutional neural network as features, and the text kernel improves the Bag-of-Words (BoW) representation by learning a vision based lexical similarity between words. We consider two application scenarios, one where only an external image set weakly related to the evaluation dataset is available for training the visual classifiers, and one where visual data closely related to the evaluation set can be used. We evaluate our visual and text kernels on a large and publicly available benchmark, where we show that our proposed methods substantially improve upon the state-of-the-art.


Deep learning‐based body part recognition algorithm for three‐dimensional medical images

et al. 2022


Background: The automatic recognition of human body parts in three‐dimensional medical images is important in many clinical applications. However, methods presented in prior studies have mainly classified each two‐dimensional (2D) slice independently rather than recognizing a batch of consecutive slices as a specific body part.

Purpose: In this study, we aim to develop a deep learning‐based method designed to automatically divide computed tomography (CT) and magnetic resonance imaging (MRI) scans into five consecutive body parts: head, neck, chest, abdomen, and pelvis.

Methods: A deep learning framework was developed to recognize body parts in two stages. In the first preclassification stage, a convolutional neural network (CNN) using the GoogLeNet Inception v3 architecture and a long short‐term memory (LSTM) network were combined to classify each 2D slice; the CNN extracted information from a single slice, whereas the LSTM employed rich contextual information among consecutive slices. In the second postprocessing stage, the input scan was further partitioned into consecutive body parts by identifying the optimal boundaries between them based on the slice classification results of the first stage. To evaluate the performance of the proposed method, 662 CT and 1434 MRI scans were used.

Results: Our method achieved a very good performance in 2D slice classification compared with state‐of‐the‐art methods, with overall classification accuracies of 97.3% and 98.2% for CT and MRI scans, respectively. Moreover, our method further divided whole scans into consecutive body parts with mean boundary errors of 8.9 and 3.5 mm for CT and MRI data, respectively.

Conclusions: The proposed method significantly improved the slice classification accuracy compared with state‐of‐the‐art methods, and further accurately divided CT and MRI scans into consecutive body parts based on the results of slice classification. The developed method can be employed as an important step in various computer‐aided diagnosis and medical image analysis schemes.

