Human Focused Video Description

Khan, Muhammad Usman Ghani; Zhang, Lei; Gotoh, Yoshihiko

doi:10.1109/iccvw.2011.6130425

Cited by 34 publications

(25 citation statements)

References 18 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Using a set of templates they generate text for their SR. Similarly, [11] uses templates to describe videos on the TREC Video summarization task. [3] follows a different route and uses a topic model to jointly model textual and visual words and a tripartite graph based on object/concept detectors.…”

Section: Related Workmentioning

confidence: 99%

“…While for descriptions of images, recent approaches have proposed to statistically model the conversion from images to text [5,15,16,18], most approaches for video description use rules and templates to generated video descriptions [14,9,2,10,11,26,3,8]. Although these works have started exploring the domain of describing visual content, important research questions remain: (1) How to best approach the conversion from visual information to linguistic expressions?…”

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

Translating Video Content to Natural Language Descriptions

Rohrbach

Qiu

Titov

et al. 2013

2013 IEEE International Conference on Computer Vision

356

267

View full text Add to dashboard Cite

Humans use rich natural language to describe and communicate visual perceptions. In order to provide natural language descriptions for visual content, this paper combines two important ingredients. First, we generate a rich semantic representation of the visual content including e.g. object and activity labels. To predict the semantic representation we learn a CRF to model the relationships between different components of the visual input. And second, we propose to formulate the generation of natural language as a machine translation problem using the semantic representation as source language and the generated sentences as target language. For this we exploit the power of a parallel corpus of videos and textual descriptions and adapt statistical machine translation to translate between our two languages. We evaluate our video descriptions on the TACoS dataset [23], which contains video snippets aligned with sentence descriptions. Using automatic evaluation and human judgments we show significant improvements over several baseline approaches, motivated by prior work. Our translation approach also shows improvements over related work on an image description task.

show abstract

Section: Related Workmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

Translating Video Content to Natural Language Descriptions

Rohrbach

Qiu

Titov

et al. 2013

2013 IEEE International Conference on Computer Vision

356

267

View full text Add to dashboard Cite

show abstract

“…Recent approaches to describing video with natural language have made use of templates, retrieval, or language models [11], [59], [60], [60], [61], [62], [63], [64]. To our knowledge, we present the first application of deep models to the video description task.…”

Section: Prior Workmentioning

confidence: 99%

Long-term Recurrent Convolutional Networks for Visual Recognition and Description

Donahue¹,

Hendricks²,

Rohrbach³

et al. 2014

2,471

2,851

View full text Add to dashboard Cite

Abstract-Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent are effective for tasks involving sequences, visual and otherwise. We describe a class of recurrent convolutional architectures which is end-to-end trainable and suitable for large-scale visual understanding tasks, and demonstrate the value of these models for activity recognition, image captioning, and video description. In contrast to previous models which assume a fixed visual representation or perform simple temporal averaging for sequential processing, recurrent convolutional models are "doubly deep" in that they learn compositional representations in space and time. Learning long-term dependencies is possible when nonlinearities are incorporated into the network state updates. Differentiable recurrent models are appealing in that they can directly map variable-length inputs (e.g., videos) to variable-length outputs (e.g., natural language text) and can model complex temporal dynamics; yet they can be optimized with backpropagation. Our recurrent sequence models are directly connected to modern visual convolutional network models and can be jointly trained to learn temporal dynamics and convolutional perceptual representations. Our results show that such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined or optimized.

show abstract

“…Finally, the method applies syntactic rules to generate a whole sentence. Following this work, a series of studies are conducted to utilize such technique to enhance different multimedia applications [2], [3], [4]. And there are some works tackle the problem with probabilistic graphical model.…”

Section: B Visual Captioningmentioning

confidence: 99%

From Deterministic to Generative: Multimodal Stochastic RNNs for Video Captioning

Song

Guo

Gao

et al. 2019

IEEE Trans. Neural Netw. Learning Syst.

200

View full text Add to dashboard Cite

Video captioning, in essential, is a complex natural process, which is affected by various uncertainties stemming from video content, subjective judgment, and so on. In this paper, we build on the recent progress in using encoder-decoder framework for video captioning and address what we find to be a critical deficiency of the existing methods that most of the decoders propagate deterministic hidden states. Such complex uncertainty cannot be modeled efficiently by the deterministic models. In this paper, we propose a generative approach, referred to as multimodal stochastic recurrent neural networks (MS-RNNs), which models the uncertainty observed in the data using latent stochastic variables. Therefore, MS-RNN can improve the performance of video captioning and generate multiple sentences to describe a video considering different random factors. Specifically, a multimodal long short-term memory (LSTM) is first proposed to interact with both visual and textual features to capture a high-level representation. Then, a backward stochastic LSTM is proposed to support uncertainty propagation by introducing latent variables. Experimental results on the challenging data sets, microsoft video description and microsoft research video-to-text, show that our proposed MS-RNN approach outperforms the state-of-the-art video captioning benchmarks.

show abstract

Human Focused Video Description

Cited by 34 publications

References 18 publications

Translating Video Content to Natural Language Descriptions

Translating Video Content to Natural Language Descriptions

Long-term Recurrent Convolutional Networks for Visual Recognition and Description

From Deterministic to Generative: Multimodal Stochastic RNNs for Video Captioning

Contact Info

Product

Resources

About