Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015
DOI: 10.3115/v1/n15-1173

Translating Videos to Natural Language Using Deep Recurrent Neural Networks

Abstract: Solving the visual symbol grounding problem has long been a goal of artificial intelligence. The field appears to be advancing closer to this goal with recent breakthroughs in deep learning for natural language grounding in static images. In this paper, we propose to translate videos directly to sentences using a unified deep neural network with both convolutional and recurrent structure. Described video datasets are scarce, and most existing methods have been applied to toy domains with a small vocabulary of …
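The authors' released code is not reproduced here, but the core idea the abstract describes can be sketched briefly: per-frame CNN features are mean-pooled into a single fixed-length vector, and a recurrent decoder conditioned on that vector emits the sentence word by word. The following is a minimal PyTorch sketch under those assumptions; the framework choice, layer sizes, and all names such as `VideoCaptioner` are illustrative, not the paper's implementation.

```python
# Minimal sketch (not the authors' code): mean-pool per-frame CNN features,
# then condition an LSTM language model on the pooled vector.
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    def __init__(self, feat_dim=4096, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, embed_dim)   # project pooled video feature
        self.embed = nn.Embedding(vocab_size, embed_dim)  # word embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)      # logits over the vocabulary

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, n_frames, feat_dim) from a pretrained image CNN
        video = self.feat_proj(frame_feats.mean(dim=1))   # mean-pool over frames
        words = self.embed(captions)                      # (batch, seq_len, embed_dim)
        # prepend the video vector as the first input step to condition generation
        inputs = torch.cat([video.unsqueeze(1), words], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                           # next-word logits per step
```

Training would then minimize cross-entropy between these logits and the reference caption, shifted by one position.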

Cited by 729 publications (528 citation statements). References 39 publications.
“…Previous language and vision studies focused on the development of multimodal word and sentence representations (Bruni et al., 2012; Socher et al., 2013; Silberer and Lapata, 2014; Gong et al., 2014; Lazaridou et al., 2015), as well as methods for describing images and videos in natural language (Farhadi et al., 2010; Kulkarni et al., 2011; Mitchell et al., 2012; Socher et al., 2014; Thomason et al., 2014; Karpathy and Fei-Fei, 2014; Siddharth et al., 2014; Venugopalan et al., 2015; Vinyals et al., 2015). While these studies handle important challenges in multimodal processing of language and vision, they do not provide explicit modeling of linguistic ambiguities.…”
Section: Related Work
confidence: 99%
“…The degree of automation is significantly limited. While there is active research [36,37,38] on generating natural language descriptions of videos using object detection, text mining, and similar techniques, research on generating such descriptions by understanding object motions has stalled for more than ten years because of the difficulty of producing natural language automatically.…”
Section: Introduction
confidence: 99%
“…in video clips, then generates sentences with a fixed template. Some methods use a two-step approach [11][12][13]: the first step produces a fixed-length vector representation of the frames by extracting features from a CNN, and the second step decodes that vector into a sequence of words describing the video clip.…”
Section: Introduction
confidence: 99%
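The two-step scheme quoted in the last statement separates encoding from decoding, so the CNN pass over the frames can be run once offline and its features cached. A hypothetical greedy-decoding sketch for the second step, reusing the assumed `VideoCaptioner` fields from the earlier snippet (`feat_proj`, `embed`, `lstm`, `out`), might look like:

```python
# Hypothetical illustration of the two-step scheme: step 1 yields a
# fixed-length video vector; step 2 decodes it greedily into words.
import torch

@torch.no_grad()
def greedy_decode(model, frame_feats, bos_id, eos_id, max_len=20):
    # Step 1: collapse per-frame CNN features into one fixed-length vector.
    video = model.feat_proj(frame_feats.mean(dim=1)).unsqueeze(1)  # (1, 1, embed_dim)
    # Prime the LSTM state with the video vector.
    _, state = model.lstm(video)
    # Step 2: emit one word at a time, always taking the most likely token.
    token = torch.tensor([[bos_id]])
    words = []
    for _ in range(max_len):
        out, state = model.lstm(model.embed(token), state)
        token = model.out(out).argmax(dim=-1)
        if token.item() == eos_id:
            break
        words.append(token.item())
    return words
```

Because decoding only touches the cached vector, re-running generation (e.g., with a different vocabulary cutoff or sentence length limit) never repeats the expensive CNN forward passes over the frames.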