Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015
DOI: 10.3115/v1/n15-1173

Translating Videos to Natural Language Using Deep Recurrent Neural Networks

Abstract: Solving the visual symbol grounding problem has long been a goal of artificial intelligence. The field appears to be advancing closer to this goal with recent breakthroughs in deep learning for natural language grounding in static images. In this paper, we propose to translate videos directly to sentences using a unified deep neural network with both convolutional and recurrent structure. Described video datasets are scarce, and most existing methods have been applied to toy domains with a small vocabulary of …
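The authors' released code is not reproduced here, but the core idea the abstract describes can be sketched briefly: per-frame CNN features are mean-pooled into a single fixed-length vector, and a recurrent decoder conditioned on that vector emits the sentence word by word. The following is a minimal PyTorch sketch under those assumptions; the framework choice, layer sizes, and all names such as `VideoCaptioner` are illustrative, not the paper's implementation.

```python
# Minimal sketch (not the authors' code): mean-pool per-frame CNN features,
# then condition an LSTM language model on the pooled vector.
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    def __init__(self, feat_dim=4096, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, embed_dim)   # project pooled video feature
        self.embed = nn.Embedding(vocab_size, embed_dim)  # word embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)      # logits over the vocabulary

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, n_frames, feat_dim) from a pretrained image CNN
        video = self.feat_proj(frame_feats.mean(dim=1))   # mean-pool over frames
        words = self.embed(captions)                      # (batch, seq_len, embed_dim)
        # prepend the video vector as the first input step to condition generation
        inputs = torch.cat([video.unsqueeze(1), words], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                           # next-word logits per step
```

Training would then minimize cross-entropy between these logits and the reference caption, shifted by one position.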

Cited by 729 publications (528 citation statements). References 39 publications.
“…Previous language and vision studies focused on the development of multimodal word and sentence representations (Bruni et al., 2012; Socher et al., 2013; Silberer and Lapata, 2014; Gong et al., 2014; Lazaridou et al., 2015), as well as methods for describing images and videos in natural language (Farhadi et al., 2010; Kulkarni et al., 2011; Mitchell et al., 2012; Socher et al., 2014; Thomason et al., 2014; Karpathy and Fei-Fei, 2014; Siddharth et al., 2014; Venugopalan et al., 2015; Vinyals et al., 2015). While these studies handle important challenges in multimodal processing of language and vision, they do not provide explicit modeling of linguistic ambiguities.…”
Section: Related Work
confidence: 99%
“…The degree of automation is significantly limited. While there is active research [36,37,38] on generating natural language descriptions of videos using object detection, text mining, and similar techniques, research on generating such descriptions by understanding object motions has stalled for more than ten years because of the difficulty of producing natural language automatically.…”
Section: Introduction
confidence: 99%
“…in video clips, then generates sentences with a fixed template. Some methods use a two-step approach [11][12][13]: the first step produces a fixed-length vector representation of the frames by extracting features from a CNN, and the second step decodes that vector into a sequence of words describing the video clip.…”
Section: Introduction
confidence: 99%
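The two-step scheme quoted in the last statement separates encoding from decoding, so the CNN pass over the frames can be run once offline and its features cached. A hypothetical greedy-decoding sketch for the second step, reusing the assumed `VideoCaptioner` fields from the earlier snippet (`feat_proj`, `embed`, `lstm`, `out`), might look like:

```python
# Hypothetical illustration of the two-step scheme: step 1 yields a
# fixed-length video vector; step 2 decodes it greedily into words.
import torch

@torch.no_grad()
def greedy_decode(model, frame_feats, bos_id, eos_id, max_len=20):
    # Step 1: collapse per-frame CNN features into one fixed-length vector.
    video = model.feat_proj(frame_feats.mean(dim=1)).unsqueeze(1)  # (1, 1, embed_dim)
    # Prime the LSTM state with the video vector.
    _, state = model.lstm(video)
    # Step 2: emit one word at a time, always taking the most likely token.
    token = torch.tensor([[bos_id]])
    words = []
    for _ in range(max_len):
        out, state = model.lstm(model.embed(token), state)
        token = model.out(out).argmax(dim=-1)
        if token.item() == eos_id:
            break
        words.append(token.item())
    return words
```

Because decoding only touches the cached vector, re-running generation (e.g., with a different vocabulary cutoff or sentence length limit) never repeats the expensive CNN forward passes over the frames.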