2017
DOI: 10.1109/tpami.2016.2587640

Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge

Abstract: Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments o…
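For context, the likelihood objective referred to in the abstract is the standard autoregressive factorization over caption words: with image I, caption S = (S_1, …, S_N), and model parameters θ, training maximizes (roughly in the paper's notation)

\log p(S \mid I; \theta) = \sum_{t=1}^{N} \log p(S_t \mid I, S_1, \ldots, S_{t-1}; \theta),

where each conditional term is produced by the recurrent decoder, so the loss reduces to summed next-word cross-entropy.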

Cited by 799 publications (597 citation statements)
References 36 publications

“…The cascaded CNN-RNN frameworks are often intended for different tasks, rather than image classification. For example, [8,45,52] employed CNN-RNN to address the image captioning task, and [50] utilized CNN-RNN to rank the tag list based on the visual importance.…”
Section: Usage of CNN-RNN Framework
Citation type: mentioning, confidence: 99%
“…The labeling process in the era of big data is fundamental in machine learning, but it is also unimaginably expensive, labor-intensive and time-consuming. To date, only a few datasets with perfect annotations exist that are well managed by science and technological enterprises or research organizations: these include MSCOCO by Microsoft [100], Flickr by Yahoo, ImageNet by Stanford [101], and MNIST by Yann LeCun's research team [102]. As a technique for reducing the cost of labeling, crowdsourcing utilizes spare online resources and accelerates labeling engineering [103,104].…”
Section: Noisy Label Learning
Citation type: mentioning, confidence: 99%
“…Vinyals et al [20] use a simpler approach, motivated by recent advances in statistical machine translation [21]. They generate image captions by transforming an image to a compact representation (a fixed embedding) via deep CNNs (convolutional neural networks) and then using an RNN (recurrent neural network), conditioned on the image and all previously predicted words, to produce sentences.…”
Section: Related Work
Citation type: mentioning, confidence: 99%
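To make the encoder-decoder pipeline described in this excerpt concrete, the following is a minimal PyTorch sketch of a CNN-to-RNN captioner. It is an illustration under assumed choices (the ResNet-18 backbone, the LSTM sizes, and the name ShowTellSketch are placeholders introduced here), not the authors' implementation, which used an Inception-style CNN with an LSTM decoder.

# Minimal sketch of a CNN-encoder / RNN-decoder captioner in the spirit of
# the approach described above. All names and dimensions are illustrative.
import torch
import torch.nn as nn
import torchvision.models as models

class ShowTellSketch(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # CNN encoder: a backbone reduced to a fixed-length image embedding.
        backbone = models.resnet18(weights=None)  # weights omitted to keep the sketch offline
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.encoder = backbone
        # RNN decoder conditioned on the image embedding and previous words.
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, H, W); captions: (B, T) integer word ids
        img_emb = self.encoder(images).unsqueeze(1)     # (B, 1, E)
        word_emb = self.word_embed(captions)            # (B, T, E)
        # Feed the image as the first decoder input, then the caption prefix.
        inputs = torch.cat([img_emb, word_emb], dim=1)  # (B, T+1, E)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                         # (B, T+1, vocab_size)

# Training maximizes the likelihood of the caption given the image,
# i.e. cross-entropy over next-word predictions.
model = ShowTellSketch(vocab_size=1000)
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 1000, (2, 12))
logits = model(images, captions)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, 1000),  # prediction at step t targets word t
    captions.reshape(-1)
)
loss.backward()

The forward pass feeds the image embedding as the first decoder input and the caption prefix after it, matching the "conditioned on the image and all previously predicted words" description above; at inference time one would instead generate words one at a time by sampling or beam search.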