2017
DOI: 10.1109/tpami.2016.2587640

Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge

Abstract: Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments o…
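For context, the likelihood objective referred to in the abstract is the standard autoregressive factorization over caption words: with image I, caption S = (S_1, …, S_N), and model parameters θ, training maximizes (roughly in the paper's notation)

\log p(S \mid I; \theta) = \sum_{t=1}^{N} \log p(S_t \mid I, S_1, \ldots, S_{t-1}; \theta),

where each conditional term is produced by the recurrent decoder, so the loss reduces to summed next-word cross-entropy.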

Cited by 799 publications (597 citation statements)
References 36 publications

“…The cascaded CNN-RNN frameworks are often intended for different tasks, rather than image classification. For example, [8,45,52] employed CNN-RNN to address the image captioning task, and [50] utilized CNN-RNN to rank the tag list based on the visual importance.…”
Section: Usage of CNN-RNN Framework
Citation type: mentioning, confidence: 99%
“…The labeling process in the era of big data is fundamental in machine learning, but it is also unimaginably expensive, labor-intensive and time-consuming. To date, only a few datasets with perfect annotations exist that are well managed by science and technological enterprises or research organizations: these include MSCOCO by Microsoft [100], Flickr by Yahoo, ImageNet by Stanford [101], and MNIST by Yann LeCun's research team [102]. As a technique for reducing the cost of labeling, crowdsourcing utilizes spare online resources and accelerates labeling engineering [103,104].…”
Section: Noisy Label Learning
Citation type: mentioning, confidence: 99%
“…Vinyals et al [20] use a simpler approach, motivated by recent advances in statistical machine translation [21]. They generate image captions by transforming an image to a compact representation (a fixed embedding) via deep CNNs (convolutional neural networks) and then using an RNN (recurrent neural network), conditioned on the image and all previously predicted words, to produce sentences.…”
Section: Related Work
Citation type: mentioning, confidence: 99%
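To make the encoder-decoder pipeline described in this excerpt concrete, the following is a minimal PyTorch sketch of a CNN-to-RNN captioner. It is an illustration under assumed choices (the ResNet-18 backbone, the LSTM sizes, and the name ShowTellSketch are placeholders introduced here), not the authors' implementation, which used an Inception-style CNN with an LSTM decoder.

# Minimal sketch of a CNN-encoder / RNN-decoder captioner in the spirit of
# the approach described above. All names and dimensions are illustrative.
import torch
import torch.nn as nn
import torchvision.models as models

class ShowTellSketch(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # CNN encoder: a backbone reduced to a fixed-length image embedding.
        backbone = models.resnet18(weights=None)  # weights omitted to keep the sketch offline
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.encoder = backbone
        # RNN decoder conditioned on the image embedding and previous words.
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, H, W); captions: (B, T) integer word ids
        img_emb = self.encoder(images).unsqueeze(1)     # (B, 1, E)
        word_emb = self.word_embed(captions)            # (B, T, E)
        # Feed the image as the first decoder input, then the caption prefix.
        inputs = torch.cat([img_emb, word_emb], dim=1)  # (B, T+1, E)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                         # (B, T+1, vocab_size)

# Training maximizes the likelihood of the caption given the image,
# i.e. cross-entropy over next-word predictions.
model = ShowTellSketch(vocab_size=1000)
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 1000, (2, 12))
logits = model(images, captions)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, 1000),  # prediction at step t targets word t
    captions.reshape(-1)
)
loss.backward()

The forward pass feeds the image embedding as the first decoder input and the caption prefix after it, matching the "conditioned on the image and all previously predicted words" description above; at inference time one would instead generate words one at a time by sampling or beam search.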