Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning

Sharma, Piyush; Ding, Nan; Goodman, Sebastian; Soricut, Radu

doi:10.18653/v1/p18-1238

Cited by 1,529 publications

(1,143 citation statements)

References 21 publications

Citation statements by type: Supporting, Mentioning (1,139), Contrasting, Unclassified


“…Second, we report the results in the unsupervised setting with independent image and language sources. We experiment with Flickr30k Images [49] paired with COCO captions and COCO images paired with Google's Conceptual Captions dataset (GCC) [52]. Finally, we show qualitative results for image descriptions with varying text sources.…”

Section: Experiments and Results (mentioning)

confidence: 99%

Towards Unsupervised Image Captioning With Shared Multimodal Embeddings

Laina, Rupprecht, Navab, 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)


Understanding images without explicit supervision has become an important problem in computer vision. In this paper, we address image captioning by generating language descriptions of scenes without learning from annotated pairs of images and their captions. The core component of our approach is a shared latent space that is structured by visual concepts. In this space, the two modalities should be indistinguishable. A language model is first trained to encode sentences into semantically structured embeddings. Image features that are translated into this embedding space can be decoded into descriptions through the same language model, similarly to sentence embeddings. This translation is learned from weakly paired images and text using a loss robust to noisy assignments and a conditional adversarial component. Our approach allows to exploit large text corpora outside the annotated distributions of image/caption data. Our experiments show that the proposed domain alignment learns a semantically meaningful representation which outperforms previous work.


“…Image captioning datasets have ignited a great deal of research at the intersection of the computer vision and natural language processing communities (Lin et al, 2014; Vinyals et al, 2015; Bernardi et al, 2016; Anderson et al, 2018). Getting annotators to provide captions works well with crowd computing, but Sharma et al (2018) exploit incidental supervision for this task to obtain greater scale with their Conceptual Captions dataset. It contains 3.3 million pairs of image and textual captions, where pairs are extracted from HTML web pages using the alt-text field of images as a starting point for their descriptions.…”

Section: Conceptual Captions (mentioning)

confidence: 99%

“…There are many sequential filtering steps for improving the quality of the captions-see Sharma et al (2018) for a thorough description. As quality control, a random sample of 4K conceptual captions were rated by human annotators, and 90.3% were judged "good" by at least 2 out of 3 raters.…”

Section: Conceptual Captions (mentioning)

confidence: 99%

Large-Scale Representation Learning from Visually Grounded Untranscribed Speech

Ilharco, Zhang, Baldridge, 2019

Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)


Systems that can associate images with their spoken audio captions are an important step towards visually grounded language learning. We describe a scalable method to automatically generate diverse audio for image captioning datasets. This supports pretraining deep networks for encoding both audio and images, which we do via a dual encoder that learns to align latent representations from both modalities. We show that a masked margin softmax loss for such models is superior to the standard triplet loss. We fine-tune these models on the Flickr8k Audio Captions Corpus and obtain state-of-the-art results, improving recall in the top 10 from 29.6% to 49.5%. We also obtain human ratings on retrieval outputs to better assess the impact of incidentally matching image-caption pairs that were not associated in the data, finding that automatic evaluation substantially underestimates the quality of the retrieved results. * Work done as a member of the Google AI Residency Program.


“…Our Objective: The main goal is to compare two approaches in using bottom-up signals: 1) FRCNN: use the default visual features from the Faster R-CNN detector; 2) Ultra: use bounding boxes from the Faster R-CNN detector, then fea-… 4 Image Captioning Dataset: We use the Conceptual Captions (CC) dataset (Sharma et al, 2018), consisting of 3.3 million training and 15,000 validation image/caption pairs. Another 12,000 image/caption pairs comprise the hidden test set.…”

Section: Features and Experimental Setup (mentioning)

confidence: 99%

“…Model: We adopt the encoder-decoder model from (Sharma et al, 2018), whose basic building block is a Transformer Network (Vaswani et al, 2017). To convert multi-modal inputs to a sequence of encoder feature vectors, we use up to three types of image features: L: Label embeddings, obtained by embedding predicted object semantic labels from Google Cloud Vision APIs into a 512D feature vector.…”

Section: Features and Experimental Setup (mentioning)

confidence: 99%

Decoupled Box Proposal and Featurization with Ultrafine-Grained Semantic Labels Improve Image Captioning and Visual Question Answering

Changpinyo, Pang, Sharma, et al. 2019

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conferen

Self Cite


Object detection plays an important role in current solutions to vision and language tasks like image captioning and visual question answering. However, popular models like Faster R-CNN rely on a costly process of annotating ground-truths for both the bounding boxes and their corresponding semantic labels, making it less amenable as a primitive task for transfer learning. In this paper, we examine the effect of decoupling box proposal and featurization for down-stream tasks. The key insight is that this allows us to leverage a large amount of labeled annotations that were previously unavailable for standard object detection benchmarks. Empirically, we demonstrate that this leads to effective transfer learning and improved image captioning and visual question answering models, as measured on publicly available benchmarks.

