Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2018
DOI: 10.18653/v1/p18-1238

Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning

Abstract: We present a new dataset of image caption annotations, Conceptual Captions, which contains an order of magnitude more images than the MS-COCO dataset (Lin et al., 2014) and represents a wider variety of both images and image caption styles. We achieve this by extracting and filtering image caption annotations from billions of webpages. We also present quantitative evaluations of a number of image captioning models and show that a model architecture based on Inception-ResNet-v2 (Szegedy et al., 2016) for image-…
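The abstract names Inception-ResNet-v2 as the image encoder of the evaluated captioning model. As a hedged illustration only (not the authors' code), the sketch below shows how a pretrained Inception-ResNet-v2 can serve as a global image-feature extractor; the timm model name, the pooled 1536-D output, and the dummy input are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch: a pretrained Inception-ResNet-v2 as an image-feature extractor,
# the backbone named in the abstract. Assumes the timm package registers the
# model as "inception_resnet_v2"; this is not the paper's actual pipeline.
import timm
import torch

encoder = timm.create_model("inception_resnet_v2", pretrained=True, num_classes=0)
encoder.eval()  # num_classes=0 drops the classifier head and returns pooled features

with torch.no_grad():
    image = torch.randn(1, 3, 299, 299)   # dummy RGB image at the expected 299x299 size
    features = encoder(image)             # pooled feature vector, shape (1, 1536)

print(features.shape)
```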

Cited by 1,529 publications (1,143 citation statements) | References 21 publications

Citation statements, ordered by relevance:
“…Second, we report the results in the unsupervised setting with independent image and language sources. We experiment with Flickr30k Images [49] paired with COCO captions and COCO images paired with Google's Conceptual Captions dataset (GCC) [52]. Finally, we show qualitative results for image descriptions with varying text sources.…”
Section: Experiments and Results (mentioning)
confidence: 99%
“…Image captioning datasets have ignited a great deal of research at the intersection of the computer vision and natural language processing communities (Lin et al., 2014; Vinyals et al., 2015; Bernardi et al., 2016; Anderson et al., 2018). Getting annotators to provide captions works well with crowd computing, but Sharma et al. (2018) exploit incidental supervision for this task to obtain greater scale with their Conceptual Captions dataset. It contains 3.3 million pairs of images and textual captions, where pairs are extracted from HTML web pages using the alt-text field of images as a starting point for their descriptions.…”
Section: Conceptual Captions (mentioning)
confidence: 99%
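The excerpt above describes how Conceptual Captions pairs are harvested from HTML pages, starting from the alt-text attribute of images. As a rough sketch of that starting point only (the actual pipeline applies many further filters over billions of pages), the snippet below pulls candidate (image URL, alt-text) pairs from one page with BeautifulSoup; the function name and the minimal word-count check are illustrative assumptions, not the paper's filtering rules.

```python
# Illustrative sketch: harvesting candidate (image URL, alt-text) pairs from
# one HTML page, the starting point described for Conceptual Captions.
# The real pipeline applies many further filters; see Sharma et al. (2018).
from bs4 import BeautifulSoup

def candidate_pairs(html: str):
    """Yield (src, alt_text) pairs whose alt-text looks like a short description."""
    soup = BeautifulSoup(html, "html.parser")
    for img in soup.find_all("img"):
        src = img.get("src")
        alt = (img.get("alt") or "").strip()
        if src and len(alt.split()) >= 3:  # crude stand-in for the paper's filters
            yield src, alt

page = '<img src="dog.jpg" alt="a dog runs along the beach at sunset">'
print(list(candidate_pairs(page)))
```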
“…There are many sequential filtering steps for improving the quality of the captions; see Sharma et al. (2018) for a thorough description. As quality control, a random sample of 4K Conceptual Captions was rated by human annotators, and 90.3% were judged "good" by at least 2 out of 3 raters.…”
Section: Conceptual Captions (mentioning)
confidence: 99%
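The quality-control figure quoted above (90.3% judged "good" by at least 2 of 3 raters) is a simple majority-vote tally. The sketch below reproduces that computation on made-up ratings; the example captions and votes are placeholders, not the paper's annotation data.

```python
# Majority-vote quality check as described above: a caption counts as "good"
# if at least 2 of its 3 binary ratings are positive. Ratings are invented
# placeholders for illustration only.
ratings = [
    ("a man rides a wave on a surfboard", [1, 1, 0]),
    ("image may contain text",            [0, 0, 1]),
    ("the city skyline at night",         [1, 1, 1]),
]

good = sum(1 for _, votes in ratings if sum(votes) >= 2)
print(f"{good}/{len(ratings)} captions judged good "
      f"({100.0 * good / len(ratings):.1f}%)")
```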
“…Our objective: the main goal is to compare two approaches to using bottom-up signals: 1) FRCNN: use the default visual features from the Faster R-CNN detector; 2) Ultra: use bounding boxes from the Faster R-CNN detector, then fea-[…]. Image captioning dataset: we use the Conceptual Captions (CC) dataset (Sharma et al., 2018), consisting of 3.3 million training and 15,000 validation image/caption pairs. Another 12,000 image/caption pairs comprise the hidden test set.…”
Section: Features and Experimental Setup (mentioning)
confidence: 99%
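The comparison above contrasts the detector's default visual features with using only its bounding boxes as the bottom-up signal. As a hedged sketch of obtaining such boxes, the snippet below runs an off-the-shelf Faster R-CNN from torchvision; the pretrained detector and the 0.5 score threshold are stand-ins, not the cited work's actual detector or settings.

```python
# Illustrative only: extracting bounding boxes and class labels from a
# pretrained Faster R-CNN, i.e. the kind of bottom-up signal discussed above.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

detector = fasterrcnn_resnet50_fpn(pretrained=True)
detector.eval()

image = torch.rand(3, 480, 640)        # dummy RGB image with values in [0, 1]
with torch.no_grad():
    (out,) = detector([image])         # one output dict per input image

keep = out["scores"] > 0.5             # drop low-confidence detections
boxes = out["boxes"][keep]             # (N, 4) box coordinates
labels = out["labels"][keep]           # (N,) predicted class indices
print(boxes.shape, labels.shape)
```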
“…Model: we adopt the encoder-decoder model from Sharma et al. (2018), whose basic building block is a Transformer network (Vaswani et al., 2017). To convert multi-modal inputs to a sequence of encoder feature vectors, we use up to three types of image features: L: label embeddings, obtained by embedding predicted object semantic labels from the Google Cloud Vision APIs into a 512-D feature vector.…”
Section: Features and Experimental Setup (mentioning)
confidence: 99%
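The excerpt describes turning multi-modal inputs into a sequence of encoder feature vectors, including 512-D embeddings of predicted object labels fed to a Transformer. The sketch below illustrates just that label-embedding step in PyTorch; the vocabulary size, layer count, and example label ids are assumptions for illustration, not the cited model's configuration.

```python
# Hedged sketch of the "label embeddings" input described above: predicted
# object labels are embedded into 512-D vectors and encoded with a Transformer.
# Vocabulary size, layer count, and label ids are illustrative assumptions.
import torch
import torch.nn as nn

NUM_LABELS, D_MODEL = 10_000, 512           # assumed label vocabulary size

embed = nn.Embedding(NUM_LABELS, D_MODEL)   # label id -> 512-D vector
layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

label_ids = torch.tensor([[17, 402, 951]])  # e.g. ids for "dog", "beach", "sky"
memory = encoder(embed(label_ids))          # (1, 3, 512) encoder feature vectors
print(memory.shape)
```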