2013
DOI: 10.1109/tpami.2012.162

BabyTalk: Understanding and Generating Simple Image Descriptions

Abstract: We present a system to automatically generate natural language descriptions from images. This system consists of two parts. The first part, content planning, smooths the output of computer vision-based detection and recognition algorithms with statistics mined from large pools of visually descriptive text to determine the best content words to use to describe an image. The second step, surface realization, chooses words to construct natural language sentences based on the predicted content and general…

Citations: cited by 704 publications (373 citation statements)
References: 39 publications
“…Our work has been inspired by the works building very large-scale image databases [8,38] and the works establishing semantic connections of texts and images [25]. We observe good semantic coherence between labels obtained by hierarchical document topic models [6] and clinician's assessment.…”
Section: Introduction (mentioning)
confidence: 84%
“…Image-to-language correspondence was learned from ImageNet dataset and reasonably high quality image description datasets (Pascal1K [36], Flickr8K [16], Flickr30K [47]) in [20], where such caption datasets are not available in the medical domain. Graphical models have been employed to predict image attributes ([27,39]), or to describe images ([25]) using manually annotated datasets ([36,26]). Automatic label mining on large, unlabeled datasets is presented in [35,18], however the variety of the label-space is limited (image text annotations).…”
Section: Related Work (mentioning)
confidence: 99%
“…Here we take an analogous approach (modifying the image retrieval stage of data-driven pipeline) for the task of image captioning. There has been significant recent interest in generating natural language descriptions of photographs (Kulkarni et al 2013; Farhadi et al 2010b). These techniques are typically quite complex: they recognize various visual concepts such as objects, materials, scene types, and the spatial relationship among these entities, and then generate plausible natural language sentences based on this scene understanding.…”
Section: Scene Attributes As Global Features (mentioning)
confidence: 99%