2021
DOI: 10.48550/arxiv.2107.06912
Preprint

From Show to Tell: A Survey on Deep Learning-based Image Captioning

Abstract: Connecting Vision and Language plays an essential role in Generative Intelligence. For this reason, large research efforts have been devoted to image captioning, i.e., describing images with syntactically and semantically meaningful sentences. Starting from 2015, the task has generally been addressed with pipelines composed of a visual encoder and a language model for text generation. During these years, both components have evolved considerably through the exploitation of object regions, attributes, the introdu…
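Below is a minimal sketch (our illustration in PyTorch, not code from the survey) of the post-2015 pipeline the abstract describes: a visual encoder supplies image features, and a recurrent language model generates the sentence conditioned on them. All dimensions and the vocabulary size are arbitrary placeholders.

```python
# Minimal encoder-conditioned language model for captioning (illustrative only).
import torch
import torch.nn as nn

class CaptioningModel(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)  # visual features seed the LM state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, tokens):
        # feats: (B, feat_dim) from a visual encoder; tokens: (B, T) caption prefix
        h0 = torch.tanh(self.init_h(feats)).unsqueeze(0)     # (1, B, H)
        c0 = torch.zeros_like(h0)
        hidden, _ = self.lstm(self.embed(tokens), (h0, c0))  # (B, T, H)
        return self.out(hidden)                              # per-step word logits

model = CaptioningModel()
logits = model(torch.randn(2, 2048), torch.randint(0, 10000, (2, 7)))
print(logits.shape)  # torch.Size([2, 7, 10000])
```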

Cited by 20 publications (32 citation statements)
References 157 publications (284 reference statements)
“…Aerial view of a road in autumn. Many approaches have been proposed for image captioning [4,9,13,19,34,35,42,44,47]. Typically, these works utilize an encoder for visual cues and a textual decoder to produce the final caption.…”
Section: * Equal Contribution (mentioning)
confidence: 99%
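To make concrete how such a textual decoder "produces the final caption", here is a toy greedy-decoding loop (our sketch; the BOS/EOS ids, dimensions, and vocabulary are made up, and the cited works use their own architectures and often beam search):

```python
# Toy greedy decoding: feed the most probable word back in until <EOS>.
import torch
import torch.nn as nn

BOS, EOS, VOCAB = 1, 2, 10000  # placeholder token ids and vocabulary size

class StepDecoder(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, embed_dim)
        self.cell = nn.LSTMCell(embed_dim, hidden_dim)
        self.init_h = nn.Linear(feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, VOCAB)

    def decode(self, feats, max_len=20):
        h = torch.tanh(self.init_h(feats))        # seed state from visual features
        c = torch.zeros_like(h)
        token = torch.full((feats.size(0),), BOS, dtype=torch.long)
        caption = []
        for _ in range(max_len):
            h, c = self.cell(self.embed(token), (h, c))
            token = self.out(h).argmax(dim=-1)    # greedy choice at each step
            caption.append(token)
            if (token == EOS).all():
                break
        return torch.stack(caption, dim=1)        # (B, <=max_len) word ids

ids = StepDecoder().decode(torch.randn(1, 2048))
print(ids.shape)
```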
“…Note that our method does not employ CLIP's textual encoder, since there is no input text, and the output text is generated by a language model. Commonly, image captioning [34] models first encode the input pixels as feature vectors, which are then used to produce the final sequence of words. Early works utilize the features extracted from a pre-trained classification network [6,9,13,42], while later works [4,19,47] exploit the more expressive features of an object detection network [31].…”
Section: Related Work (mentioning)
confidence: 99%
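The contrast drawn in this excerpt can be illustrated with a short sketch (ours, using torchvision; the pre-trained weights and random input are placeholders): global features from a pre-trained classification CNN versus region proposals from an object detector, from which bottom-up-attention-style captioners pool one feature vector per detected region.

```python
# Two feature-extraction styles used by captioning models (illustrative only).
import torch
import torchvision

# (a) Classification-network features: ResNet-50 with the classifier head
# removed yields one 2048-d global descriptor per image.
resnet = torchvision.models.resnet50(weights="IMAGENET1K_V2")
resnet.eval()
global_extractor = torch.nn.Sequential(*list(resnet.children())[:-1])

image = torch.randn(1, 3, 224, 224)  # placeholder for a preprocessed image
with torch.no_grad():
    global_feat = global_extractor(image).flatten(1)  # (1, 2048)

# (b) Detection-based features: a pre-trained Faster R-CNN proposes object
# regions; detection-based captioners use one pooled feature per region
# instead of one vector per image.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()
with torch.no_grad():
    detections = detector([image.squeeze(0)])[0]  # dict: 'boxes', 'labels', 'scores'

print(global_feat.shape, detections["boxes"].shape)
```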
“…However, due to a limited number of problem instances, this formulation of a target task hasn't been considered to date in any of the proposed solutions. While image captioning is a flourishing field with visible progress in recent years [100][101][102], none of the existing methods tackles the problem of describing the abstract concepts present in AVR tasks in natural language. Besides the mentioned set of hand-crafted BPs, there aren't any other benchmarks that could be used for evaluating the quality of image captioning methods in AVR settings.…”
Section: Description (mentioning)
confidence: 99%
“…On the other hand, recent progress in image captioning [100][101][102] and natural language generation coupled with scene understanding [149][150][151] suggests that current learning systems are, in principle, capable of generating descriptions in natural language to reasoning problems with visual input. This, in turn, suggests that the lack of successful methods for describing answers to AVR tasks in natural language may arise not from the lack of capacity of the proposed models, but rather from the unavailability of appropriate datasets on which such models could be trained.…”
Section: Description (mentioning)
confidence: 99%
“…Supervised image captioning traditionally relies on paired image-caption data to train a generative model which creates a text description given an input image. In recent years, the research community has significantly raised the level of performance for the image captioning task [35]. Some earlier works such as [18] adopt Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) with a global image feature as input, while others such as [3,40] propose to add attention over the grid of CNN features.…”
Section: Related Work (mentioning)
confidence: 99%
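The attention-over-grid idea from [3,40] can be sketched as a small additive-attention module (our illustration, with arbitrary dimensions; not the cited papers' exact code): at each decoding step, the decoder state scores every spatial location of the CNN feature map, and the next word is predicted from the attention-weighted visual context.

```python
# Additive attention over a flattened grid of CNN features (illustrative only).
import torch
import torch.nn as nn

class GridAttention(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, attn_dim)
        self.proj_state = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, grid: torch.Tensor, state: torch.Tensor):
        # grid: (B, L, feat_dim), L flattened spatial locations, e.g. L = 14*14
        # state: (B, hidden_dim), current decoder hidden state
        e = self.score(torch.tanh(self.proj_feat(grid) +
                                  self.proj_state(state).unsqueeze(1)))  # (B, L, 1)
        alpha = torch.softmax(e, dim=1)      # attention weights over locations
        context = (alpha * grid).sum(dim=1)  # (B, feat_dim) weighted visual context
        return context, alpha.squeeze(-1)

# usage: a 14x14 grid of 2048-d CNN features attended by a 512-d decoder state
attn = GridAttention(2048, 512, 256)
context, alpha = attn(torch.randn(2, 196, 2048), torch.randn(2, 512))
print(context.shape, alpha.shape)  # torch.Size([2, 2048]) torch.Size([2, 196])
```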