From Show to Tell: A Survey on Deep Learning-based Image Captioning

Stefanini, Matteo; Cornia, Marcella; Baraldi, Lorenzo; Cascianelli, Silvia; Fiameni, Giuseppe; Cucchiara, Rita

doi:10.48550/arxiv.2107.06912

Cited by 20 publications (32 citation statements)

References 157 publications (284 reference statements)

“…Aerial view of a road in autumn. Many approaches have been proposed for image captioning [4,9,13,19,34,35,42,44,47]. Typically, these works utilize an encoder for visual cues and a textual decoder to produce the final caption.…”

Section: * Equal Contribution (mentioning)

confidence: 99%

“…Note that our method does not employ the CLIP's textual encoder, since there is no input text, and the output text is generated by a language model. Commonly, image captioning [34] models first encode the input pixels as feature vectors, which are then used to produce the final sequence of words. Early works utilize the features extracted from a pre-trained classification network [6,9,13,42], while later works [4,19,47] exploit the more expressive features of an object detection network [31].…”

Section: Related Work (mentioning)

confidence: 99%

ClipCap: CLIP Prefix for Image Captioning

Mokady, Hertz, Bermano

2021

Preprint

Image captioning is a fundamental task in vision-language understanding, where the model predicts an informative textual caption for a given input image. In this paper, we present a simple approach to address this task. We use the CLIP encoding as a prefix to the caption, by employing a simple mapping network, and then fine-tune a language model to generate the image captions. The recently proposed CLIP model contains rich semantic features which were trained with textual context, making it best for vision-language perception. Our key idea is that together with a pre-trained language model (GPT2), we obtain a wide understanding of both visual and textual data. Hence, our approach only requires rather quick training to produce a competent captioning model. Without additional annotations or pre-training, it efficiently generates meaningful captions for large-scale and diverse datasets. Surprisingly, our method works well even when only the mapping network is trained, while both CLIP and the language model remain frozen, allowing a lighter architecture with fewer trainable parameters. Through quantitative evaluation, we demonstrate that our model achieves comparable results to state-of-the-art methods on the challenging Conceptual Captions and nocaps datasets, while it is simpler, faster, and lighter. Our code is available at https://github.com/rmokady/CLIP_prefix_caption.

“…However, due to a limited number of problem instances, this formulation of a target task hasn't been considered to date in any of the proposed solutions. While image captioning is a flourishing field with visible progress in recent years [100][101][102], none of the existing methods tackles the problem of describing the abstract concepts present in AVR tasks in natural language. Besides the mentioned set of hand-crafted BPs, there aren't any other benchmarks that could be used for evaluating the quality of image captioning methods in AVR settings.…”

Section: Description (mentioning)

confidence: 99%

“…On the other hand, recent progress in image captioning [100][101][102] and natural language generation coupled with scene understanding [149][150][151] suggests that current learning systems are, in principle, capable of generating descriptions in natural language to reasoning problems with visual input. This, in turn, suggests that the lack of successful methods for describing answers to AVR tasks in natural language may arise not from the lack of capacity of the proposed models, but rather from the unavailability of appropriate datasets on which such models could be trained.…”

Section: Description (mentioning)

confidence: 99%

A Review of Emerging Research Directions in Abstract Visual Reasoning

Małkiński, Mańdziuk

2022

Preprint

Abstract Visual Reasoning (AVR) problems are commonly used to approximate human intelligence. They test the ability to apply previously gained knowledge, experience and skills in a completely new setting, which makes them particularly well-suited for this task. Recently, the AVR problems have become popular as a proxy to study machine intelligence, which has led to the emergence of new distinct types of problems and multiple benchmark sets. In this work we review this emerging AVR research and propose a taxonomy to categorise the AVR tasks along 5 dimensions: input shapes, hidden rules, target task, cognitive function, and main challenge. The perspective taken in this survey makes it possible to characterise AVR problems with respect to their shared and distinct properties, provides a unified view on the existing approaches for solving AVR tasks, shows how the AVR problems relate to practical applications, and outlines promising directions for future work. One of them refers to the observation that in the machine learning literature different tasks are considered in isolation, which is in stark contrast with the way the AVR tasks are used to measure human intelligence, where multiple types of problems are combined within a single IQ test.

“…Supervised image captioning traditionally relies on paired image-caption data to train a generative model which creates a text description given an input image. In recent years, the research community has significantly raised the level of performance for the image captioning task [35]. Some earlier work such as [18] adopts Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) with global image feature as input, while others such as [3,40] proposed to add attention over the grid of CNN features.…”

Section: Related Work (mentioning)

confidence: 99%