2019
DOI: 10.1145/3295748

A Comprehensive Survey of Deep Learning for Image Captioning

Abstract: Generating a description of an image is called image captioning. Image captioning requires recognizing the important objects, their attributes, and their relationships in an image. It also needs to generate syntactically and semantically correct sentences. Deep learning-based techniques are capable of handling the complexities and challenges of image captioning. In this survey paper, we aim to present a comprehensive review of existing deep learning-based image captioning techniques. We discuss the foundation …

Cited by 631 publications (334 citation statements)
References 123 publications
“…Understanding image captioning is essential because it is the fundamental building block of any captioning pipeline. We, thus, briefly overview some of the most relevant works and refer the readers to [7] for further reading.…”
Section: Related Work 21 Image Captioning
mentioning
confidence: 99%
“…$\alpha^{\omega}_{i,j}$, $\alpha^{g}_{i,j}$, and $\alpha^{v}_{i,j}$ are defined in Eqs. (11), (7), and (10), respectively. We empirically validate the hypothesis by studying the quantities of the attentions (provided in Figure 5) estimated from different schemes.…”
Section: Geometry
mentioning
confidence: 99%
“…Image indexing is important for Content-Based Image Retrieval (CBIR) and can therefore be applied in many areas, including biomedicine, commerce, education, libraries, and web search. The task can also be used on social media platforms to infer, from an image, where the user is (beach, café, etc.) [Hossain et al 2019]. Another example would be producing explanations of what happens in a video, frame by frame, since a frame is a static image, indicating each scene, which could be a great aid for visually impaired people.…”
Section: Introduction
unclassified
“…I am grateful for all the discussions I had with my fellow graduate students in the … Gated Recurrent Units (GRUs [2]) have been successful in many applications involving sequential data. Examples can be found in text classification [3], image and video captioning [4,5], speech recognition [6,7], and action and gesture recognition [8][9][10]. The success of these deep learning models lies in the complex feature representations they learn from the training data and encoding the temporal information.…”
mentioning
confidence: 99%