A Comprehensive Survey of Deep Learning for Image Captioning

Hossain, Md. Zakir; Sohel, Ferdous; Shiratuddin, Mohd Fairuz; Laga, Hamid

doi:10.1145/3295748

Cited by 631 publications

(334 citation statements)

References 123 publications

Supporting

Mentioning

331

Contrasting

Unclassified


“…Understanding image captioning is essential because it is the fundamental building block of any captioning pipeline. We, thus, briefly overview some of the most relevant works and refer the readers to [7] for further reading.…”

Section: Related Work 21 Image Captioning

mentioning

confidence: 99%

“…$\alpha^{\omega}_{i,j}$, $\alpha^{g}_{i,j}$, and $\alpha^{v}_{i,j}$ are defined in Eqs. (11), (7), and (10), respectively. We empirically validate the hypothesis by studying the quantities of the attentions (provided in Figure 5) estimated from different schemes.…”

Section: Geometry

mentioning

confidence: 99%

“…Advancements in computer vision applications, such as object detection and segmentation, have laid a strong foundation of comprehensive context understanding in images. Besides learning on visual domain, tasks such as image captioning (IC) [3,7,24] and visual question answering (VQA) [1] are the iconic examples that connect vision and language modalities to not only provide better visual reasoning, but also enable multimodal context understanding. The IC task is to generate a human understandable sentence from a given image.…”

Section: Introduction

mentioning

confidence: 99%


Geometry-aware Relational Exemplar Attention for Dense Captioning

Wang

Tavakoli

Sjöberg

et al. 2019

1st International Workshop on Multimodal Understanding and Learning for Embodied Applications


Dense captioning (DC), which provides a comprehensive context understanding of images by describing all salient visual groundings in an image, facilitates multimodal understanding and learning. As an extension of image captioning, DC is developed to discover richer sets of visual contents and to generate captions of wider diversity and increased details. The state-of-the-art models of DC consist of three stages: (1) region proposals, (2) region classification, and (3) caption generation for each proposal. They are typically built upon the following ideas: (a) guiding the caption generation with image-level features as the context cues along with regional features and (b) refining locations of region proposals with caption information. In this work, we propose (a) a joint visual-textual criterion exploited by the region classifier that further improves both region detection and caption accuracy, and (b) a Geometry-aware Relational Exemplar attention (GREatt) mechanism to relate region proposals. The former helps the model learn a region classifier by effectively exploiting both visual groundings and caption descriptions. Rather than treating each region proposal in isolation, the latter relates regions in complementary relations, i.e., contextually dependent, visually supported, and geometry relations, to enrich context information in regional representations. We conduct an extensive set of experiments and demonstrate that our proposed model improves the state-of-the-art by at least +5.3% in terms of the mean average precision on the Visual Genome dataset.


“…Image indexing is important for Content-Based Image Retrieval (CBIR) and can therefore be applied in many areas, including biomedicine, commerce, education, libraries, and web search. The task is also used on social media platforms to infer, from the image, where the user is (beach, café, etc.) [Hossain et al 2019]. Another example would be producing explanations of what happens in a video, frame by frame, since a frame is a static image, indicating each scene, which could be a great help for visually impaired people.…”

Section: Introduction

unclassified

Deep Learning para Geração Automática de Legenda de Imagem

Scoparo

Serapião

2019

Anais Do XVI Encontro Nacional De Inteligência Artificial E Computacional (ENIAC 2019)


Automatic image caption generation is a task that consists of interpreting an image and describing it in natural-language sentences. It combines Natural Language Processing and Computer Vision to generate captions. Recently, Deep Learning methods have been achieving very promising results on the caption generation problem. This work proposes, based on the NIC (Neural Image Caption) model, a combination of convolutional neural networks over images and a recurrent neural network over sentences, aligning them toward a structured objective of creating the textual description of images. The results show that the proposed neural model was able to learn a language model over the image content, producing accurate descriptions for most images.


“…Gated Recurrent Units (GRUs [2]) have been successful in many applications involving sequential data. Examples can be found in text classification [3], image and video captioning [4,5], speech recognition [6,7], and action and gesture recognition [8][9][10]. The success of these deep learning models lies in the complex feature representations they learn from the training data and encoding the temporal information.…”

mentioning

confidence: 99%

Looking Under the Hood: Visualizing What LSTMs Learn

Patil

Draper

Beveridge

2019

Lecture Notes in Computer Science


Recurrent Neural Networks (RNNs) such as Long Short Term Memory (LSTM) and Gated Recurrent Units (GRUs) have been successful in many applications involving sequential data. The success of these models lies in the complex feature representations they learn from the training data. One criterion for trusting a model is its validation accuracy. However, this can lead to surprises when the network learns properties of the input data that differ from what the designer intended and/or the user assumes. As a result, we lack confidence in even high-performing networks when they are deployed in applications with novel input data, or where the cost of failure is very high. Thus, understanding and visualizing what recurrent networks have learned becomes essential. Visualizations of RNN models are better established in the field of natural language processing than in computer vision. This work presents visualizations of what recurrent networks, particularly LSTMs, learn in the domain of action recognition, where the inputs are sequences of 3D human poses, or skeletons. The goal of the thesis is to understand the properties learned by a network with regard to an input action sequence, and how it will generalize to novel inputs. This thesis presents two methods for visualizing concepts learned by RNNs in the domain of action recognition, providing an independent insight into the working of the recognition model. The first visualization method shows the sensitivity of joints over time in a video sequence. The second visualization method generates synthetic videos that maximize the responses of a class label or hidden unit within a set of known anatomical constraints. These techniques are combined in a visualization tool called SkeletonVis to help developers and users gain insights into models embedded in RNNs for action recognition. We present case studies on NTU-RGBD, a popular data set for action recognition, to reveal properties learnt by a trained LSTM network.
