2015 IEEE International Conference on Computer Vision (ICCV) 2015
DOI: 10.1109/iccv.2015.9
Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images

Abstract: We address a question answering task on real-world images that is set up as a Visual Turing Test. By combining latest advances in image representation and natural language processing, we propose Neural-Image-QA, an end-to-end formulation to this problem for which all parts are trained jointly. In contrast to previous efforts, we are facing a multi-modal problem where the language output (answer) is conditioned on visual and natural language input (image and question). Our approach Neural-Image-QA doubles the pe…
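The abstract's central point, that the answer is generated conditioned jointly on the image and the question, can be written compactly as word-by-word prediction. The formulation below is a sketch consistent with the abstract's description; the symbols (x for the image, q for the question, the previously predicted answer words, and θ for the jointly trained parameters) are notation introduced here, not quoted from the paper.

```latex
\hat{a}_t = \operatorname*{arg\,max}_{a \in \mathcal{V}}
            \; p\!\left(a \mid x,\, q,\, \hat{a}_{1:t-1};\, \theta\right),
\qquad t = 1, \dots, T
```

where \mathcal{V} is the answer vocabulary and generation stops once an end-of-answer token is predicted.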

Cited by 525 publications (455 citation statements) · References 35 publications
“…Given the availability of new datasets, an array of visual QA models have been proposed to tackle QA tasks. The proposed models range from SVM classifiers and probabilistic inference (Malinowski and Fritz 2014) to recurrent neural networks (Gao et al 2015; Malinowski et al 2015; Ren et al 2015a) and convolutional networks. Visual Genome aims to capture the details of the images with diverse question types and long answers.…”
Section: Question Answering
confidence: 99%
“…3.1). With this information, MS-COCO and VQA provide a fertile training and testing ground for models aimed at accurate object detection, segmentation, and summary-level image captioning (Kiros et al 2014; Mao et al 2014; Karpathy and Fei-Fei 2015) as well as basic QA (Ren et al 2015a; Malinowski et al 2015; Gao et al 2015; Malinowski and Fritz 2014). For example, a state-of-the-art model (Karpathy and Fei-Fei 2015) provides a description of one MS-COCO image. To understand images thoroughly, we believe three key elements need to be added to existing datasets: a grounding of visual concepts to language (Kiros et al 2014), a more complete set of descriptions and QAs for each image based on multiple image regions (Johnson et al 2015), and a formalized representation of the components of an image (Hayes 1978).…”
Section: Introduction
confidence: 99%
“…These models utilize a CNN to extract semantic representations from images and encode the question via an RNN, especially an LSTM, and then combine the two modalities with an appropriate joint learning method. Many previous methods [1][2][3][7] adopt this approach, while some [5, 8] solve the VQA task by modifying the basic idea. Besides the LSTM, these approaches [3, 9-11] adopted a GRU to extract high-level semantics, and some [4, 12, 13] utilized a CNN to encode the question.…”
Section: A Joint Embedding
confidence: 99%
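The joint-embedding recipe summarized in this excerpt (a CNN for image features, an LSTM question encoder, a fusion step, and an answer predictor) can be sketched as follows. This is a minimal illustration assuming PyTorch; the class name, the ResNet-18 backbone, the element-wise fusion, and all dimensions are assumptions made for the sketch, not details of any cited model.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class JointEmbeddingVQA(nn.Module):
    """Hypothetical sketch of the generic joint-embedding VQA pipeline."""
    def __init__(self, vocab_size, num_answers, embed_dim=300, hidden_dim=512):
        super().__init__()
        # CNN backbone extracts a global image representation (drop the final fc).
        cnn = models.resnet18()
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])
        self.img_proj = nn.Linear(512, hidden_dim)
        # LSTM encodes the question word by word.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # The fused representation is classified over a fixed answer set.
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, image, question_tokens):
        img_feat = self.cnn(image).flatten(1)            # (B, 512)
        img_feat = torch.tanh(self.img_proj(img_feat))   # (B, H)
        _, (h_n, _) = self.lstm(self.embed(question_tokens))
        q_feat = h_n[-1]                                 # (B, H), last hidden state
        fused = img_feat * q_feat                        # element-wise joint embedding
        return self.classifier(fused)                    # scores over candidate answers
```

Other fusion choices (concatenation, bilinear pooling, attention) slot into the same skeleton by replacing the element-wise product.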
“…There are several methods different from the above ones, which addressed the VQA task as a multi-way classification problem. In [7], the model fed both the image and the question into an LSTM at each time step and then generated the answer. Wu et al. [14] extracted attributes from the image and generated image descriptions as input to an LSTM, producing the answer by sequence-to-sequence learning.…”
Section: A Joint Embedding
confidence: 99%
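The variant described in [7], where the image representation accompanies every word fed to the LSTM and the answer is then generated sequentially rather than chosen by a multi-way classifier, could look roughly like the following. Again a hedged sketch under assumed names and dimensions, not the cited implementation; decoding is shown as a simple greedy loop.

```python
import torch
import torch.nn as nn

class GenerativeVQADecoder(nn.Module):
    """Hypothetical sketch: image feature concatenated with each word embedding."""
    def __init__(self, vocab_size, img_dim=512, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # The image feature is appended to the word embedding at every time step.
        self.lstm_cell = nn.LSTMCell(embed_dim + img_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat, question_tokens, max_answer_len=5):
        batch = img_feat.size(0)
        h = img_feat.new_zeros(batch, self.lstm_cell.hidden_size)
        c = img_feat.new_zeros(batch, self.lstm_cell.hidden_size)
        # Encode the question: image feature + word embedding at each step.
        for t in range(question_tokens.size(1)):
            x = torch.cat([self.embed(question_tokens[:, t]), img_feat], dim=1)
            h, c = self.lstm_cell(x, (h, c))
        # Decode the answer greedily, one word per step.
        answer, prev = [], question_tokens[:, -1]
        for _ in range(max_answer_len):
            x = torch.cat([self.embed(prev), img_feat], dim=1)
            h, c = self.lstm_cell(x, (h, c))
            prev = self.out(h).argmax(dim=1)
            answer.append(prev)
        return torch.stack(answer, dim=1)  # (B, max_answer_len) answer token ids
```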
“…Visual perception is generally considered less ambiguous than language. In the computer vision community, large collections of images and their language descriptions are being created, from which a machine can learn interesting perceptual knowledge (e.g., [24,40]). The models of [14,38] are capable of learning semantic common sense knowledge from images and their textual descriptions and of imagining visual scenes that may contain more objects than the ones mentioned in a text.…”
Section: How Can a Machine Learn Common Sense and World Knowledge From
confidence: 99%