Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images

Malinowski, Mateusz; Rohrbach, Marcus; Fritz, Mario

doi:10.1109/iccv.2015.9

Cited by 525 publications

(455 citation statements)

References 35 publications

Mentioning: 450

“…Given the availability of new datasets, an array of visual QA models have been proposed to tackle QA tasks. The proposed models range from SVM classifiers and probabilistic inference (Malinowski and Fritz 2014) to recurrent neural networks (Gao et al 2015; Malinowski et al 2015; Ren et al 2015a) and convolutional networks. Visual Genome aims to capture the details of the images with diverse question types and long answers.…”

Section: Question Answering (mentioning)

confidence: 99%

“…3.1). With this information, MS-COCO and VQA provide a fertile training and testing ground for models aimed at tasks for accurate object detection, segmentation, and summary-level image captioning (Kiros et al 2014; Mao et al 2014; Karpathy and Fei-Fei 2015) as well as basic QA (Ren et al 2015a; Malinowski et al 2015; Gao et al 2015; Malinowski and Fritz 2014). For example, a state-of-the-art model (Karpathy and Fei-Fei 2015) provides a description of one MS-COCO image in To understand images thoroughly, we believe three key elements need to be added to existing datasets: a grounding of visual concepts to language (Kiros et al 2014), a more complete set of descriptions and QAs for each image based on multiple image regions (Johnson et al 2015), and a formalized representation of the components of an image (Hayes 1978).…”

Section: Introduction (mentioning)

confidence: 99%

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

Krishna, Zhu, Groth et al. 2017

Int J Comput Vis

4,384

3,078

Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in an image. When asked "What vehicle is the person riding?", computers will need to identify the objects in an image as well as the relationships riding(man, carriage) and pulling(horse, carriage) to answer correctly that "the person is riding a horse-drawn carriage." In this paper, we present the Visual Genome dataset to enable the modeling of such relationships. We collect dense annotations of objects, attributes, and relationships within each image to learn these models. Specifically, our dataset contains over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects. We canonicalize the objects, attributes, relationships, and noun phrases in region descriptions and questions answer pairs to WordNet synsets. Together, these annotations represent the densest

“…These models utilize CNN to extract semantic representations from images and encode questions via RNN, especially LSTM, and then combine two modalities with an appropriate joint learning method. Many previous methods [1][2][3][7] adopt this approach, while some [5,8] solve VQA task by modifying the basic idea. Besides LSTM, these approaches [3, 9-11] adopted GRU to extract high-level semantic and some [4,12,13] utilized CNN to encode question.…”

Section: A Joint Embedding (mentioning)

confidence: 99%

“…There are several methods different from above ones, which addressed VQA task as a multi-way classification problem. In [7], the model fed both image and question into LSTM at each time step, and then generated the answer. Wu et al [14] extracted attributes from image and generated descriptions of image as input of LSTM to generate answer by sequence-to-sequence learning.…”

Section: A Joint Embedding (mentioning)

confidence: 99%

Multimodal Cross-guided Attention Networks for Visual Question Answering

Liu, Gong, Yang et al. 2018

Advances in Intelligent Systems Research

Abstract: Visual Question Answering (VQA) is an attractive topic combining computer vision with natural language processing. It is more challenging than text-based question answering because of its multimodal nature. The VQA reasoning process requires both effective semantic embedding and fine-grained visual comprehension. Existing approaches predominantly infer answers from visual spatial information, while neglecting important semantic information in questions and the guidance information between images and questions. To remedy this, we imitate the human mechanism of cross-reasoning about visual and textual information and propose a multimodal cross-guided attention network (MCAN) for VQA which employs a cross-guided joint learning strategy with a gated activation learning method, which can simultaneously capture both rich visual spatial information and significant semantic information. We evaluate the proposed model on two public datasets: VQA dataset and COCO-QA dataset. Extensive experiments show state-of-the-art performance on the datasets.

“…Visual perception is generally considered less ambiguous than language. In the computer vision community large collections of images and their language descriptions are being created from which a machine can learn interesting perceptual knowledge (e.g., [24,40]). The models of [14,38] are capable of learning semantic common sense knowledge from images and their textual descriptions and of imagining visual scenes that may contain more objects than the ones mentioned in a text.…”

Section: How Can a Machine Learn Common Sense And World Knowledge Fro… (mentioning)

confidence: 99%

Argumentation mining: How can a machine acquire common sense and world knowledge?

Moens 2018

AAC

Abstract. Argumentation mining is an advanced form of human language understanding by the machine. This is a challenging task for a machine. When sufficient explicit discourse markers are present in the language utterances, the argumentation can be interpreted by the machine with an acceptable degree of accuracy. However, in many real settings, the mining task is difficult due to the lack or ambiguity of the discourse markers, and the fact that a substantial amount of knowledge needed for the correct recognition of the argumentation, its composing elements and their relationships is not explicitly present in the text, but makes up the background knowledge that humans possess when interpreting language. In this article we focus on how the machine can automatically acquire the needed common sense and world knowledge. As very few research has been done in this respect, many of the ideas proposed in this article are tentative, but start being researched. We give an overview of the latest methods for human language understanding that map language to a formal knowledge representation that facilitates other tasks (for instance, a representation that is used to visualize the argumentation or that is easily shared in a decision or argumentation support system). Most current systems are trained on texts that are manually annotated. Then we go deeper into the new field of representation learning that nowadays is very much studied in computational linguistics. This field investigates methods for representing language as statistical concepts or as vectors, allowing straightforward methods of compositionality. The methods often use deep learning and its underlying neural network technologies to learn concepts from large text collections in an unsupervised way (i.e., without the need for manual annotations). We show how these methods can help the argumentation mining process, but also demonstrate that these methods need further research to automatically acquire the necessary background knowledge and more specifically common sense and world knowledge. We propose a number of ways to improve the learning of common sense and world knowledge by exploiting textual and visual data, and touch upon how we can integrate the learned knowledge in the argumentation mining process.
