“…The current dominant framework for VQA systems consists of an image encoder, a question encoder, multimodal fusion, and an answer predictor. Instead of directly using visual features from CNN-based feature extractors, [56,11,41,33,49,38,63,36] explored various image attention mechanisms to locate regions that are relevant to the question. To learn a better representation of the question, [33,38,11] proposed to perform question-guided image attention and image-guided question attention jointly, merging knowledge from both the visual and textual modalities in the encoding stage.…”
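The excerpt describes the standard four-part VQA pipeline at a high level. The sketch below is a minimal, illustrative instance of that pipeline with question-guided image attention, not the architecture of any specific cited work; the module name, feature dimensions (e.g., 2048-d region features, a GRU question encoder), and the element-wise-product fusion are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QuestionGuidedAttentionVQA(nn.Module):
    """Minimal VQA sketch: question encoder, question-guided attention
    over image region features, multimodal fusion, answer predictor."""

    def __init__(self, vocab_size=10000, num_answers=3000,
                 img_dim=2048, hidden_dim=512):
        super().__init__()
        # Question encoder: word embedding + GRU.
        self.embed = nn.Embedding(vocab_size, 300)
        self.gru = nn.GRU(300, hidden_dim, batch_first=True)
        # Question-guided image attention: score each region against the question.
        self.att_fc = nn.Linear(img_dim + hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        # Fusion and answer predictor.
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, img_feats, question_tokens):
        # img_feats: (B, R, img_dim) region features from a pretrained CNN/detector.
        # question_tokens: (B, T) padded word indices.
        _, q = self.gru(self.embed(question_tokens))     # (1, B, hidden_dim)
        q = q.squeeze(0)                                  # (B, hidden_dim)

        # Attention weights over image regions, conditioned on the question.
        B, R, _ = img_feats.shape
        q_tiled = q.unsqueeze(1).expand(-1, R, -1)        # (B, R, hidden_dim)
        joint = torch.tanh(self.att_fc(torch.cat([img_feats, q_tiled], dim=-1)))
        alpha = F.softmax(self.att_score(joint), dim=1)   # (B, R, 1)
        attended = (alpha * img_feats).sum(dim=1)         # (B, img_dim)

        # Multimodal fusion (element-wise product) and answer prediction.
        fused = self.img_proj(attended) * q
        return self.classifier(fused)                     # (B, num_answers)


if __name__ == "__main__":
    model = QuestionGuidedAttentionVQA()
    img = torch.randn(2, 36, 2048)            # 36 region features per image
    qs = torch.randint(0, 10000, (2, 14))     # batch of tokenized questions
    print(model(img, qs).shape)               # torch.Size([2, 3000])
```

The co-attention approaches cited above ([33,38,11]) additionally attend over question words conditioned on the image; this sketch shows only the image-attention direction for compactness.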