2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.00209

MUREL: Multimodal Relational Reasoning for Visual Question Answering

Abstract: Multimodal attentional networks are currently the state-of-the-art models for Visual Question Answering (VQA) tasks involving real images. Although attention allows the model to focus on the visual content relevant to the question, this simple mechanism is arguably insufficient to model the complex reasoning required for VQA and other high-level tasks. In this paper, we propose MuRel, a multimodal relational network that is learned end-to-end to reason over real images. Our first contribution is the introduction of the M…
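The abstract contrasts plain attention with relational reasoning. For context, here is a minimal PyTorch sketch (not the authors' code) of the question-guided soft attention that multimodal attentional baselines typically apply: each region feature is scored against the question embedding, and the image is summarized as the attention-weighted sum of regions. The module name and all dimensions (QuestionGuidedAttention, q_dim, v_dim, hidden_dim) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QuestionGuidedAttention(nn.Module):
    """Hypothetical sketch of soft attention over image regions,
    conditioned on a question embedding."""

    def __init__(self, q_dim: int, v_dim: int, hidden_dim: int = 512):
        super().__init__()
        # Scores one (question, region) pair; shared across regions.
        self.score = nn.Sequential(
            nn.Linear(q_dim + v_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, q: torch.Tensor, regions: torch.Tensor) -> torch.Tensor:
        # q: (batch, q_dim); regions: (batch, n_regions, v_dim)
        n = regions.size(1)
        q_tiled = q.unsqueeze(1).expand(-1, n, -1)              # (b, n, q_dim)
        logits = self.score(torch.cat([q_tiled, regions], -1))  # (b, n, 1)
        weights = torch.softmax(logits, dim=1)                  # sums to 1 over regions
        return (weights * regions).sum(dim=1)                   # (b, v_dim)

# Usage with assumed sizes: 36 bottom-up regions of dimension 2048.
attn = QuestionGuidedAttention(q_dim=2400, v_dim=2048)
pooled = attn(torch.randn(4, 2400), torch.randn(4, 36, 2048))
print(pooled.shape)  # torch.Size([4, 2048])
```

The abstract's point is that this single weighted-pooling step collapses the scene into one vector and cannot by itself model interactions between regions, which is what the MuRel cell targets.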

Citation types: 1 supporting, 193 mentioning, 0 contrasting
Years published: 2019–2024
Cited by 292 publications (194 citation statements)
References 33 publications
Citation statements (ordered by relevance):

“…Another line of research focuses on implicit relations, where no explicit semantic or spatial relations are used to construct the graph. Instead, all the relations are implicitly captured by an attention module, or via higher-order methods over the fully connected graph of an input image [46,21,6,57], to model the interactions between detected objects. For example, [46] reasons over all possible pairs of objects in an image via simple MLPs.…”
Section: Relational Reasoning
Mentioning confidence: 99%
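The excerpt notes that [46] reasons over all possible object pairs with simple MLPs. As a rough illustration of that pairwise scheme, the sketch below (PyTorch, not code from [46]) concatenates every ordered pair of object features, passes each pair through a shared MLP, and sum-pools the results; the question-conditioning such models usually add is omitted for brevity, and all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class PairwiseRelationModule(nn.Module):
    """Hypothetical sketch of pairwise relational reasoning:
    score every ordered pair of objects with a shared MLP, then sum-pool."""

    def __init__(self, obj_dim: int, hidden_dim: int = 256):
        super().__init__()
        # g: shared MLP applied to each concatenated object pair.
        self.g = nn.Sequential(
            nn.Linear(2 * obj_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )

    def forward(self, objects: torch.Tensor) -> torch.Tensor:
        # objects: (batch, n_objects, obj_dim), e.g. detected region features.
        b, n, d = objects.shape
        o_i = objects.unsqueeze(2).expand(b, n, n, d)  # broadcast row-wise
        o_j = objects.unsqueeze(1).expand(b, n, n, d)  # broadcast column-wise
        pairs = torch.cat([o_i, o_j], dim=-1)          # (b, n, n, 2*obj_dim)
        relations = self.g(pairs)                      # (b, n, n, hidden_dim)
        return relations.sum(dim=(1, 2))               # aggregate over all pairs

# Usage with assumed sizes: 36 detected objects of dimension 2048.
module = PairwiseRelationModule(obj_dim=2048)
print(module(torch.randn(4, 36, 2048)).shape)  # torch.Size([4, 256])
```

The quadratic cost in the number of objects is part of what motivates the attention-based and higher-order alternatives the excerpt also cites.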
“…Fully-supervised Spatio-Temporal Grounding has been developed in combination with other tasks, such as object tracking [32], video captioning [34] and visual question answering [2]. Yang et al. [32] add language descriptions to an object tracking dataset [10] to adapt it for the grounding task and propose an integrated grounding and tracking model.…”
Section: Related Work
Mentioning confidence: 99%
“…"they". Video grounding on untrimmed videos is of great significance while more challenging than trimmed video, since the untrimmed video contains large temporal incoherence caused by camera motion and camera shot cut 2 .…”
Section: Our Approach 31 Problem Setupmentioning
confidence: 99%
“…Typically, VQA is considered a classification task with fixed answer categories [2,4,10,57]. In this case, to obtain optimal joint representations for question answering, some VQA methods focus on exploring the full interactions between the two modalities [6,10,22,23,60-62]. Early developments resorted to bilinear approaches to improve the one-layer fusion of the two modalities, e.g., compact bilinear pooling [10], low-rank bilinear pooling [23] and factorized bilinear pooling [61].…”
Section: Related Work, 2.1 Visual Question Answering
Mentioning confidence: 99%
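To make the bilinear-fusion family named in this excerpt concrete, here is a minimal sketch in the spirit of low-rank bilinear pooling [23]: both modalities are projected into a shared rank-r space, fused by an elementwise product, and re-projected. This is an illustrative PyTorch sketch under assumed names and dimensions, not any paper's reference implementation.

```python
import torch
import torch.nn as nn

class LowRankBilinearFusion(nn.Module):
    """Hypothetical sketch of low-rank bilinear pooling for two modalities."""

    def __init__(self, q_dim: int, v_dim: int, rank: int = 512, out_dim: int = 1024):
        super().__init__()
        self.proj_q = nn.Linear(q_dim, rank)   # question -> shared rank space
        self.proj_v = nn.Linear(v_dim, rank)   # image    -> shared rank space
        self.proj_out = nn.Linear(rank, out_dim)

    def forward(self, q: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # The elementwise (Hadamard) product in the rank space stands in for
        # the full bilinear form q^T W v, whose weight tensor would be
        # q_dim x v_dim x out_dim and far too large to learn directly.
        fused = torch.tanh(self.proj_q(q)) * torch.tanh(self.proj_v(v))
        return self.proj_out(fused)

# Usage with assumed sizes: a 2400-d question embedding and a 2048-d image feature.
fusion = LowRankBilinearFusion(q_dim=2400, v_dim=2048)
print(fusion(torch.randn(8, 2400), torch.randn(8, 2048)).shape)  # torch.Size([8, 1024])
```

Compact bilinear pooling [10] and factorized bilinear pooling [61] differ mainly in how they approximate the same bilinear interaction, not in the overall fuse-then-classify pipeline.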