2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.00857

Explainable and Explicit Visual Reasoning Over Scene Graphs

Abstract: We aim to dismantle the prevalent black-box neural architectures used in complex visual reasoning tasks into the proposed eXplainable and eXplicit Neural Modules (XNMs), which advance beyond existing neural module networks towards using scene graphs (objects as nodes and the pairwise relationships as edges) for explainable and explicit reasoning with structured knowledge. XNMs allow us to pay more attention to teaching machines how to "think", regardless of what they "look" at. As we will show in the paper, by usin…
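The abstract treats the scene graph as the central data structure for reasoning: detected objects are nodes and pairwise relationships are directed, labeled edges. The sketch below is only an illustration of how such a graph might be represented; the class and field names are assumptions, not taken from the paper's released code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Minimal, hypothetical scene-graph containers (names are illustrative assumptions).

@dataclass
class SceneObject:
    obj_id: int
    label: str                      # e.g. "man", "horse"
    bbox: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in image coordinates

@dataclass
class Relationship:
    subj_id: int                    # node id of the subject
    obj_id: int                     # node id of the object
    predicate: str                  # e.g. "riding", "left of"

@dataclass
class SceneGraph:
    objects: List[SceneObject] = field(default_factory=list)
    relations: List[Relationship] = field(default_factory=list)

    def neighbors(self, obj_id: int) -> List[int]:
        """Object ids reachable from obj_id via one outgoing edge."""
        return [r.obj_id for r in self.relations if r.subj_id == obj_id]

# Example: "a man riding a horse"
g = SceneGraph(
    objects=[SceneObject(0, "man", (10, 20, 110, 220)),
             SceneObject(1, "horse", (80, 60, 300, 260))],
    relations=[Relationship(0, 1, "riding")],
)
print(g.neighbors(0))  # -> [1]
```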

Cited by 213 publications (127 citation statements). References 27 publications.

Citation statements (ordered by relevance):
“…Neural Module Networks. Recently, the idea of decomposing a network into neural modules has become popular in vision-language tasks such as VQA [3,15], visual grounding [29,46], and visual reasoning [37]. In these tasks, high-quality module layouts can be obtained by parsing the provided sentences, such as the questions in VQA.…”
Section: Related Work (mentioning)
confidence: 99%
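This statement summarizes the core idea of neural module networks: the question is parsed into a layout of small modules that are executed in sequence over the visual representation. The toy modules and hand-written layout below are illustrative assumptions (not the modules defined in the XNM paper or in [3,15]); they only sketch how a parsed layout might drive execution over a scene graph.

```python
# Toy sketch: a parsed module layout executed over a tiny scene graph.
# The graph encoding, module names, and layout are illustrative assumptions.

objects = {0: "man", 1: "horse", 2: "helmet"}          # id -> label
relations = [(0, "riding", 1), (0, "wearing", 2)]      # (subject, predicate, object)

def attend_node(label):
    """Attend to all objects whose label matches."""
    return [i for i, lab in objects.items() if lab == label]

def relate(node_ids, predicate):
    """Shift attention along edges carrying the given predicate."""
    return [o for s, p, o in relations if s in node_ids and p == predicate]

def count(node_ids):
    """Count the currently attended objects."""
    return len(node_ids)

# Hand-written layout standing in for one produced by parsing the question
# "What is the man riding?"
attended = attend_node("man")
attended = relate(attended, "riding")
print([objects[i] for i in attended])   # -> ['horse']
print(count(attended))                  # -> 1
```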
“…being resolved to establish a robust cross-modal connection between them. Indeed, image captioning is not the only task whose models can easily exploit dataset bias (captioning even without looking at the image); almost all existing models for vision-language tasks such as visual Q&A [18,8,37] have been observed to collapse to certain dataset idiosyncrasies and fail to reproduce the diversity of our world. The more complex the task, the more severe the collapse, as in image paragraph generation [22] and visual dialog [5]. For example, in the MS-COCO [27] training set, since the co-occurrence chance of "man" and "standing" is as large as 11%, a state-of-the-art captioner [2] is very likely to genera…”
[Figure 2 of the citing paper contrasts MS-COCO co-occurrence rates such as "sheep+field"/"sheep": 28% vs. "sheep+grassy hill"/"sheep": 1.3%; "dog+hat"/"dog": 1.9% vs. "dog+santa hat"/"dog": 0.13%; "man+standing"/"man": 11% vs. "man+milking"/"man": 0.023%; "hydrant+sitting"/"hydrant": 14% vs. "hydrant+spewing"/"hydrant": 0.61%. Caption: "By comparing our CNM with a non-module baseline (an upgraded version of Up-Down [2]), we have three interesting findings in tackling the dataset bias: (a) more accurate grammar."]
Section: Introduction (mentioning)
confidence: 99%
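The bias numbers quoted above are simple conditional co-occurrence ratios over training captions, e.g. P("standing" appears | "man" appears) ≈ 11% in MS-COCO. The snippet below is a minimal sketch, with made-up toy captions, of how such ratios could be computed; it is not the citing paper's code, and the helper name is an assumption.

```python
# Toy captions; in the citing paper these statistics come from MS-COCO training captions.
captions = [
    "a man standing on a beach",
    "a man standing next to a horse",
    "a man milking a cow",
    "a man riding a bike",
]

def cooccurrence_ratio(captions, head, modifier):
    """Fraction of captions containing `head` that also contain `modifier`."""
    with_head = [c for c in captions if head in c.split()]
    with_both = [c for c in with_head if modifier in c.split()]
    return len(with_both) / max(len(with_head), 1)

print(cooccurrence_ratio(captions, "man", "standing"))  # 0.5 on this toy set
print(cooccurrence_ratio(captions, "man", "milking"))   # 0.25 on this toy set
```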
“…However, one of their main uses is reasoning about the scene, as they provide a structured representation of the image content. Among these works, [55] uses scene graphs for explainable and explicit reasoning with structured knowledge. Aditya et al. [56] use a directed and labeled scene description graph for reasoning in image captioning, retrieval, and visual question answering applications.…”
Section: Related Work (mentioning)
confidence: 99%
“…(a), the nodes and edges in scene graphs are objects and visual relationships, respectively. Moreover, the scene graph is an indispensable knowledge representation for many high-level vision tasks such as image captioning [69,66,68,24], visual reasoning [53,14], and VQA [42,19]. A straightforward solution for Scene Graph Generation (SGG) proceeds in an independent fashion: detect object bounding boxes with an existing object detector, then predict the object classes and their pairwise relationships separately [37,74,67,52].…”
Section: Introduction (mentioning)
confidence: 99%
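The last statement describes the standard two-stage SGG recipe: run an off-the-shelf detector, then classify each object pair's relationship independently. The sketch below uses torchvision's pretrained Faster R-CNN as the stage-one detector, which is an assumption for illustration (not the detector used in the cited works), and the predicate classifier is a placeholder stub rather than a real model.

```python
import torch
import torchvision

# Stage 1: off-the-shelf object detector.
# weights="DEFAULT" downloads COCO-pretrained weights (torchvision >= 0.13).
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

image = torch.rand(3, 480, 640)           # stand-in for a real image tensor
with torch.no_grad():
    det = detector([image])[0]            # dict with 'boxes', 'labels', 'scores'

# Stage 2 (stub): score every ordered object pair independently.
def predict_predicate(box_subj, box_obj):
    """Hypothetical relationship classifier; a real SGG model would use
    appearance and spatial features of the pair here."""
    return "near"                         # placeholder prediction

boxes = det["boxes"]
pairs = [(i, j) for i in range(len(boxes)) for j in range(len(boxes)) if i != j]
# Triples of (subject COCO label id, predicate, object COCO label id);
# mapping label ids to names is omitted for brevity.
triples = [(int(det["labels"][i]), predict_predicate(boxes[i], boxes[j]),
            int(det["labels"][j])) for i, j in pairs]
print(triples[:5])
```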