Explainable and Explicit Visual Reasoning Over Scene Graphs

Shi, Jiaxin; Zhang, Hanwang; Li, Juanzi

doi:10.1109/cvpr.2019.00857

Cited by 213 publications

(127 citation statements)

References 27 publications

Supporting

Mentioning

119

Contrasting


“…Neural Module Networks. Recently, the idea of decomposing the network into neural modules is popular in some vision-language tasks such as VQA [3,15], visual grounding [29,46], and visual reasoning [37]. In these tasks, high-quality module layout can be obtained by parsing the provided sentences like questions in VQA.…”

Section: Related Work (mentioning)

confidence: 99%

“…being resolved to establish a robust cross-modal connection between them. Indeed, image captioning is not the only model that can easily exploit the dataset bias to captioning even without looking at the image, almost all existing models for vision-language tasks such as visual Q&A [18,8,37] have been spotted mode collapse to certain dataset idiosyncrasies, failed to reproduce the diversity of our world - the more complex the task is, the more severe the collapse will be, such as image paragraph generation [22] and visual dialog [5]. For example, in MS-COCO [27] training set, as the co-occurrence chance of "man" and "standing" is 11% large, a state-of-the-art captioner [2] is very likely to genera-

[Figure 2 panel, co-occurrence statistics: "sheep+grassy hill" / "sheep": 1.3%; "sheep+field" / "sheep": 28%; "dog+santa hat" / "dog": 0.13%; "dog+hat" / "dog": 1.9%; "man+milking" / "man": 0.023%; "man+standing" / "man": 11%; "hydrant+spewing" / "hydrant": 0.61%; "hydrant+sitting" / "hydrant": 14%]

Figure 2: By comparing our CNM with a non-module baseline (an upgraded version of Up-Down [2]), we have three interesting findings in tackling the dataset bias: (a) more accurate grammar.…”

Section: Introduction (mentioning)

confidence: 99%


Learning to Collocate Neural Modules for Image Captioning

Zhang

Cai

2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

Self Cite


We do not speak word by word from scratch; our brain quickly structures a pattern like STH DO STH AT SOME-PLACE and then fill in the detailed descriptions. To render existing encoder-decoder image captioners such humanlike reasoning, we propose a novel framework: learning to Collocate Neural Modules (CNM), to generate the "inner pattern" connecting visual encoder and language decoder. Unlike the widely-used neural module networks in visual Q&A, where the language (i.e., question) is fully observable, CNM for captioning is more challenging as the language is being generated and thus is partially observable. To this end, we make the following technical contributions for CNM training: 1) compact module design: one for function words and three for visual content words (e.g., noun, adjective, and verb), 2) soft module fusion and multistep module execution, robustifying the visual reasoning in partial observation, 3) a linguistic loss for module controller being faithful to part-of-speech collocations (e.g., adjective is before noun). Extensive experiments on the challenging MS-COCO image captioning benchmark validate the effectiveness of our CNM image captioner. In particular, CNM achieves a new state-of-the-art 127.9 CIDEr-D on Karpathy split and a single-model 126.0 c40 on the official server. CNM is also robust to few training samples, e.g., by training only one sentence per image, CNM can halve the performance loss compared to a strong baseline.

[Figure panel (c): The caption generation process of CNM.]


“…However, one of their main usages is reasoning about the scene, as they outline a structured representation of the image content. Among these works, [55] uses scene graphs for explainable and explicit reasoning with structured knowledge. Aditya et al. [56] use directed and labeled scene description graph for reasoning in image captioning, retrieval, and visual question answering applications.…”

Section: Related Work (mentioning)

confidence: 99%

Spatiotemporal Relationship Reasoning for Pedestrian Intent Prediction

Liu

Adeli

Cao

et al. 2020

IEEE Robot. Autom. Lett.


Reasoning over visual data is a desirable capability for robotics and vision-based applications. Such reasoning enables forecasting the next events or actions in videos. In recent years, various models have been developed based on convolution operations for prediction or forecasting, but they lack the ability to reason over spatiotemporal data and infer the relationships of different objects in the scene. In this paper, we present a framework based on graph convolution to uncover the spatiotemporal relationships in the scene for reasoning about pedestrian intent. A scene graph is built on top of segmented object instances within and across video frames. Pedestrian intent, defined as the future action of crossing or not-crossing the street, is a very crucial piece of information for autonomous vehicles to navigate safely and more smoothly. We approach the problem of intent prediction from two different perspectives and anticipate the intention-to-cross within both pedestrian-centric and location-centric scenarios. In addition, we introduce a new dataset designed specifically for autonomous-driving scenarios in areas with dense pedestrian populations: the Stanford-TRI Intent Prediction (STIP) dataset. Our experiments on STIP and another benchmark dataset show that our graph modeling framework is able to predict the intention-to-cross of the pedestrians with an accuracy of 79.10% on STIP and 79.28% on Joint Attention for Autonomous Driving (JAAD) dataset up to one second earlier than when the actual crossing happens. These results outperform baseline and previous work. Please refer to http://stip.stanford.edu/ for the dataset and code.

Index Terms: spatiotemporal graphs, forecasting, graph neural networks, autonomous-driving.

Recent work [19]-[23] introduced pedestrian intent prediction and have typically tackled the problem by observing pedestrian-specific features such as location, velocity, and


“…(a), the nodes and edges in scene graphs are objects and visual relationships, respectively. Moreover, scene graph is an indispensable knowledge representation for many high-level vision tasks such as image captioning [69,66,68,24], visual reasoning [53,14], and VQA [42,19]. A straightforward solution for Scene Graph Generation (SGG) is in an independent fashion: detecting object bounding boxes by an existing object detector, and then predicting the object classes and their pairwise relationships separately [37,74,67,52].…”

Section: Introduction (mentioning)

confidence: 99%

Counterfactual Critic Multi-Agent Training for Scene Graph Generation

Chen

Zhang

Xiao

et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

Self Cite


Scene graphs (objects as nodes and visual relationships as edges) describe the whereabouts and interactions of objects in an image for comprehensive scene understanding. To generate coherent scene graphs, almost all existing methods exploit the fruitful visual context by modeling message passing among objects. For example, "person" on "bike" can help to determine the relationship "ride", which in turn contributes to the confidence of the two objects. However, we argue that the visual context is not properly learned by using the prevailing cross-entropy based supervised learning paradigm, which is not sensitive to graph inconsistency: errors at the hub or non-hub nodes should not be penalized equally. To this end, we propose a Counterfactual critic Multi-Agent Training (CMAT) approach. CMAT is a multi-agent policy gradient method that frames objects into cooperative agents, and then directly maximizes a graph-level metric as the reward. In particular, to assign the reward properly to each agent, CMAT uses a counterfactual baseline that disentangles the agent-specific reward by fixing the predictions of other agents. Extensive validations on the challenging Visual Genome benchmark show that CMAT achieves a state-of-the-art performance by significant gains under various settings and metrics.

