DRG: Dual Relation Graph for Human-Object Interaction Detection

Gao, Chen; Xu, Jiarui; Zou, Yuliang; Huang, Jia-Bin

doi:10.1007/978-3-030-58610-2_41

Cited by 182 publications

(284 citation statements)

References 48 publications

Supporting

Mentioning

282

Contrasting

“…Most recent research has focused on two types of visual information: appearance features and spatial relationships. In particular, Gao et al. [21] proposed a human-centric attention module for learning to highlight informative regions. Zhang et al. [22] proposed a spatially conditioned graph neural network to compute the messages of nodes and graph features for predicting the interactions.…”

Section: Human-object Interaction Detection (mentioning)

confidence: 99%

“…Inspired by prior work [21], we create multiple pairs of binary images that represent the bounding boxes of the primary agent and the objects, to exploit the spatial relationship between them. The first image of each pair always shows the location of the main target and the second shows the surrounding detected objects.…”

Section: Spatial Relation Module (mentioning)

confidence: 99%
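
As a rough illustration of the binary-image pairs described in the statement above, the following sketch rasterizes an agent box and each object box onto a small grid. The grid size, the half-open `(x1, y1, x2, y2)` box convention, and the function names `box_mask` and `pair_masks` are assumptions made for this example; the citing paper's exact representation is not given in the excerpt.

```python
def box_mask(box, width, height):
    """Return a height x width grid of 0/1, with 1 inside box (x1, y1, x2, y2).

    Assumption: boxes are in pixel units and half-open on the right/bottom edge.
    """
    x1, y1, x2, y2 = box
    return [
        [1 if x1 <= x < x2 and y1 <= y < y2 else 0 for x in range(width)]
        for y in range(height)
    ]


def pair_masks(agent_box, object_boxes, width, height):
    """Build one (agent_mask, object_mask) pair per detected object.

    The first mask in each pair is always the main target, the second
    is one surrounding object, mirroring the ordering in the quoted text.
    """
    agent = box_mask(agent_box, width, height)
    return [(agent, box_mask(b, width, height)) for b in object_boxes]


if __name__ == "__main__":
    pairs = pair_masks((0, 0, 2, 2), [(1, 1, 3, 3), (2, 0, 4, 1)], 4, 4)
    for agent_mask, object_mask in pairs:
        print(agent_mask, object_mask)
```

Each pair could then be stacked as a two-channel input to a spatial branch, although how the masks are consumed downstream is not specified in the excerpt.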

Context-Aware Emotion Recognition Based on Visual Relationship Detection

et al. 2021

Emotion recognition, which is a part of affective computing, draws a lot of attention from researchers because of its broad applications. Unlike previous approaches with the aim to recognize humans' emotional state using facial expression, speech or gesture, some researchers see the potential of the contextual information from the scene. Hence, in addition to the employment of the main subject, the general background data is also considered as the complementary cues for emotion prediction. However, most of the existing works still have some limitations in deeply exploring the scene-level context. In this paper, to fully exploit the essences of context, we propose the emotional state prediction method based on visual relationship detection between the main target and the adjacent objects from the background. Specifically, we utilize both the spatial and semantic features of objects in the scene to calculate the influences of all context-related elements and their properties of impact (positive, negative, or neutral) on the main subject by a modified attention mechanism. After that, the model incorporates those features with scene context and body features of the target person to predict their emotional states. Our experimental results achieve state-of-the-art performance on the CAER-S dataset and competitive results on the EMOTIC benchmark.

“…In HOI detection, one must additionally predict the locations of both the Subject and Object boxes correctly, i.e., each box must have an overlap larger than 50% with its corresponding ground-truth box. To date, a body of work has approached the HOI detection problem [41], [42], [43], [44], [45], [46], [47], [48], [49], [50], [51], [52], [53], [54], [55]. Several of these works do not explicitly integrate spatial information with regard to position, size, or layout of the involved human and objects (e.g., [53]), or integrate this information or part of it in a non-transparent way in the neural network (e.g., [49], [46], [50]).…”

Section: Human Object Interaction (HOI) Detection (mentioning)

confidence: 99%
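
The localization criterion in the statement above can be sketched as follows, assuming "overlap" means intersection-over-union (IoU), which is the usual reading but is not spelled out in the excerpt. The function names `iou` and `hoi_boxes_match` and the `(x1, y1, x2, y2)` box format are hypothetical choices for this example.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the intersection rectangle, clamped at zero.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def hoi_boxes_match(pred_subj, pred_obj, gt_subj, gt_obj, thr=0.5):
    """True only if BOTH predicted boxes overlap their ground truth by more than thr."""
    return iou(pred_subj, gt_subj) > thr and iou(pred_obj, gt_obj) > thr


if __name__ == "__main__":
    gt_subj, gt_obj = (0, 0, 2, 2), (5, 5, 7, 7)
    print(hoi_boxes_match((0, 0, 2, 2), (5, 5, 7, 7), gt_subj, gt_obj))
    print(hoi_boxes_match((0, 0, 2, 2), (6, 5, 8, 7), gt_subj, gt_obj))
```

A full HOI evaluation would also require the predicted interaction label to be correct; this sketch covers only the box-overlap part of the criterion.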

Probing Spatial Clues: Canonical Spatial Templates for Object Relationship Understanding

2021

Humans often leverage spatial clues to categorize scenes in a fraction of a second. This form of intelligence is very relevant in time-critical situations (e.g., when driving a car) and valuable to transfer to automated systems. This work investigates the predictive power of solely processing spatial clues for scene understanding in 2D images and compares such an approach with the predictive power of visual appearance. To this end, we design the laboratory task of predicting the identity of two objects (e.g., "man" and "horse") and their relationship or predicate (e.g., "riding") given exclusively the ground truth bounding box coordinates of both objects. We also measure the performance attainable in Human Object Interaction (HOI) detection, a real-world spatial task, which includes a setting where ground truth boxes are not available at test time. An additional goal is to identify the principles necessary to effectively represent a spatial template, that is, the visual region in which two objects involved in a relationship expressed by a predicate occur. We propose a scale-, mirror-, and translation-invariant representation that captures the spatial essence of the relationship, i.e., a canonical spatial representation. Tests in two benchmarks reveal: (1) High performance is attainable by using exclusively spatial information in all tasks. (2) In HOI detection, the canonical template outperforms the rest of spatial, visual, and several state-of-the-art baselines. (3) Simple fusion of visual and spatial features substantially improves performance. (4) Our methods fare remarkably well with a small amount of data and rare categories. Our results obtained on the Visual Genome (VG) and the Humans Interacting with Common Objects - Detection (HICO-DET) datasets indicate that great predictive power can be obtained from spatial clues alone, opening up possibilities for performing fast scene understanding at a glance.

INDEX TERMS: Spatial understanding, spatial layout, computer vision, vision and scene understanding.

I. INTRODUCTION

A well-researched concept in cognitive science is the gist, or the initial representation of a scene obtained in a brief glance. The gist may include semantic content (e.g., "is a classroom"), the identity of a few objects (e.g., "there are books"), and the spatial layout [1]. Humans can categorize scenes in a fraction of a second (∼13-250 ms) [1], [2]. Generally, more detailed scenes and finer-grained…

“…The objective of Human Object Interaction (HOI) detection is to locate humans and objects and to recognise their interactions. Previous studies [32][33][34][35][36][37] show promising results for HOI sensing by decoupling it into the detection and classification of objects. In particular, human and object detection results first come from a pre-trained object detector, and human-object proposal pairs are then formed for interaction classification.…”

Section: Human Object Interaction (mentioning)

confidence: 99%

Study on Temperature Variance for SimCLR based Activity Recognition

Kumar

2021

Preprint

Human Activity Recognition (HAR) is a process to automatically detect human activities based on stream data generated from various sensors, including inertial sensors, physiological sensors, location sensors, cameras, time, and many others. In this paper, we propose a robust SimCLR model for human activity recognition with a temperature variance study. In this work, SimCLR, a contrastive learning technique for visual representations, is optimized by regulating the temperature and incorporated to improve HAR performance in healthcare.
