Spatio-Temporal Interaction Graph Parsing Networks for Human-Object Interaction Recognition

Wang, Ning; Zhu, Guangming; Zhang, Liang; Shen, Peiyi; Li, Hongsheng; Cong, Hua

doi:10.1145/3474085.3475636

Cited by 24 publications

(9 citation statements)

References 43 publications


“…Instead of RNNs, Qi et al (2018) propose a Graph Parsing Network (GPN) to parse the spatio-temporal graphs of human-object interactions. Then, Wang et al (2021) design a two-stream GPN that also incorporates the semantic features. In contrast to the graph-based methods, Sun et al (2021) propose an instance-based architecture to separately reason each human-object pair instance.…”

Section: Video-based HOI Detection (mentioning)

confidence: 99%

“…While the image-based HOI detectors show great performance on image datasets, they may perform poorly on video datasets because they cannot exploit the temporal cues required to distinguish between some continuous interactions, such as open or close a door (Fouhey et al, 2018). Hence, a few works (Qi et al, 2018; Chiou et al, 2021; Cong et al, 2021; Ji et al, 2021; Wang et al, 2021; Tu et al, 2022b) are proposed to leverage the temporal dependencies between frames and demonstrate superior performance to the image-based methods. However, these approaches do not consider the human gaze as an additional feature while it often provides valuable information about human intentions (Johansson et al, 2001; Land and Hayhoe, 2001; Hayhoe et al, 2003; Baldauf and Deubel, 2010; Belardinelli et al, 2016).…”

Section: Introduction (mentioning)

confidence: 99%


Human–object interaction prediction in videos through gaze following

Mascaró, Ahn, et al. 2023. Computer Vision and Image Understanding


“…Without considering temporal information, these methods fail to detect time-related interactions, restricting their value in practical applications. In contrast, video-based HOI detection is a more practical problem, which however is less explored [35,33,34,36,4,42,17]. [35,36,42] detected HOIs in videos by building graph neural networks to capture spatiotemporal information.…”

Section: Related Work (mentioning)

confidence: 99%

“…In contrast, video-based HOI detection is a more practical problem, which however is less explored [35,33,34,36,4,42,17]. [35,36,42] detected HOIs in videos by building graph neural networks to capture spatiotemporal information. In [33], HOI "hotspots" can be directly learned from videos by jointly training a video-based action recognition network as well as an anticipation model.…”

Section: Related Work (mentioning)

confidence: 99%

Video-based Human-Object Interaction Detection from Tubelet Tokens

Tu, Sun, Min, et al. 2022. Preprint


We present a novel vision Transformer, named TUTOR, which learns tubelet tokens that serve as highly abstracted spatiotemporal representations for video-based human-object interaction (V-HOI) detection. The tubelet tokens structurize videos by agglomerating and linking semantically related patch tokens along the spatial and temporal domains, which brings two benefits: 1) Compactness: each tubelet token is learned by a selective attention mechanism to reduce redundant spatial dependencies from others; 2) Expressiveness: each tubelet token is able to align with a semantic instance, i.e., an object or a human, across frames, thanks to agglomeration and linking. The effectiveness and efficiency of TUTOR are verified by extensive experiments. Results show that our method outperforms existing works by large margins, with a relative mAP gain of 16.14% on VidHOI and a 2-point gain on CAD-120, as well as a 4× speedup.


“…Dabral et al [6] analyze the effectiveness of GCNs against Convolutional Networks and Capsule Networks for spatial relation learning. Wang et al [53] propose the STIGPN exploiting the parsed graphs to learn spatiotemporal connection development and discover objects existing in a scene. Although previous methods attain impressive improvements in specific tasks, they are all based on visual features, which are unreliable in real-life HOI activities that contain occlusions between human and object entities.…”

Section: HOI Recognition in Videos (mentioning)

confidence: 99%

Geometric Features Informed Multi-person Human-object Interaction Recognition in Videos

Qiao, Men, Li, et al. 2022. Preprint


Human-Object Interaction (HOI) recognition in videos is important for analyzing human activity. Most existing work focuses on visual features and usually suffers from occlusion in real-world scenarios. The problem is further complicated when multiple people and objects are involved in HOIs. Considering that geometric features such as human pose and object position provide meaningful information for understanding HOIs, we argue for combining the benefits of both visual and geometric features in HOI recognition, and propose a novel Two-level Geometric feature-informed Graph Convolutional Network (2G-GCN). The geometric-level graph models the interdependency between geometric features of humans and objects, while the fusion-level graph further fuses them with visual features of humans and objects. To demonstrate the novelty and effectiveness of our method in challenging scenarios, we propose a new multi-person HOI dataset (MPHOI-72). Extensive experiments on the MPHOI-72 (multi-person HOI), CAD-120 (single-human HOI) and Bimanual Actions (two-hand HOI) datasets demonstrate our superior performance compared to state-of-the-art methods.
