Proceedings of the 29th ACM International Conference on Multimedia 2021
DOI: 10.1145/3474085.3475636
Spatio-Temporal Interaction Graph Parsing Networks for Human-Object Interaction Recognition

Cited by 24 publications (9 citation statements)
References 43 publications
“…Instead of RNNs, Qi et al. (2018) propose a Graph Parsing Network (GPN) to parse the spatio-temporal graphs of human-object interactions. Wang et al. (2021) then design a two-stream GPN that also incorporates semantic features. In contrast to these graph-based methods, Sun et al. (2021) propose an instance-based architecture that reasons over each human-object pair instance separately.…”
Section: Video-based HOI Detection
confidence: 99%
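The GPN-style pipeline described in this snippet can be sketched in a few lines: parse a (soft) interaction graph over human and object nodes in each frame, run spatial message passing, then aggregate over time. This is a minimal illustration, not the papers' exact formulation; the similarity-based graph parser and mean temporal pooling here are simplifying assumptions.

```python
import numpy as np

def parse_graph(node_feats, temperature=1.0):
    """Soft adjacency from pairwise feature similarity -- a stand-in for
    the learned graph-parsing step in GPN-style models (assumption)."""
    sim = node_feats @ node_feats.T / temperature
    np.fill_diagonal(sim, -np.inf)  # no self-loops
    e = np.exp(sim - sim.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)  # row-normalised edge weights

def message_pass(node_feats, adj):
    """One round of neighbourhood aggregation with a residual connection."""
    return node_feats + adj @ node_feats

def spatio_temporal_reason(frames):
    """frames: list of (num_nodes, dim) arrays, one per video frame.
    Spatial message passing per frame, then mean-pool over time."""
    per_frame = [message_pass(f, parse_graph(f)) for f in frames]
    return np.mean(per_frame, axis=0)  # video-level node features

# Toy example: 1 human + 2 object nodes, 4-d features, 5 frames.
rng = np.random.default_rng(0)
frames = [rng.normal(size=(3, 4)) for _ in range(5)]
video_feats = spatio_temporal_reason(frames)
print(video_feats.shape)  # (3, 4)
```

In the actual models the adjacency is predicted by a learned parser and the temporal aggregation uses recurrent or attention modules rather than a mean, but the parse-then-propagate structure is the same.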
“…While image-based HOI detectors show strong performance on image datasets, they may perform poorly on video datasets because they cannot exploit the temporal cues needed to distinguish between some continuous interactions, such as opening or closing a door (Fouhey et al., 2018). Hence, a few works (Qi et al., 2018; Chiou et al., 2021; Cong et al., 2021; Ji et al., 2021; Wang et al., 2021; Tu et al., 2022b) leverage the temporal dependencies between frames and demonstrate superior performance to the image-based methods. However, these approaches do not consider the human gaze as an additional feature, even though it often provides valuable information about human intentions (Johansson et al., 2001; Land and Hayhoe, 2001; Hayhoe et al., 2003; Baldauf and Deubel, 2010; Belardinelli et al., 2016).…”
Section: Introduction
confidence: 99%
“…Without considering temporal information, these methods fail to detect time-related interactions, restricting their value in practical applications. In contrast, video-based HOI detection is a more practical problem that is, however, less explored [35, 33, 34, 36, 4, 42, 17]. [35, 36, 42] detect HOIs in videos by building graph neural networks to capture spatio-temporal information.…”
Section: Related Work
confidence: 99%
“…In contrast, video-based HOI detection is a more practical problem that is, however, less explored [35, 33, 34, 36, 4, 42, 17]. [35, 36, 42] detect HOIs in videos by building graph neural networks to capture spatio-temporal information. In [33], HOI "hotspots" are learned directly from videos by jointly training a video-based action recognition network and an anticipation model.…”
Section: Related Work
confidence: 99%
“…Dabral et al. [6] analyze the effectiveness of GCNs against Convolutional Networks and Capsule Networks for spatial relation learning. Wang et al. [53] propose STIGPN, which exploits parsed graphs to model the evolution of spatio-temporal connections and to discover the objects present in a scene. Although previous methods attain impressive improvements on specific tasks, they are all based on visual features, which are unreliable in real-life HOI activities that contain occlusions between human and object entities.…”
Section: HOI Recognition in Videos
confidence: 99%