Proceedings of the 28th ACM International Conference on Multimedia 2020
DOI: 10.1145/3394171.3413778

LIGHTEN: Learning Interactions with Graph and Hierarchical TEmporal Networks for HOI in videos

Abstract: Analyzing the interactions between humans and objects in a video involves identifying the relationships between the humans and the objects present. The task can be viewed as a specialized form of Visual Relationship Detection in which one of the participating objects must be a human. While traditional methods formulate the problem as inference over a sequence of video segments, we present a hierarchical approach, LIGHTEN, to learn visual features that effectively capture spatio-temporal cues at multiple granularities…
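To make the hierarchical idea concrete, here is a minimal sketch of a two-level recurrent model that summarizes frames within each segment and then fuses the segment summaries across the video. All module names, dimensions, and the choice of GRUs are illustrative assumptions, not LIGHTEN's actual architecture.

```python
# Minimal sketch of a hierarchical spatio-temporal model in the spirit of
# LIGHTEN: a frame-level RNN summarizes each video segment, and a
# segment-level RNN fuses those summaries across the video. All module
# names, dimensions, and the choice of GRUs are illustrative assumptions,
# not the paper's actual architecture.
import torch
import torch.nn as nn

class HierarchicalHOIModel(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=256, num_classes=10):
        super().__init__()
        # Frame level: captures short-range temporal cues within a segment.
        self.frame_rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        # Segment level: fuses cues across segments at a coarser granularity.
        self.segment_rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_segments, frames_per_segment, feat_dim)
        b, s, f, d = frame_feats.shape
        # Run the frame-level GRU over each segment independently.
        _, seg_summary = self.frame_rnn(frame_feats.reshape(b * s, f, d))
        seg_feats = seg_summary[-1].view(b, s, -1)  # one vector per segment
        # Fuse inter-segment cues with the segment-level GRU.
        seg_out, _ = self.segment_rnn(seg_feats)
        return self.classifier(seg_out)             # per-segment HOI logits

# Example: 2 videos, 4 segments of 8 frames each, 512-dim visual features.
logits = HierarchicalHOIModel()(torch.randn(2, 4, 8, 512))
print(logits.shape)  # torch.Size([2, 4, 10])
```

Per the abstract, the actual model learns the per-frame visual features rather than taking them as given; the random tensor here is only a stand-in for those features.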



Cited by 25 publications (5 citation statements). References 48 publications.
“…In the experiments, the CAD-120 dataset [1] was used. LIGHTEN [15] and ASSIGN [17] achieve state-of-the-art results on sub-activity and affordance detection, respectively; therefore, we use these two networks as our baselines.…”
Section: Methods
confidence: 99%

“…The network named LIGHTEN uses a hierarchical RNN structure to fuse inter-frame and inter-segment features [15]. Joanna et al. [16] proposed a compositional action recognition method that uses a Multi-Layer Perceptron to fuse spatial and temporal features. Considering the asynchrony and sparsity of HOIs, Morais et al. [17] proposed a hierarchical recurrent spatial-temporal graph network (ASSIGN) to automatically detect the structure of interactions.…”
Section: Introduction
confidence: 99%

“…Truong and Yoshitaka (2017) refine the S-RNN by additionally considering object-object relations. Sunkesula et al. (2020) further improve model performance by using learned visual features as the graph nodes. Instead of RNNs, Qi et al. (2018) propose a Graph Parsing Network (GPN) to parse the spatio-temporal graphs of human-object interactions.…”
Section: Video-based HOI Detection
confidence: 99%

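The idea of using learned visual features as graph nodes, mentioned in the statement above, can be illustrated with a single message-passing step over a per-frame human-object graph. This is a generic sketch under assumed shapes, not the GPN, S-RNN, or LIGHTEN formulation.

```python
# Minimal sketch of one message-passing step over a per-frame human-object
# graph, where learned visual features serve as node embeddings. All names
# and dimensions are illustrative assumptions, not a specific paper's API.
import torch
import torch.nn as nn

class GraphMessagePass(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)     # message from a node pair
        self.update = nn.Linear(2 * dim, dim)  # node update from aggregate

    def forward(self, nodes, adj):
        # nodes: (n, dim) visual features for human and object nodes
        # adj:   (n, n) 0/1 adjacency (e.g., fully connected, no self-loops)
        n, d = nodes.shape
        # Build all ordered node pairs (receiver i, sender j).
        pairs = torch.cat(
            [nodes.unsqueeze(1).expand(n, n, d),
             nodes.unsqueeze(0).expand(n, n, d)], dim=-1)
        # Mask messages to existing edges, then aggregate per receiver.
        messages = torch.relu(self.msg(pairs)) * adj.unsqueeze(-1)
        agg = messages.sum(dim=1)
        return torch.relu(self.update(torch.cat([nodes, agg], dim=-1)))

# Example: 1 human + 2 object nodes, fully connected without self-loops.
feats = torch.randn(3, 512)
adj = torch.ones(3, 3) - torch.eye(3)
refined = GraphMessagePass()(feats, adj)
print(refined.shape)  # torch.Size([3, 512])
```

Refined node embeddings of this kind could then feed a temporal model such as the hierarchical sketch shown after the abstract.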
“…The ability to anticipate subsequent HOIs is beneficial for task planning and danger avoidance. Nevertheless, only a few studies address the HOI anticipation task from the third-person view (Jain et al., 2016; Jiyang Gao and Nevatia, 2017; Truong and Yoshitaka, 2017; Sunkesula et al., 2020), and these works are conducted on small-scale datasets and do not generalize to real-world applications.…”
Section: Introduction
confidence: 99%

“…The field of HOI detection has been further enriched by recent advances in the neural analysis of video content, with the application of graph neural networks becoming increasingly compelling: approaches have evolved from initial generic formulations to nuanced analyses that account for specific body parts and poses, using graph models and attention mechanisms to improve accuracy. In recent years, the research community has introduced innovative frameworks to address multifaceted challenges: modeling interactions at different levels of granularity [35], [36], overcoming label skew [37], and effectively integrating multimodal data [38]. These innovations have significantly advanced application areas such as scene understanding and VQA.…”
Section: Introduction
confidence: 99%