2022
DOI: 10.1609/aaai.v36i3.20229

Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics

Abstract: Human-Object Interaction (HOI) detection is an essential task to understand human-centric images from a fine-grained perspective. Although end-to-end HOI detection models thrive, their paradigm of parallel human/object detection and verb class prediction loses two-stage methods' merit: object-guided hierarchy. The object in one HOI triplet gives direct clues to the verb to be predicted. In this paper, we aim to boost end-to-end models with object-guided statistical priors. Specifically, we propose to utilize a…
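The abstract is truncated before the method details, so the paper's exact calibration mechanism is not shown here. As a rough, hedged illustration of the stated idea of object-guided statistical priors for verb prediction, the sketch below estimates a verb-given-object co-occurrence prior from training triplets and uses it to rescale the verb logits of an end-to-end detector; all names and the weighting scheme are assumptions, not the paper's implementation.

```python
# Illustrative sketch only: one plausible form of an object-guided statistical
# prior, i.e. a verb-given-object co-occurrence matrix estimated from training
# annotations and used to bias the verb logits of an end-to-end HOI decoder.
import torch

def build_verb_prior(triplets, num_objects, num_verbs, smoothing=1.0):
    """Estimate P(verb | object) from (object_id, verb_id) training pairs."""
    counts = torch.full((num_objects, num_verbs), smoothing)
    for obj_id, verb_id in triplets:
        counts[obj_id, verb_id] += 1.0
    return counts / counts.sum(dim=1, keepdim=True)   # each row sums to 1

def calibrate_verb_logits(verb_logits, obj_probs, verb_prior, alpha=1.0):
    """
    verb_logits: (num_queries, num_verbs) raw verb scores from the decoder
    obj_probs:   (num_queries, num_objects) predicted object distribution
    verb_prior:  (num_objects, num_verbs) statistical prior P(verb | object)
    """
    # Expected prior under the predicted object distribution, per query.
    prior = obj_probs @ verb_prior                     # (num_queries, num_verbs)
    # Bias the scores toward verbs that co-occur with the detected object.
    return verb_logits + alpha * torch.log(prior + 1e-8)
```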

Cited by 23 publications (3 citation statements)
References 52 publications
“…To enhance this representation, we leverage the CLIP model (Radford et al 2021) to transfer its vision-language knowledge to the interaction feature. This is different from other methods (Liao et al 2022; Yuan et al 2022a; Ning et al 2023; Wan et al 2023), which adopt CLIP to distill the HOI classification in the final stage for few-shot detection. Experimental results on public datasets show the superiority of our proposed model.…”
Section: Extract Multi-modal Features Based
Confidence: 75%
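As a rough illustration of the quoted idea of transferring CLIP's vision-language knowledge into the interaction feature, the sketch below lets interaction queries cross-attend to frozen CLIP visual tokens; the module name, dimensions, and residual fusion are assumptions for illustration, not the citing paper's actual architecture.

```python
# Hedged sketch: interaction query features attend over frozen CLIP visual
# tokens so that vision-language knowledge is injected into the interaction
# representation. All dimensions and names are assumed.
import torch
import torch.nn as nn

class CLIPGuidedInteractionFusion(nn.Module):
    def __init__(self, dim=256, clip_dim=512, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(clip_dim, dim)            # map CLIP tokens to decoder dim
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, interaction_queries, clip_tokens):
        # interaction_queries: (batch, num_queries, dim) from the HOI decoder
        # clip_tokens:         (batch, num_tokens, clip_dim) frozen CLIP visual features
        kv = self.proj(clip_tokens)
        attended, _ = self.cross_attn(interaction_queries, kv, kv)
        return self.norm(interaction_queries + attended)  # residual fusion
```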
“…Recently, the CLIP model (Radford et al 2021) has demonstrated strong generalization to various downstream tasks (Gu et al 2022; Li et al 2022a; Esmaeilpour et al 2022), and several studies have attempted to leverage the knowledge from CLIP for HOI detection. E.g., Dong et al (2022), Yuan et al (2022a), Wan et al (2023) and Zhang et al (2023) integrate text embeddings generated from CLIP to enhance the representation of semantic features, and have achieved promising results. In contrast, Liao et al (2022) leverage the CLIP knowledge for interaction classification and visual feature distillation, but such a kind of distillation may introduce ambiguity when multiple HOI triplets exist in the scenario.…”
Section: Exploiting Vision-language Models
Confidence: 99%
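Several of the methods quoted above build HOI classifiers from CLIP text embeddings of verb-object prompts. The following is a minimal sketch of that general recipe using the open-source clip package; the prompt template and the hoi_classes list are placeholders, and no particular cited method is reproduced.

```python
# Hedged illustration: encode HOI prompts with CLIP and use the resulting text
# features as the weights of a cosine-similarity classifier over interaction
# embeddings. Prompt template and class names are assumptions.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

hoi_classes = ["ride bicycle", "hold cup", "kick ball"]   # placeholder triplet labels
prompts = [f"a photo of a person {c}" for c in hoi_classes]

with torch.no_grad():
    text_features = model.encode_text(clip.tokenize(prompts).to(device))
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

def classify_interactions(interaction_emb, logit_scale=100.0):
    # interaction_emb: (num_queries, embed_dim) interaction features projected
    # into CLIP's embedding space.
    interaction_emb = interaction_emb / interaction_emb.norm(dim=-1, keepdim=True)
    return logit_scale * interaction_emb @ text_features.t()
```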
“…The two-stage methods (Gao et al 2020; Li et al 2020a; Ulutan, Iftekhar, and Manjunath 2020; Wan et al 2023; Zhong et al 2020; Yang and Zou 2020; Zhang, Campbell, and Gould 2021; Park, Park, and Lee 2023) use an independent detector to obtain object locations and categories, followed by specific modules for human-object association and interaction recognition. In contrast, the one-stage paradigm (Zhong et al 2022; Zhou and Chi 2019; Wang et al 2020; Zhong et al 2021; Yuan et al 2022b) […] (Radford et al 2021; Li et al 2021, 2022a; Gao et al 2021; Devlin et al 2018) has demonstrated remarkable generalization capabilities across various downstream tasks (Du et al 2022; Feng et al 2022; Gu et al 2021; Li, Savarese, and Hoi 2022), and thus was also transferred into the HOI detection task by previous methods. GEN-VLKT (Liao et al 2022) employs image feature distillation and initializes classifiers with HOI prompts.…”
Section: Related Work
Confidence: 99%
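The image feature distillation mentioned for GEN-VLKT can be pictured as pulling the detector's pooled visual feature toward the frozen CLIP image embedding of the same image. The loss below is an assumed, simplified form (normalized L1), not the loss actually used in GEN-VLKT or any other cited method.

```python
# Hedged sketch of a CLIP image-feature distillation term.
import torch.nn.functional as F

def clip_distillation_loss(detector_feature, clip_image_feature):
    """
    detector_feature:   (batch, d) pooled feature from the HOI detector,
                        already projected to CLIP's embedding dimension d.
    clip_image_feature: (batch, d) frozen CLIP image embedding of the same images.
    """
    detector_feature = F.normalize(detector_feature, dim=-1)
    clip_image_feature = F.normalize(clip_image_feature, dim=-1)
    return F.l1_loss(detector_feature, clip_image_feature)
```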