2021
DOI: 10.48550/arxiv.2107.13083
Preprint

Is Object Detection Necessary for Human-Object Interaction Recognition?

Ying Jin,
Yinpeng Chen,
Lijuan Wang
et al.

Abstract: This paper revisits human-object interaction (HOI) recognition at image level without using supervision of object location and human pose. We name it detection-free HOI recognition, in contrast to the existing detection-supervised approaches, which rely on object and keypoint detections to achieve state of the art. With our method, not only is detection supervision avoidable, but superior performance can be achieved by properly using image-text pre-training (such as CLIP) and the proposed Log-Sum-Exp Sign (L…
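The abstract is cut off at the name of the proposed loss. Purely as a loose illustration, and not the paper's exact formulation, a multi-label loss built from a log-sum-exp over signed logits (the general construction the name "Log-Sum-Exp Sign" suggests) could be sketched as below; the function and tensor names are our own assumptions.

```python
import torch

def lse_sign_style_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Illustrative multi-label loss: log-sum-exp over signed logits.

    This is a sketch of the generic construction suggested by the name
    "Log-Sum-Exp Sign", not necessarily the paper's exact definition.

    Args:
        logits:  (batch, num_classes) raw class scores.
        targets: (batch, num_classes) multi-hot labels in {0, 1}.
    Returns:
        Scalar loss averaged over the batch.
    """
    # Map {0, 1} labels to signs {-1, +1}: positive classes should receive
    # large positive logits, negative classes large negative logits.
    signs = 2.0 * targets - 1.0
    # Per-sample loss log(1 + sum_j exp(-y_j * s_j)), computed stably by
    # appending a zero logit before applying logsumexp.
    signed = -signs * logits
    zeros = torch.zeros(signed.size(0), 1, device=signed.device, dtype=signed.dtype)
    per_sample = torch.logsumexp(torch.cat([zeros, signed], dim=1), dim=1)
    return per_sample.mean()

# Example usage with random data (hypothetical shapes).
if __name__ == "__main__":
    scores = torch.randn(4, 80)                 # batch of 4, 80 interaction classes
    labels = (torch.rand(4, 80) > 0.9).float()  # sparse multi-hot targets
    print(lse_sign_style_loss(scores, labels))
```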

Cited by 2 publications (2 citation statements)
References 35 publications
“…The training set includes 38116 images and the test set includes 9658 images. For a fair comparison, we follow the standard practice and mainly focus on those previous methods that do not require extra supervision (Fang et al., 2018) or data (Li et al., 2020b; 2019b; Jin et al., 2021). By default, we choose PVTv2-b2 (Wang et al., 2021b) as the ViT backbone.…”
Section: Main Results I: Human-Object Interaction Recognition (mentioning)
Confidence: 99%
“…Gan et al. (2017) and Zhao et al. (2020) have suggested style-guided captioning, but also employ training over paired data. CLIP (2021) marked a turning point in vision-language perception, and has been utilized for vision-related tasks by various distillation techniques (Song et al., 2022; Jin et al., 2021; Gal et al., 2021; Khandelwal et al., 2022). Recent captioning methods use CLIP for reducing training time (Mokady et al., 2021), improved captions (Shen et al., 2021; Luo et al., 2022a,b; Cornia et al., 2021; Kuo and Kira, 2022), and in zero-shot settings (Su et al., 2022; Tewel et al., 2022).…”
Section: Related Work (mentioning)
Confidence: 99%