2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.00224

End-to-End Human-Gaze-Target Detection with Transformers

Abstract: This paper proposes an efficient and effective method for joint gaze location detection (GL-D) and gaze object detection (GO-D), i.e., gaze following detection. Current approaches frame GL-D and GO-D as two separate tasks, employing a multi-stage framework in which human head crops must first be detected and then fed into a subsequent GL-D sub-network, which is in turn followed by an additional object detector for GO-D. In contrast, we reframe the gaze following detection task as detecting human head locations…
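The abstract contrasts a chained multi-stage pipeline (head detector → gaze sub-network → object detector) with a single end-to-end model. The sketch below illustrates that end-to-end formulation in the DETR style suggested by the title; all names, layer sizes, and output heads are illustrative assumptions, not the paper's actual architecture, and positional encodings and matching losses are omitted for brevity.

```python
import torch
import torch.nn as nn
import torchvision

class EndToEndGazeDetector(nn.Module):
    # DETR-style sketch: a shared set of learned queries jointly predicts a
    # human head box and the location that head is gazing at, replacing the
    # chained head-crop -> gaze sub-network -> object detector pipeline.
    def __init__(self, hidden_dim=256, num_queries=20, nheads=8, num_layers=6):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # (B, 2048, H/32, W/32)
        self.input_proj = nn.Conv2d(2048, hidden_dim, kernel_size=1)
        self.transformer = nn.Transformer(
            d_model=hidden_dim, nhead=nheads,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.queries = nn.Embedding(num_queries, hidden_dim)
        self.head_box = nn.Linear(hidden_dim, 4)       # (cx, cy, w, h), normalized
        self.gaze_point = nn.Linear(hidden_dim, 2)     # gaze location (x, y), normalized
        self.watch_outside = nn.Linear(hidden_dim, 2)  # gaze inside vs. outside the frame

    def forward(self, images):                         # images: (B, 3, H, W)
        feats = self.input_proj(self.backbone(images))
        src = feats.flatten(2).transpose(1, 2)         # (B, H'*W', hidden_dim) tokens
        tgt = self.queries.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        hs = self.transformer(src, tgt)                # (B, num_queries, hidden_dim)
        return {
            "head_boxes": self.head_box(hs).sigmoid(),
            "gaze_points": self.gaze_point(hs).sigmoid(),
            "watch_outside": self.watch_outside(hs),
        }
```

A forward pass on `torch.randn(2, 3, 480, 640)` returns one head box, one gaze point, and one in/out-of-frame logit pair per query, so head detection and gaze prediction come out of a single model rather than separate stages.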

Cited by 46 publications (6 citation statements) | References 42 publications

“…Gaze Target Detection Methods Furthermore, we conduct comparisons with five recent methods: Chen (Chen et al. 2021), Fang (Fang et al. 2021), Tu (Tu et al. 2022), Bao (Bao, Liu, and Yu 2022), and Miao (Miao, Hoai, and Samaras 2023). These methods have all demonstrated notable performance within the confines of within-dataset evaluations.…”
Section: Comparison Methods (mentioning)
confidence: 99%
“…For our method, we use the pre-trained lightweight body pose estimator RTMPose (Jiang et al. 2023) and object detector YOLOv3 (Redmon et al. 2016). On the other hand, competing methods introduce other modules, e.g., face detection and depth estimation from the scene (Fang et al. 2021), body pose estimation and 3D reconstruction from the scene (Bao, Liu, and Yu 2022), and a ViT backbone (Tu et al. 2022). To measure their computational complexity, we also select recent high-speed implementations for them and compare their inference speed on a single NVIDIA Titan XP GPU.…”
Section: Computational Complexity (mentioning)
confidence: 99%
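The quoted comparison times competing pipelines on a single NVIDIA Titan XP. As a rough guide to how such per-image latency numbers are usually obtained, here is a generic GPU timing harness; the `benchmark` helper, its warm-up counts, and the input shape are illustrative assumptions, not code from any of the cited papers:

```python
import time
import torch

@torch.no_grad()
def benchmark(model, input_shape=(1, 3, 480, 640), warmup=10, iters=100, device="cuda"):
    # Rough per-image GPU latency in milliseconds. torch.cuda.synchronize()
    # is required because CUDA kernels launch asynchronously; timing without
    # it would only measure kernel-launch overhead, not actual compute.
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    for _ in range(warmup):   # warm-up: cuDNN autotuning, allocator growth
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1000.0
```

Running every method through the same harness at the same input resolution and batch size is what makes reported speed numbers comparable across pipelines.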
“…The main reason could be that the spatio-temporal transformer is trained with noisy gaze cues, as the VidHOI dataset lacks ground-truth gaze annotations. The performance of the adopted gaze following model (Chong et al., 2020) might be a limitation of our framework, but could be improved by leveraging more recent works in that field, such as (Tu et al., 2022a; Fang et al., 2021). In addition, even though the gaze does not yield a big improvement, the other extensions we proposed for the spatio-temporal transformer still boost model performance and allow us to achieve state of the art in HOI detection and anticipation in videos.…”
Section: Ablation Study (mentioning)
confidence: 99%
“…Zhong et al. (2021) proposed a one-stage method, namely the glance-and-gaze network, which adaptively simulates a set of action-aware points through glance and gaze steps. Tu et al. (2022) presented an effective and efficient method for human-gaze-target detection and gaze following based on determining the relations between salient objects and the human gaze from the global image context.…”
Section: Related Work (mentioning)
confidence: 99%