“…While image-based HOI detectors perform well on image datasets, they may perform poorly on video datasets because they cannot exploit the temporal cues required to distinguish between some continuous interactions, such as opening or closing a door (Fouhey et al, 2018). Hence, a few works (Qi et al, 2018; Chiou et al, 2021; Cong et al, 2021; Ji et al, 2021; Wang et al, 2021; Tu et al, 2022b) have been proposed that leverage the temporal dependencies between frames and demonstrate superior performance to image-based methods. However, these approaches do not consider human gaze as an additional feature, even though it often provides valuable information about human intentions (Johansson et al, 2001; Land and Hayhoe, 2001; Hayhoe et al, 2003; Baldauf and Deubel, 2010; Belardinelli et al, 2016).…”