“…The complex nature of egocentric videos raises a variety of challenges, such as egomotion [60], partially visible or occluded objects, and environmental bias [53,72,77,84,88], which limit the performance of traditional third-person approaches when applied to first-person action recognition (FPAR) [14,15]. The community's interest has grown quickly in recent years [16,17,19,83], thanks to the possibilities that these data open for the evaluation and understanding of human behavior, leading to the design of novel architectures [30,51,52,91,104]. While the use of optical flow has been the de facto procedure in FPAR [14,15,16,17,19,41], interest has recently shifted towards more lightweight alternatives, such as gaze [27,59,71], audio [9,52,77], depth [32], skeleton [32], and inertial measurements [41], to enable motion modeling in online settings.…”