“…There is a large body of work on vision-based human activity recognition for robots. These works infer the semantic label of an overall activity, or localize actions within a complex activity, to enable better human-robot interaction [20], [2], [10] and assistive robotics [14], [34]. Given input RGB/RGB-D videos [28], [17], [8], 3D human joint motions [21], [26], or data from other inertial/location sensors [9], [22], they train the perception model using fully or weakly labeled actions [17], [7], [13], or annotated locations of humans and their interactive objects [30], [24].…”