Neuroscience studies have shown that combining the gaze view with a third-person perspective strongly influences how accurately human behaviors are inferred. Given the importance of both first- and third-person observations for recognizing human behaviors, we propose a method that incorporates both kinds of observation into a technical system, improving on recognition based on third-person observations alone and yielding a more robust human activity recognition system. First, we present an extension of our proposed semantic reasoning method that takes gaze data and external observations as inputs to segment and infer human behaviors in complex real-world scenarios. Then, from the obtained results, we demonstrate that combining gaze and external input sources greatly enhances the recognition of human behaviors. Our findings have been applied to a humanoid robot that segments and recognizes the observed human activities online, with better accuracy when using both input sources; for example, activity recognition accuracy increases from 77% to 82% on our proposed pancake-making dataset. For completeness, we also evaluated our approach on another dataset with a setup similar to the one proposed in this work, namely the CMU-MMAC dataset. In this case, we improved the recognition of activities for the egg scrambling scenario from 54% to 86% by combining the external views with the gaze information, showing the benefit of incorporating gaze information to infer human behaviors across different datasets.
In this paper we address a basic limitation of SiamMask, the state-of-the-art single-object tracking and segmentation algorithm. SiamMask is semi-supervised in that it requires a bounding box to be drawn manually around the object to be tracked. This is not always possible or feasible, and it slows down the pipeline even in the best case. We overcome this limitation by using state-of-the-art object detection algorithms, Detectron and YOLO, to detect the object automatically and then track it with SiamMask. We note that YOLO gives better and more meaningful detections of objects in the scene. However, Detectron detects faster than YOLO, making the overall detection and tracking process faster.
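The detect-then-track idea described above can be illustrated with a minimal sketch. It is not the authors' implementation and does not use the real SiamMask, Detectron, or YOLO APIs: `Detection`, `pick_initial_box`, `detect_then_track`, and the `detect`, `init_tracker`, and `track` callables are hypothetical names introduced here for illustration. The abstract does not say how the object to track is chosen among the detections, so picking the highest-scoring detection (optionally restricted to one label) is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x, y, width, height)


@dataclass
class Detection:
    """Hypothetical container for one detector output."""
    label: str
    score: float
    box: Box


def pick_initial_box(
    detections: Sequence[Detection],
    target_label: Optional[str] = None,
    min_score: float = 0.5,
) -> Optional[Box]:
    """Return the box of the highest-scoring detection, or None.

    Assumption: the tracker is seeded with the most confident detection,
    optionally filtered by class label.
    """
    candidates = [
        d for d in detections
        if d.score >= min_score
        and (target_label is None or d.label == target_label)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda d: d.score).box


def detect_then_track(
    frames: Iterable[Any],
    detect: Callable[[Any], Sequence[Detection]],
    init_tracker: Callable[[Any, Box], Any],
    track: Callable[[Any, Any], Tuple[Any, Box]],
    target_label: Optional[str] = None,
    min_score: float = 0.5,
) -> List[Optional[Box]]:
    """Run the detector until an object is found, then only the tracker.

    `detect`, `init_tracker`, and `track` are placeholders for whichever
    detector and tracker are plugged in; they replace the manual
    bounding-box step with an automatic one.
    """
    results: List[Optional[Box]] = []
    state = None
    for frame in frames:
        if state is None:
            box = pick_initial_box(detect(frame), target_label, min_score)
            if box is None:
                results.append(None)  # nothing to track yet
                continue
            state = init_tracker(frame, box)  # automatic initialisation
            results.append(box)
        else:
            state, box = track(state, frame)
            results.append(box)
    return results
```

In this sketch the detector runs only until the first usable detection, so its cost affects mainly the start of the sequence; re-detecting after a tracking failure would be a natural extension but is not covered by the abstract.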