Abstract-Human action and activity recognition from videos has attracted an increasing number of researchers in recent years. However, most works target multimedia retrieval and surveillance applications and rarely humanoid household robots, even though robotic perception of human activities would allow more natural human-robot interaction (HRI). To encourage future studies in this domain, we present a novel data set specifically designed for HRI scenarios. The Robo-kitchen data set consists of 14 typical kitchen activities, recorded with two different stereo-camera setups and each performed by 17 subjects. To establish a baseline for future work, we extend a state-of-the-art action recognition method to the activity classification problem and evaluate it on the Robo-kitchen data set, showing promising results.
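As a rough illustration of the data set's combinatorial structure, the sketch below enumerates a hypothetical recording index, assuming every subject performed every activity under both stereo-camera setups; the label names and index layout are invented for illustration and are not taken from the paper.

```python
from itertools import product

# Hypothetical names; the abstract specifies 14 activities, 17 subjects, 2 setups.
ACTIVITIES = [f"activity_{i:02d}" for i in range(14)]
SUBJECTS = [f"subject_{i:02d}" for i in range(17)]
SETUPS = ["setup_A", "setup_B"]  # the two stereo-camera setups

# Full cross product, assuming every combination was recorded once.
recordings = [
    {"setup": s, "subject": p, "activity": a}
    for s, p, a in product(SETUPS, SUBJECTS, ACTIVITIES)
]
print(len(recordings), "recordings")  # 2 * 17 * 14 = 476
```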
Abstract-In this paper, a multi-level approach to intention, activity, and motion recognition for a humanoid robot is proposed. Our system processes images from a monocular camera and combines this information with domain knowledge. The recognition works online and in real time; it is independent of the test person but limited to predefined viewpoints. The main contributions of this paper are the extensible multi-level modeling of the robot's vision system, the efficient activity and motion recognition, and the asynchronous information fusion based on generic processing of mid-level recognition results. The complementarity of the activity and motion recognition makes the approach robust against misclassifications. Experimental results on a real-world data set of complex kitchen tasks, e.g., Prepare Cereals or Lay Table, demonstrate the performance and robustness of the multi-level recognition approach.
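To make the fusion idea concrete, here is a minimal sketch of combining asynchronously arriving mid-level recognition results, assuming each level emits a discrete posterior over the same label set; the label names, the `AsyncFusion` helper, and the product-of-experts combination rule are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

ACTIVITIES = ["Prepare Cereals", "Lay Table", "Wash Dishes"]  # hypothetical label set

class AsyncFusion:
    """Keeps the latest posterior from each recognizer and fuses them on demand."""
    def __init__(self, n_classes):
        self.latest = {}          # recognizer name -> most recent posterior
        self.n_classes = n_classes

    def push(self, source, posterior):
        # Called whenever a recognizer (e.g., activity or motion level) emits a result;
        # sources may update at different rates, hence "asynchronous".
        self.latest[source] = np.asarray(posterior, dtype=float)

    def fused(self):
        # Product-of-experts combination of whatever results are currently available.
        belief = np.ones(self.n_classes)
        for posterior in self.latest.values():
            belief *= posterior
        return belief / belief.sum()

fusion = AsyncFusion(len(ACTIVITIES))
fusion.push("motion", [0.5, 0.3, 0.2])    # fast, frame-level cue
fusion.push("activity", [0.2, 0.7, 0.1])  # slower, segment-level cue
print(dict(zip(ACTIVITIES, fusion.fused().round(3))))
```

Because the two cues are multiplied, a misclassification by one level is damped whenever the other level disagrees, which is one way to read the complementarity claim above.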
Abstract-Knowledge of the body orientation of humans can improve the speed and performance of many service components of a smart room. Since many such components run in parallel, an estimator that acquires this knowledge must have very low computational complexity. In this paper, we address these two points with a fast and efficient algorithm that uses the smart room's multiple camera outputs. The estimation is based on silhouette information only and is performed for each camera view separately. The single-view results are fused within a Bayesian filter framework. We evaluate our system on a subset of videos from the CLEAR 2007 dataset [1] and achieve an average correct classification rate of 87.8%, while the estimation itself takes just 12 ms when four cameras are used.
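A minimal sketch of how such a multi-view fusion could look, assuming body orientation is discretized into eight 45-degree bins and filtered with a simple discrete Bayes filter; the bin count, the neighbor-diffusion transition model, and the `predict`/`update` helpers are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

N_BINS = 8  # assumption: orientation discretized into 8 classes of 45 degrees

def predict(belief, stay_prob=0.8):
    """Diffuse the belief to neighboring orientation bins (simple transition model)."""
    move_prob = (1.0 - stay_prob) / 2.0
    predicted = (stay_prob * belief
                 + move_prob * np.roll(belief, 1)
                 + move_prob * np.roll(belief, -1))
    return predicted / predicted.sum()

def update(belief, per_view_likelihoods):
    """Fuse single-view likelihoods into the belief (views assumed independent)."""
    for likelihood in per_view_likelihoods:
        belief = belief * likelihood
    return belief / belief.sum()

# One filter step: four cameras, each contributing a likelihood over the 8 bins.
belief = np.full(N_BINS, 1.0 / N_BINS)          # uniform prior
views = [np.random.dirichlet(np.ones(N_BINS))   # stand-in for silhouette-based scores
         for _ in range(4)]
belief = update(predict(belief), views)
print("estimated orientation bin:", belief.argmax())
```

Running the per-view scoring independently and fusing only the resulting class likelihoods keeps the per-frame cost low, which is consistent with the 12 ms figure reported for four cameras.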