The key stage in developing an algorithm for recognizing human actions is the construction of an informative and distinctive descriptor. For a robot control system based on human action recognition, this stage can be decisive. Using machine vision in real conditions introduces a number of difficulties: an inhomogeneous background, an uncontrolled working environment, irregular lighting, partial occlusion of the observed object, the speed of actions, and so on. In this paper, we propose an algorithm for recognizing human actions in images with complex structure, based on a 3-D binary micro-block difference descriptor. The algorithm fuses multimodal information obtained from depth sensors and visible-range cameras. Because the two sources are complementary, the fusion reduces the influence of external factors on video quality, such as poor lighting, loss of information during data transmission, and noise, and the combined video stream may contain information that is unavailable from either source alone. In addition to the main descriptor, we propose to analyze the human skeleton. These data reduce the recognition error and focus the method on finer actions performed by a person's hands or wrists. Experimental results show the effectiveness of the proposed algorithm on known data sets.
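The abstract does not spell out how the descriptor is computed, so the following is only a minimal sketch of the general idea behind a binary micro-block difference descriptor on a spatio-temporal volume, not the authors' implementation. It assumes that pairs of small cubic blocks are sampled at random positions, that each bit records which block of a pair has the larger mean value, and that the visible-range and depth modalities are fused by simple concatenation. The function names (`sample_block_pairs`, `binary_block_difference`, `fused_descriptor`) and all parameters (`block`, `n_pairs`, `seed`) are hypothetical.

```python
import numpy as np


def sample_block_pairs(shape, block, n_pairs, seed=0):
    """Draw random corner coordinates for pairs of cubic micro-blocks.

    Assumption: random sampling; the paper may use a different pattern.
    Returns an integer array of shape (n_pairs, 2, 3) holding (t, y, x) corners.
    """
    rng = np.random.default_rng(seed)
    # Number of valid corner positions along each axis.
    upper = [dim - block + 1 for dim in shape]
    if min(upper) <= 0:
        raise ValueError("block is larger than the volume")
    # Sample each axis separately, then stack into (n_pairs, 2, 3).
    axes = [rng.integers(0, u, size=(n_pairs, 2)) for u in upper]
    return np.stack(axes, axis=-1)


def binary_block_difference(volume, pairs, block):
    """One bit per pair: 1 if the mean of block A exceeds the mean of block B."""
    volume = np.asarray(volume, dtype=np.float64)
    bits = np.empty(len(pairs), dtype=np.uint8)
    for i, (a, b) in enumerate(pairs):
        mean_a = volume[a[0]:a[0] + block, a[1]:a[1] + block, a[2]:a[2] + block].mean()
        mean_b = volume[b[0]:b[0] + block, b[1]:b[1] + block, b[2]:b[2] + block].mean()
        bits[i] = mean_a > mean_b
    return bits


def fused_descriptor(gray_clip, depth_clip, block=2, n_pairs=64, seed=0):
    """Concatenate the binary descriptors of two aligned (T, H, W) clips.

    Assumption: fusion by concatenation, with the same block pairs in both
    modalities; the paper's fusion scheme may differ.
    """
    gray_clip = np.asarray(gray_clip)
    depth_clip = np.asarray(depth_clip)
    if gray_clip.shape != depth_clip.shape:
        raise ValueError("clips must be aligned and have the same shape")
    pairs = sample_block_pairs(gray_clip.shape, block, n_pairs, seed)
    return np.concatenate([
        binary_block_difference(gray_clip, pairs, block),
        binary_block_difference(depth_clip, pairs, block),
    ])
```

Calling `fused_descriptor` on two aligned clips of shape `(T, H, W)` returns a vector of `2 * n_pairs` bits. Binary vectors of this kind are typically compared with the Hamming distance, which is one reason binary descriptors are attractive when computation is limited.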