Joint estimation of the human body is applicable to many fields, such as human–computer interaction, autonomous driving, video analysis, and virtual reality. Although much depth-based research has been classified and generalized in previous review or survey papers, point cloud-based pose estimation of the human body remains difficult due to the unordered nature and rotation invariance of point clouds. In this review, we summarize recent developments in point cloud-based pose estimation of the human body. Existing works are divided into three categories based on their working principles: template-based, feature-based, and machine learning-based methods. In particular, significant works are highlighted with detailed introductions that analyze their characteristics and limitations. The widely used datasets in the field are summarized, and quantitative comparisons are provided for the representative methods. Moreover, this review helps further the understanding of pertinent applications in many frontier research directions. Finally, we conclude with the challenges involved and the problems to be solved in future research.
Joint estimation of the human body in point clouds is a key step for tracking human movements. In this work, we present a geometric method to detect the joints from a single-frame point cloud captured using a Time-of-Flight (ToF) camera. The three-dimensional (3D) human silhouette, a global feature of the single-frame point cloud, is extracted from the pre-processed data; the angle and aspect ratio of the silhouette are then used to perform pose recognition, and 14 joints of the human body are subsequently derived from the geometric features of the 3D silhouette. To verify this method, we test on an in-house captured 3D dataset containing 1200 frames of depth images, which can be categorized into four different poses (upright, raising hands, parallel arms, and akimbo). Furthermore, we test on a subset of the G3D dataset. Using hand-labelled joints of each human body as the ground truth for validation and benchmarking, the average normalized error of our geometric method is less than 5.8 cm. With a distance threshold of 10 cm from the ground truth, the results demonstrate that our proposed method delivers improved performance, with an average accuracy of around 90%.

INDEX TERMS Depth camera, human pose detection, joint detection, sensor systems and applications.
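The evaluation protocol above reports mean joint error and the fraction of joints within a 10 cm distance threshold of the ground truth. A minimal sketch of these two metrics is shown below; the function names and array layout are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def mean_joint_error(pred, gt):
    """Average Euclidean error over all frames and joints, in metres.

    pred, gt: (N, J, 3) arrays of N frames x J joints x 3D coordinates.
    """
    dist = np.linalg.norm(np.asarray(pred, float) - np.asarray(gt, float), axis=-1)
    return float(dist.mean())

def joint_accuracy(pred, gt, threshold=0.10):
    """Fraction of predicted joints within `threshold` metres of ground truth."""
    dist = np.linalg.norm(np.asarray(pred, float) - np.asarray(gt, float), axis=-1)
    return float((dist <= threshold).mean())

# Illustrative usage with one frame of 14 joints, one joint off by 20 cm:
gt = np.zeros((1, 14, 3))
pred = np.zeros((1, 14, 3))
pred[0, 0, 0] = 0.2
print(joint_accuracy(pred, gt))   # 13 of 14 joints fall within 10 cm
```

Both the in-house dataset and the G3D subset could be scored this way per pose category, assuming joint coordinates are expressed in metres in a common camera frame.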