Hand pose is emerging as an important interface for human-computer interaction.
Introduction
The problem of tracking articulated objects has attracted increasing attention in computer vision, as it provides a natural method of human-computer interaction (HCI) [9], [10]. Inferring the pose and gesture of the human hand is an important challenge in this area. Active vision approaches for hand pose estimation using depth sensors such as Leap Motion and Kinect have made considerable progress in recent years. These sensors actively emit electromagnetic waves into the scene to probe how far each point in the field of view is from the imaging device. While active vision techniques provide good shape information and robustness to clutter, they have several limitations, including high energy consumption, a poor form factor, less accurate coverage at near distances, and poor outdoor performance.
In contrast, in this paper we explore the use of passive vision for hand pose estimation using a stereovision system composed of adjacent RGB cameras. Such a camera rig does not project light into the scene and therefore has advantages complementary to depth imaging, including lower energy consumption. However, hand pose estimation in this context is a more challenging computer vision problem, and one that has received less attention in the literature. We address this gap by proposing a novel framework that performs jointly optimal depth and hand pose estimation using Markov-chain Monte Carlo (MCMC) sampling and deep learning. Our research is motivated by the possibility of estimating articulation from stereo camera input captured from an egocentric, stereoscopic perspective. We are inspired by human vision, which can efficiently discern articulations and perform tracking with passive, binocular input. As our experiments show, our approach is compatible with inexpensive stereo vision systems, such as the rig shown in Figure 1, and produces robust hand pose inference.
The proposed technique also relies on a robust hand segmentation procedure. We do not address hand segmentation in this paper as there is a large body of literature on this subject (see, for example, [1], [21]).
Contribution
Unlike several approaches to pose estimation from stereo capture, which explicitly recover disparity and then regress the pose in a sequential manner, we present a joint optimization approach that is robust against potential errors
Hand Pose Estimation Using Deep Stereovision and Markov-chain Monte Carlo