A hand pose tracking benchmark from stereo matching

Estimating 3D hand pose from single RGB images is a highly ambiguous problem that relies on an unbiased training dataset. In this paper, we analyze cross-dataset generalization when training on existing datasets. We find that approaches perform well on the datasets they are trained on, but do not generalize to other datasets or in-the-wild scenarios. As a consequence, we introduce the first large-scale, multi-view hand dataset that is accompanied by both 3D hand pose and shape annotations. For annotating this real-world dataset, we propose an iterative, semi-automated 'human-in-the-loop' approach, which includes hand fitting optimization to infer both the 3D pose and shape for each sample. We show that methods trained on our dataset consistently perform well when tested on other datasets. Moreover, the dataset allows us to train a network that predicts the full articulated hand shape from a single RGB image. The evaluation set can serve as a benchmark for articulated hand shape estimation.

arXiv:1909.04349v3 [cs.CV] 13 Sep 2019

Figure 2 (panels: Training Set, Evaluation Set): Examples from our proposed dataset showing images (top row) and hand shape annotations (bottom row). The training set contains composited images from green screen recordings, whereas the evaluation set contains images recorded indoors and outdoors. The dataset features several subjects as well as object interactions.

One of the key aspects is that we record synchronized images from multiple views, an idea already used previously in [2,27]. The multiple views remove many ambiguities and ease both the manual annotation and automated fitting. The second key aspect of our approach is a semi-automated human-in-the-loop labeling procedure with a strong bootstrapping component. Starting from a sparse set of 2D keypoint annotations (e.g., finger tip annotations) and semi-automatically generated segmentation masks, we propose a hand fitting method that fits a deformable hand model [25] to a set of multi-view input. This fitting yields both 3D hand pose and shape annotation for each view. We then train a multi-view 3D hand pose estimation network using these annotations. This network predicts the 3D hand pose for unlabeled samples in our dataset along with a confidence measure. By verifying confident predictions and annotating least-confident samples in an iterative procedure, we acquire 11592 annotations with moderate manual effort by a human annotator.
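The iterative labeling procedure described above can be sketched roughly as follows. This is a minimal illustration of one round, not the authors' implementation: the function name `annotation_round`, the callbacks `predict`, `verify`, and `annotate`, and the `threshold` and `budget` parameters are all hypothetical, and poses are represented by placeholder strings.

```python
from typing import Callable, Dict, List, Tuple


def annotation_round(
    unlabeled: List[str],
    predict: Callable[[str], Tuple[str, float]],
    verify: Callable[[str, str], bool],
    annotate: Callable[[str], str],
    threshold: float = 0.9,
    budget: int = 1,
) -> Tuple[Dict[str, str], List[str]]:
    """Run one labeling round and return (new_labels, still_unlabeled).

    All names here are hypothetical stand-ins for the network, the human
    verifier, and the manual annotation tool.
    """
    # Score every unlabeled sample: (sample_id, predicted_pose, confidence).
    scored = [(sample, *predict(sample)) for sample in unlabeled]
    scored.sort(key=lambda item: item[2])  # least confident first

    # Spend the manual budget on the least-confident samples.
    new_labels = {sample: annotate(sample) for sample, _, _ in scored[:budget]}

    # Accept confident predictions only after a human has verified them.
    for sample, pose, confidence in scored[budget:]:
        if confidence >= threshold and verify(sample, pose):
            new_labels[sample] = pose

    # Everything else stays unlabeled and is revisited in the next round.
    remaining = [sample for sample in unlabeled if sample not in new_labels]
    return new_labels, remaining
```

In a full pipeline the pose network would presumably be retrained on the enlarged label set before the next round, so that the remaining samples are scored by a stronger model; that retraining step is outside this sketch.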

“…Annotations can also be provided manually on hand images [24,28,35]. However, the annotation is limited to visible regions of the hand.…”

Section: Related Work (mentioning)

confidence: 99%

“…Stereo Tracking Benchmark (STB) [35] dataset is one of the first and most commonly used datasets to report performance of 3D keypoint estimation from a single RGB image. The annotations are acquired manually limiting the setup to hand poses where most regions of the hands are visible.…”

Section: Considered Datasets (mentioning)

confidence: 99%

FreiHAND: A Dataset for Markerless Capture of Hand Pose and Shape From Single RGB Images

Zimmermann

Ceylan

Yang

et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

384

369

“…This is the approach of several papers on hand pose estimation from stereo capture, including [2], [3] and [14].…”

Section: Methods (mentioning)

confidence: 99%

“…Lastly, unlike the work in [14], which utilizes a state-of-the-art tracking method that is sensitive to erroneous initialization and anatomical hand size as discussed in [17], we propose a semi-generative approach that is experimentally proven to work on different sizes and tones of hand without pre-calibration.…”

Section: Hand Pose Estimation Using Deep Stereovision and Markov-chai (mentioning)

confidence: 99%

“…An example of this is [2], where a robust technique that focuses on depth recovery of hand pose is presented, specifically with the aim of later using it for hand pose estimation. [14] also proposed using recovered disparity for pose estimation. It utilizes an Adaptive GMM segmentation [19] to localize the hand skin region before recovering disparity based on stereo matches.…”

Section: Related Work (mentioning)

confidence: 99%

Hand Pose Estimation Using Deep Stereovision and Markov-Chain Monte Carlo

Basaru

Child

Alonso

et al. 2017

2017 IEEE International Conference on Computer Vision Workshops (ICCVW)

Hand pose is emerging as an important interface for human-computer interaction.

Introduction

The problem of tracking articulated objects has attracted increasing attention in the field of computer vision, as it provides a natural method of Human Computer Interaction (HCI) [9], [10]. Inference of the pose and gesture of the human hand is an important challenge in this area. Active vision approaches for hand pose estimation using depth sensors such as Leap Motion and Kinect have made considerable progress in recent years. These cameras actively dissipate electromagnetic waves into the scene, probing how far each point in the field of view is away from the imaging device. While active vision techniques provide good shape information and robustness to clutter, they present several limitations, including: large energy consumption, a poor form factor, less accurate near distance coverage, and poor outdoor usage.

In contrast, in this paper we explore the use of passive vision for the estimation of hand pose using a stereovision system composed of adjacent RGB cameras. Such a camera rig does not project light into the scene, and therefore has complementary advantages to depth imaging, including less energy consumption. However, hand pose estimation in this context is a more challenging computer vision problem, one that has received less attention in the literature. We address this gap by proposing a novel framework that combines jointly optimal depth and hand pose estimation in a unified framework using Markov-chain Monte Carlo (MCMC) sampling and deep learning. Our research is motivated by the possibility of estimating articulation with the input of stereo cameras from an egocentric, stereoscopic perspective. We are inspired by human vision, which can efficiently discern articulations and perform tracking activities with passive, binocular input. As our experiments show, our approach is compatible with inexpensive stereo vision systems, such as the rig shown in Figure 1, to produce robust hand pose inference. The proposed technique also relies on a robust hand segmentation procedure. We do not address hand segmentation in this paper as there is a large body of literature on this subject (see, for example, [1], [21]).

Contribution

Unlike several approaches to pose estimation from stereo capture that explicitly recover disparity before regressing for the pose in a sequential manner, we present a joint optimization approach that is robust against potential errors
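For context on the sequential pipelines mentioned above, which recover disparity before estimating pose, the standard pinhole relation for a rectified stereo pair converts a disparity d into depth Z = f · B / d. The snippet below is only a sketch of that textbook relation, not part of the authors' joint optimization; the function name and parameter names are assumptions chosen for illustration.

```python
def disparity_to_depth(
    disparity_px: float, focal_length_px: float, baseline_m: float
) -> float:
    """Depth in metres of a point seen by a rectified stereo pair: Z = f * B / d.

    Assumes both cameras share the focal length `focal_length_px` (in pixels)
    and are separated horizontally by `baseline_m` (in metres).
    """
    if disparity_px < 0:
        raise ValueError("disparity must be non-negative for a rectified pair")
    if disparity_px == 0:
        return float("inf")  # zero disparity corresponds to a point at infinity
    return focal_length_px * baseline_m / disparity_px
```

Because depth is inversely proportional to disparity, a fixed matching error in pixels produces a depth error that grows with distance, which is one reason errors in an explicit disparity step can propagate into the pose estimate.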

Resolving hand‐object occlusion for mixed reality with joint deep learning and model optimization

Feng

Shum

Morishima

2020

Computer Animation & Virtual

By overlaying virtual imagery onto the real world, mixed reality facilitates diverse applications and has drawn increasing attention. Enhancing physical in-hand objects with a virtual appearance is a key component for many applications that require users to interact with tools such as surgery simulations. However, due to complex hand articulations and severe hand-object occlusions, resolving occlusions in hand-object interactions is a challenging topic. Traditional tracking-based approaches are limited by strong ambiguities from occlusions and changing shapes, while reconstruction-based methods show a poor capability of handling dynamic scenes. In this article, we propose a novel real-time optimization system to resolve hand-object occlusions by spatially reconstructing the scene with estimated hand joints and masks. To acquire accurate results, we propose a joint learning process that shares information between two models and jointly estimates hand poses and semantic segmentation. To facilitate the joint learning system and improve its accuracy under occlusions, we propose an occlusion-aware RGB-D hand data set that mitigates the ambiguity through precise annotations and photorealistic appearance. Evaluations show more consistent overlays compared with literature, and a user study verifies a more realistic experience.

A hand pose tracking benchmark from stereo matching

Cited by 119 publications

References 16 publications
