Estimating 3D hand pose from single RGB images is a highly ambiguous problem that relies on an unbiased training dataset. In this paper, we analyze cross-dataset generalization when training on existing datasets. We find that approaches perform well on the datasets they are trained on, but do not generalize to other datasets or in-the-wild scenarios. As a consequence, we introduce the first large-scale, multi-view hand dataset that is accompanied by both 3D hand pose and shape annotations. For annotating this real-world dataset, we propose an iterative, semi-automated 'human-in-the-loop' approach, which includes hand fitting optimization to infer both the 3D pose and shape for each sample. We show that methods trained on our dataset consistently perform well when tested on other datasets. Moreover, the dataset allows us to train a network that predicts the full articulated hand shape from a single RGB image. The evaluation set can serve as a benchmark for articulated hand shape estimation.

Figure 2: Examples from our proposed dataset showing images (top row) and hand shape annotations (bottom row). The training set contains composited images from green screen recordings, whereas the evaluation set contains images recorded indoors and outdoors. The dataset features several subjects as well as object interactions.

One of the key aspects is that we record synchronized images from multiple views, an idea already used previously in [2, 27]. The multiple views remove many ambiguities and ease both the manual annotation and the automated fitting. The second key aspect of our approach is a semi-automated human-in-the-loop labeling procedure with a strong bootstrapping component. Starting from a sparse set of 2D keypoint annotations (e.g., finger tip annotations) and semi-automatically generated segmentation masks, we propose a hand fitting method that fits a deformable hand model [25] to a set of multi-view inputs. This fitting yields both 3D hand pose and shape annotations for each view. We then train a multi-view 3D hand pose estimation network using these annotations. This network predicts the 3D hand pose for unlabeled samples in our dataset along with a confidence measure. By verifying confident predictions and annotating the least-confident samples in an iterative procedure, we acquire 11,592 annotations with moderate manual effort by a human annotator.
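To make the multi-view fitting step concrete, the following is a minimal sketch of fitting articulated hand parameters to sparse 2D keypoints observed in several calibrated views. The toy hand_model, the pinhole project helper, the regularization weights, and the synthetic cameras are our own illustrative assumptions; the actual method fits the MANO model [25] and additionally uses segmentation masks and shape/pose priors.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative stand-ins (assumptions, not the paper's code): a toy "hand model"
# mapping pose/shape parameters to 21 3D keypoints, and a pinhole projection per view.
def hand_model(pose, shape):
    """Toy articulated model: 21 rest-pose keypoints perturbed by pose/shape parameters."""
    base = np.linspace(0.0, 0.2, 63).reshape(21, 3)     # rest-pose keypoints (meters)
    return base + 0.01 * pose.reshape(21, 3) + 0.001 * shape.sum()

def project(points_3d, K, R, t):
    """Project 3D points into one camera with intrinsics K and extrinsics (R, t)."""
    cam = points_3d @ R.T + t
    uv = cam @ K.T
    return uv[:, :2] / uv[:, 2:3]

def fitting_energy(params, views, n_pose=63, n_shape=10):
    """Sum of squared reprojection errors of annotated keypoints over all views,
    plus simple L2 regularizers standing in for pose/shape priors."""
    pose, shape = params[:n_pose], params[n_pose:n_pose + n_shape]
    joints = hand_model(pose, shape)
    err = 0.0
    for K, R, t, kp2d, visible in views:
        proj = project(joints, K, R, t)
        err += np.sum((proj[visible] - kp2d[visible]) ** 2)
    return err + 1e-3 * np.sum(pose ** 2) + 1e-3 * np.sum(shape ** 2)

# Two synthetic views with shared intrinsics; the second view only annotates finger tips.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
R1, t1 = np.eye(3), np.array([0.0, 0.0, 0.5])
R2, t2 = np.array([[0.0, 0, 1], [0, 1, 0], [-1, 0, 0]]), np.array([0.0, 0.0, 0.5])
gt = hand_model(np.zeros(63), np.zeros(10))
views = [(K, R1, t1, project(gt, K, R1, t1), np.arange(21)),
         (K, R2, t2, project(gt, K, R2, t2), np.array([0, 4, 8, 12, 16, 20]))]

x0 = 0.1 * np.random.default_rng(0).standard_normal(73)
res = minimize(fitting_energy, x0, args=(views,), method="L-BFGS-B")
print("final fitting energy:", res.fun)
```

Because every view constrains the same pose and shape parameters, adding views tightens the optimization and resolves depth and self-occlusion ambiguities that a single image leaves open.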
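The human-in-the-loop selection step can be summarized as a confidence-based split over the unlabeled pool. The snippet below is a minimal sketch under our own assumptions (the function name select_samples, the 0.9 threshold, and the batch size of 50 are illustrative, not values from the paper): confident network predictions are queued for quick human verification, while the least-confident samples are sent back for manual keypoint annotation before fitting and training are repeated.

```python
import numpy as np

def select_samples(confidences, accept_thresh=0.9, query_size=50):
    """Split unlabeled samples by predicted confidence: high-confidence predictions
    go to a quick verification queue, the least-confident samples are queued for
    manual 2D keypoint annotation. Threshold and batch size are illustrative."""
    confidences = np.asarray(confidences)
    to_verify = np.flatnonzero(confidences >= accept_thresh)
    to_annotate = np.argsort(confidences)[:query_size]
    return to_verify, to_annotate

# One bootstrapping round with synthetic confidence scores standing in for the
# multi-view network's outputs on the unlabeled part of the dataset.
rng = np.random.default_rng(0)
confidences = rng.uniform(size=2000)
to_verify, to_annotate = select_samples(confidences)
print(f"verify {len(to_verify)} confident predictions, "
      f"hand-annotate {len(to_annotate)} least-confident samples")
```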