This paper tackles the practically challenging problem of efficient and accurate hand pose estimation from single depth images. A dedicated two-step regression forest pipeline is proposed: given an input hand depth image, the first step estimates the 3D location and in-plane rotation of the hand using a pixelwise regression forest; this estimate is then used in the second step, where a similar regression forest model operating on the entire hand image patch delivers the final hand pose estimate. Moreover, our estimation is internally guided by a 3D hand kinematic chain model, whose parameters are estimated for an unseen test image by a proposed dynamically weighted scheme. As a combined effect of these building blocks, our approach delivers more precise hand pose estimates. In practice, our approach runs at 15.6 frames per second (FPS) on an average laptop when implemented on the CPU, and is further sped up to 67.2 FPS when running on the GPU. In addition, we introduce and make publicly available a data-glove-annotated depth image dataset covering various hand shapes and gestures, which enables quantitative analyses on real-world hand images. The effectiveness of our approach is verified empirically on both the synthetic and the annotated real-world datasets for hand pose estimation, as well as on related applications including part-based labeling and gesture classification. In addition to the empirical studies, the consistency property of our approach is also analyzed theoretically.
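To make the two-step cascade concrete, the following is a minimal illustrative sketch, not the authors' implementation: it uses scikit-learn's RandomForestRegressor as a stand-in for the paper's custom regression forests, simple depth-difference pixel features, and toy random data so it runs end to end. The feature offsets, patch size, joint count, and training data are all assumptions for illustration, and the kinematic chain model guidance and dynamically weighted scheme are omitted here.

```python
import numpy as np
from scipy.ndimage import rotate
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical configuration (not from the paper).
N_JOINTS = 21                                      # number of hand joints to regress
OFFSETS = rng.integers(-8, 9, size=(32, 2, 2))     # depth-difference probe offsets
PATCH = 32                                         # side length of the normalized hand patch

def pixel_features(depth, u, v):
    """Depth-difference features around pixel (u, v); a common per-pixel
    descriptor, standing in for the paper's actual features."""
    h, w = depth.shape
    f = np.empty(len(OFFSETS))
    for i, ((du1, dv1), (du2, dv2)) in enumerate(OFFSETS):
        p1 = depth[np.clip(v + dv1, 0, h - 1), np.clip(u + du1, 0, w - 1)]
        p2 = depth[np.clip(v + dv2, 0, h - 1), np.clip(u + du2, 0, w - 1)]
        f[i] = p1 - p2
    return f

# Step 1: pixelwise forest voting for hand 3D location and in-plane rotation (x, y, z, theta).
step1 = RandomForestRegressor(n_estimators=10, max_depth=12, random_state=0)
# Step 2: holistic forest regressing all joint positions from the normalized hand patch.
step2 = RandomForestRegressor(n_estimators=10, max_depth=12, random_state=0)

def estimate_pose(depth, hand_pixels):
    """Run the two-step cascade on one depth image."""
    # Step 1: each foreground pixel votes; votes are aggregated by their mean.
    votes = step1.predict(np.stack([pixel_features(depth, u, v) for u, v in hand_pixels]))
    x, y, z, theta = votes.mean(axis=0)

    # Normalize the hand patch using the step-1 estimate: crop around (x, y),
    # then de-rotate by the in-plane angle theta.
    h, w = depth.shape
    half = PATCH // 2
    u0 = int(np.clip(round(x), half, w - half))
    v0 = int(np.clip(round(y), half, h - half))
    patch = depth[v0 - half:v0 + half, u0 - half:u0 + half]
    patch = rotate(patch, -np.degrees(theta), reshape=False, order=1)

    # Step 2: regress the full hand pose (3D joint coordinates) from the whole patch.
    joints = step2.predict(patch.reshape(1, -1))
    return joints.reshape(N_JOINTS, 3)

# Toy training data (random placeholders) just so the sketch runs end to end.
step1.fit(rng.normal(size=(200, len(OFFSETS))), rng.normal(size=(200, 4)))
step2.fit(rng.normal(size=(200, PATCH * PATCH)), rng.normal(size=(200, N_JOINTS * 3)))

depth = rng.normal(size=(120, 160))
hand_pixels = rng.integers(0, [160, 120], size=(50, 2))   # sampled (u, v) foreground pixels
print(estimate_pose(depth, hand_pixels).shape)            # -> (21, 3)
```

In the paper's actual pipeline, the second-step forest is additionally constrained by the 3D hand kinematic chain model; the sketch above only mirrors the cascade structure of the two forests.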