Disentangling Latent Hands for Image Synthesis and Pose Estimation

Yang, Linlin; Yao, Angela

doi:10.1109/cvpr.2019.01011

Cited by 125 publications

(101 citation statements)

References 32 publications

“…They usually require a good initialization; otherwise they are susceptible to getting stuck in local minima. Discriminative methods learn a direct mapping from visual observations to hand poses [23,27,10,13,31,2]. Thanks to large-scale annotated datasets [31,29,23], deep learning-based discriminative methods have shown very strong performance in the hand pose estimation task.…”

Section: Related Work (mentioning)

confidence: 99%

“…To make the 3D pose annotations consistent for RHD, we follow [31,2] and modify the palm joint in STB to the wrist point. Similar to [31,2,19,27], we use 10 sequences for training and the other 2 for testing.…”

Section: Datasets and Evaluation Metrics (mentioning)

confidence: 99%

“…Cai et al [2] first proposed the use of labelled depth maps as regularizers to boost RGB-based methods. Yang et al [27] introduced a disentangled representation so that viewpoint can be used as a weak label. Inspired by these works, we aim to leverage multiple modalities as weak labels for enhancing RGB-based hand pose estimation.…”

Section: Introduction (mentioning)

confidence: 99%

“…VAEs are an attractive class of deep generative models which can be learned on large-scale, high-dimensional datasets. They have been shown to capture highly complex relationships across multiple modalities [21,24,26] and have also been applied to RGB-based pose estimation in the past [19,27]. However, both [19] and [27] learn a single shared latent space and as a result must compromise on pose reconstruction accuracy.…”

Section: Introduction (mentioning)

confidence: 99%

Aligning Latent Spaces for 3D Hand Pose Estimation

Yang, Lee, et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

Self Cite

Hand pose estimation from monocular RGB inputs is a highly challenging task. Many previous works for monocular settings only used RGB information for training despite the availability of corresponding data in other modalities such as depth maps. In this work, we propose to learn a joint latent representation that leverages other modalities as weak labels to improve RGB-based hand pose estimation. By design, our architecture is highly flexible in embedding various diverse modalities such as heat maps, depth maps and point clouds. In particular, we find that encoding and decoding the point cloud of the hand surface can improve the quality of the joint latent representations. Experiments show that with the aid of other modalities during training, our proposed method boosts the accuracy of RGB-based hand pose estimation systems and significantly outperforms state-of-the-art on two public benchmarks.

“…[2] improves the robustness of pose estimation methods by synthesizing more images from the augmented skeletons, which is achieved by obtaining more unseen skeletons instead of leveraging the unseen combinations of the specified factor (pose) and unspecified factors (background) in the existing dataset like ours. The most related work is [57], which proposes an disentangled VAE to learn the specified (pose) and additional (appearance) factors. However, our method explicitly makes the appearance factor orthogonal to the pose during training process, while [2] only guarantees that the pose factor does not contain information about the image contents.…”

Section: Related Work (mentioning)

confidence: 99%

Disentangling Pose from Appearance in Monochrome Hand Images

Twigg et al. 2019

2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)

Hand pose estimation from monocular 2D image is challenging due to the variation in lighting, appearance, and background. While some success has been achieved using deep neural networks, they typically require collecting a large dataset that adequately samples all the axes of variation of hand images. It would therefore be useful to find a representation of hand pose which is independent of the image appearance (like hand texture, lighting, background), so that we can synthesize unseen images by mixing pose-appearance combinations. In this paper, we present a novel technique that disentangles the representation of pose from a complementary appearance factor in 2D monochrome images. We supervise this disentanglement process using a network that learns to generate images of hand using specified pose+appearance features. Unlike previous work, we do not require image pairs with a matching pose; instead, we use the pose annotations already available and introduce a novel use of cycle consistency to ensure orthogonality between the factors. Experimental results show that our self-disentanglement scheme successfully decomposes the hand image into pose and its complementary appearance features of comparable quality as the method using paired data. Additionally, training the model with extra synthesized images with unseen hand-appearance combinations by re-mixing pose and appearance factors from different images can improve the 2D pose estimation performance.

Weakly Supervised 3D Hand Pose Estimation via Biomechanical Constraints

Spurr, Iqbal, Molchanov, et al. 2020

Lecture Notes in Computer Science

109

Estimating 3D hand pose from 2D images is a difficult, inverse problem due to the inherent scale and depth ambiguities. Current state-of-the-art methods train fully supervised deep neural networks with 3D ground-truth data. However, acquiring 3D annotations is expensive, typically requiring calibrated multi-view setups or labour intensive manual annotations. While annotations of 2D keypoints are much easier to obtain, how to efficiently leverage such weakly-supervised data to improve the task of 3D hand pose prediction remains an important open question. The key difficulty stems from the fact that direct application of additional 2D supervision mostly benefits the 2D proxy objective but does little to alleviate the depth and scale ambiguities. Embracing this challenge we propose a set of novel losses that constrain the prediction of a neural network to lie within the range of biomechanically feasible 3D hand configurations. We show by extensive experiments that our proposed constraints significantly reduce the depth ambiguity and allow the network to more effectively leverage additional 2D annotated images. For example, on the challenging freiHAND dataset, using additional 2D annotation without our proposed biomechanical constraints reduces the depth error by only 15%, whereas the error is reduced significantly by 50% when the proposed biomechanical constraints are used.
