Camera-Space Hand Mesh Recovery via Semantic Aggregation and Adaptive 2D-1D Registration

Chen, Xingyu; Liu, Yufeng; Ma, Chongyang; Chang, Jianlong; Wang, Huayan; Tian, Chen; Guo, Xiaofang; Wan, Pengfei; Zheng, Wen

doi:10.1109/cvpr46437.2021.01307

Cited by 69 publications

(51 citation statements)

References 35 publications


“…Vertex-based methods [18,46,11,50,51] predict 3D vertex coordinates directly, which usually follow a procedure of 2D encoding, 2D-to-3D mapping, and 3D decoding. For example, Kulon et al. [46] designed an encoder-decoder based on ResNet [30], global pooling, and spiral convolution (SpiralConv) [49] to obtain 3D vertex coordinates.…”

Section: Related Work (mentioning)

confidence: 99%

“…For feature lifting, two problems should be concerned: (1) how to collect 2D features and (2) how to map them to 3D domain. To this end, previous methods [46,18,11] tend to embed F_e as a latent vector via the global average pooling operation. Then, the latent vector is mapped to 3D domain with a fully connected layer (FC), and vertex features are obtained with vector re-arrangement.…”

Section: Feature Lifting Module (mentioning)

confidence: 99%
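
The baseline lifting scheme described in the statement above (global average pooling, a fully connected layer, then vector re-arrangement) can be sketched roughly as follows. This is a minimal pure-Python illustration, not code from any of the cited papers: the function names, the toy tensor sizes, and the random weights are all assumptions made for the example.

```python
import random


def global_average_pool(feature_map):
    """Collapse a C x H x W feature map (nested lists) into a length-C latent vector."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def fully_connected(vector, weights, bias):
    """Affine map: out[j] = sum_i vector[i] * weights[i][j] + bias[j]."""
    return [
        sum(v * weights[i][j] for i, v in enumerate(vector)) + bias[j]
        for j in range(len(bias))
    ]


def lift_to_vertices(feature_map, weights, bias, num_vertices, vertex_dim):
    """Pool the 2D features, map them with one FC layer, and split per vertex."""
    latent = global_average_pool(feature_map)
    flat = fully_connected(latent, weights, bias)
    # Vector re-arrangement: cut the flat FC output into one row per vertex.
    return [flat[v * vertex_dim:(v + 1) * vertex_dim] for v in range(num_vertices)]


# Hypothetical toy sizes, chosen only to keep the example small.
C, H, W = 4, 2, 2
NUM_VERTICES, VERTEX_DIM = 3, 2

rng = random.Random(0)
feature_map = [[[rng.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
weights = [
    [rng.uniform(-1.0, 1.0) for _ in range(NUM_VERTICES * VERTEX_DIM)]
    for _ in range(C)
]
bias = [0.0] * (NUM_VERTICES * VERTEX_DIM)

vertex_features = lift_to_vertices(feature_map, weights, bias, NUM_VERTICES, VERTEX_DIM)
```

Note that the FC layer needs one weight per (latent channel, vertex, vertex feature) triple, so its size grows with the product of all three dimensions.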

“…Recent researches report pixel-aligned feature extraction based on 2D landmarks and pixel-aligned feature pooling [66,23,26,83,89]. Heatmap H_p is usually employed to encode 2D landmarks [23,18,11,85], which derives more accurate landmarks compared with direct regression of the 2D positions L_p [48,74,7,71,76].…”

Section: Feature Lifting Module (mentioning)

confidence: 99%

“…A typical pipeline for single-view hand reconstruction includes three phases: 2D encoding, 2D-to-3D mapping, and 3D decoding. In 2D encoding, existing approaches (such as [46,11,50,51]) usually adopt computationally intensive networks [30,75] to handle this highly nonlinear task, which are hard to deploy on mobile platforms. Instead, if naively leveraging a mature mobile network (e.g., [32]) that is not tailored for our target task, the reconstruction accuracy dramatically degrades [22].…”

Section: Introduction (mentioning)

confidence: 99%

“…Furthermore, the PVL transforms 2D pose encodings to 3D vertex features based on a learnable lifting matrix, resulting in enhanced 3D accuracy and temporal consistency. As compared to the traditional approach (i.e., fully connected operation in a latent space [46,18,11]), our feature lifting module also significantly reduces the model size. In addition, two extra strategies are developed: (1) we construct a uniformly distributed hand pose dataset as the complement; (2) during training, consistency loss like [24,79] is designed to further improve the temporal performance of the non-sequential MobRecon.…”

Section: Introduction (mentioning)

confidence: 99%
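
To illustrate why a learnable lifting matrix can be much smaller than a fully connected layer over a latent vector, here is a hedged sketch. It assumes the lifting matrix has one row per vertex and one column per 2D joint, so that each vertex feature is a weighted sum of per-joint features; that layout, the function names, and every size below are assumptions for illustration, not details taken from the excerpt.

```python
def lift_with_matrix(joint_features, lifting_matrix):
    """vertex[v][c] = sum_j lifting_matrix[v][j] * joint_features[j][c]."""
    num_channels = len(joint_features[0])
    return [
        [
            sum(row[j] * joint_features[j][c] for j in range(len(row)))
            for c in range(num_channels)
        ]
        for row in lifting_matrix
    ]


def fc_lifting_params(latent_dim, num_vertices, vertex_dim):
    """Weights plus biases of an FC layer mapping a latent vector to all vertex features."""
    return latent_dim * num_vertices * vertex_dim + num_vertices * vertex_dim


def matrix_lifting_params(num_joints, num_vertices):
    """Entries of a vertices-by-joints lifting matrix."""
    return num_vertices * num_joints


# Two joints with two feature channels each, lifted to three vertices.
joint_features = [[1.0, 2.0], [3.0, 4.0]]
lifting_matrix = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
vertex_features = lift_with_matrix(joint_features, lifting_matrix)

# Hypothetical sizes for the comparison (not the paper's configuration).
fc_count = fc_lifting_params(latent_dim=256, num_vertices=49, vertex_dim=256)
matrix_count = matrix_lifting_params(num_joints=21, num_vertices=49)
```

Under these assumed sizes the matrix holds a few thousand times fewer parameters than the FC layer, because its size no longer depends on the feature dimension.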


MobRecon: Mobile-Friendly Hand Mesh Reconstruction from Monocular Image

Chen, Liu, Dong, et al. 2021

Preprint

Self Cite


In this work, we propose a framework for single-view hand mesh reconstruction, which can simultaneously achieve high reconstruction accuracy, fast inference speed, and temporal coherence. Specifically, for 2D encoding, we propose lightweight yet effective stacked structures. Regarding 3D decoding, we provide an efficient graph operator, namely depth-separable spiral convolution. Moreover, we present a novel feature lifting module for bridging the gap between 2D and 3D representations. This module starts with a map-based position regression (MapReg) block to integrate the merits of both heatmap encoding and position regression paradigms to improve 2D accuracy and temporal coherence. Furthermore, MapReg is followed by pose pooling and pose-to-vertex lifting approaches, which transform 2D pose encodings to semantic features of 3D vertices. Overall, our hand reconstruction framework, called MobRecon, comprises affordable computational costs and miniature model size, which reaches a high inference speed of 83 FPS on Apple A14 CPU. Extensive experiments on popular datasets such as FreiHAND, RHD, and HO3Dv2 demonstrate that our MobRecon achieves superior performance on reconstruction accuracy and temporal coherence. Our code is publicly available at https://github.com/SeanChenxy/HandMesh.
