This paper presents a neural network that estimates a detailed depth map of the foreground human in a single RGB image. The result captures geometric details such as clothing wrinkles, which are important in visualization applications. To achieve this goal, we separate the depth map into a smooth base shape and a residual detail shape, and design a network with two branches that regress each of them. We design a training strategy to ensure that both the base and detail shapes are faithfully learned by their corresponding network branches. Furthermore, we introduce a novel network layer that fuses a rough depth map with surface normals to further improve the final result. Quantitative comparison with fused 'ground truth' captured by real depth cameras and qualitative examples on unconstrained Internet images demonstrate the strength of the proposed method. Our code will be released at Link
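The base/detail split described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the smooth base shape is defined, so this example assumes a simple box (mean) filter as the low-pass step; `box_blur`, `split_depth`, and `merge_depth` are hypothetical names for illustration, not the authors' code. It only shows that a depth map can be written as a smooth base plus a residual, and that summing the two recovers the original.

```python
def box_blur(depth, radius=1):
    """Mean filter over a (2*radius+1)^2 window, truncated at the borders.

    Assumption: stands in for whatever low-pass step defines the base shape.
    """
    h, w = len(depth), len(depth[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        total += depth[yy][xx]
                        count += 1
            out[y][x] = total / count
    return out


def split_depth(depth, radius=1):
    """Return (base, detail) where detail = depth - base, per pixel."""
    base = box_blur(depth, radius)
    detail = [
        [d - b for d, b in zip(depth_row, base_row)]
        for depth_row, base_row in zip(depth, base)
    ]
    return base, detail


def merge_depth(base, detail):
    """Recombine the two components into a full depth map."""
    return [
        [b + r for b, r in zip(base_row, detail_row)]
        for base_row, detail_row in zip(base, detail)
    ]


# Toy 4x4 depth map: a gentle slope with one small bump (a "wrinkle").
depth = [
    [1.0, 1.1, 1.2, 1.3],
    [1.0, 1.1, 1.25, 1.3],
    [1.0, 1.1, 1.2, 1.3],
    [1.0, 1.1, 1.2, 1.3],
]
base, detail = split_depth(depth, radius=1)
merged = merge_depth(base, detail)
```

In this sketch the residual is small in magnitude compared with the base, which is one plausible reason to regress the two components in separate branches; the actual training strategy is described in the paper itself.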
Figure 1: Comparison between state-of-the-art learning-based multi-view stereo approaches [4, 44, 45] and MVSNet+Ours. (a)-(d): Reconstructed point clouds of MVSNet [44], R-MVSNet [45], Point-MVSNet [4], and MVSNet+Ours. (e) and (f): Reconstruction accuracy plotted against GPU memory and run-time, respectively. The resolution of the input images is 1152 × 864.