Extreme 3D Face Reconstruction: Seeing Through Occlusions

Tran, Anh; Hassner, Tal; Masi, Iacopo; Paz, Eran; Nirkin, Yuval; Medioni, Gérard

doi:10.1109/cvpr.2018.00414

Cited by 172 publications

(78 citation statements)

References 61 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…We evaluate RingNet qualitatively and quantitatively and compare our results with publicly available methods, namely: PRNet (ECCV 2018 [9]), Extreme3D (CVPR 2018 [35]) and 3DMM-CNN (CVPR 2017 [34]).…”

Section: Methodsmentioning

confidence: 99%

See 1 more Smart Citation

Learning to Regress 3D Face Shape and Expression From an Image Without 3D Supervision

Sanyal

Bolkart

Feng

et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

298

286

View full text Add to dashboard Cite

Figure 1: Without 3D supervision, RingNet learns a mapping from the pixels of a single image to the 3D facial parameters of the FLAME model [21]. Top: Images are from the CelebA dataset [22]. Bottom: estimated shape, pose and expression. AbstractThe estimation of 3D face shape from a single image must be robust to variations in lighting, head pose, expression, facial hair, makeup, and occlusions. Robustness requires a large training set of in-the-wild images, which by construction, lack ground truth 3D shape. To train a network without any 2D-to-3D supervision, we present RingNet, which learns to compute 3D face shape from a single image. Our key observation is that an individual's face shape is constant across images, regardless of expression, pose, lighting, etc. RingNet leverages multiple images of a person and automatically detected 2D face features. It uses a novel loss that encourages the face shape to be similar when the identity is the same and different for different people. We achieve invariance to expression by representing the face using the FLAME model. Once trained, our method takes a single image and outputs the parameters of FLAME, which can be readily animated. Additionally we create a new database of faces "not quite in-the-wild" (NoW) with 3D head scans and high-resolution images of the subjects in a wide variety of conditions. We evaluate publicly available methods and find that RingNet is more accurate than methods that use 3D supervision. The dataset, model, and results are available for research purposes at

show abstract

Section: Methodsmentioning

confidence: 99%

“…NoW is more complex than previous datasets and we use it to evaluate all recent methods with publicly available implementations. Specifically we compare with [34], [35] and [9], which are trained with 3D supervision. Despite not having any 2D-to-3D supervision our RingNet method recovers more accurate 3D face shape.…”

Section: Introductionmentioning

confidence: 99%

Learning to Regress 3D Face Shape and Expression From an Image Without 3D Supervision

Sanyal

Bolkart

Feng

et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

298

286

View full text Add to dashboard Cite

show abstract

“…To this end, we introduce X2Face, a novel self-supervised network architecture that can be used for face puppeteering of a source face given a driving vector. fitting of 3DMMs by including high level details [34,41], taking into account additional images [33] or 3D scans [4], or by learning 3DMM parameters directly from RGB data without ground truth labels [39,2]. Please refer to Zollhöfer et.…”

Section: Introductionmentioning

confidence: 99%

X2Face: A Network for Controlling Face Generation Using Images, Audio, and Pose Codes

Wiles

Koepke

Zisserman

2018

Lecture Notes in Computer Science

391

385

View full text Add to dashboard Cite

The objective of this paper is a neural network model that controls the pose and expression of a given face, using another face or modality (e.g. audio). This model can then be used for lightweight, sophisticated video and image editing. We make the following three contributions. First, we introduce a network, X2Face, that can control a source face (specified by one or more frames) using another face in a driving frame to produce a generated frame with the identity of the source frame but the pose and expression of the face in the driving frame. Second, we propose a method for training the network fully self-supervised using a large collection of video data. Third, we show that the generation process can be driven by other modalities, such as audio or pose codes, without any further training of the network. The generation results for driving a face with another face are compared to state-of-the-art self-supervised/supervised methods. We show that our approach is more robust than other methods, as it makes fewer assumptions about the input data. We also show examples of using our framework for video face editing.* Denotes equal contribution.The source face is instantiated from a single or multiple source frames, which are extracted from the same face track. The driving vector may come from multiple modalities: a driving frame from the same or another video face track, pose information, or audio information; this is illustrated in Fig. 1. The generated frame resulting from X2Face has the identity, hairstyle, etc. of the source face but the properties of the driving vector (e.g. the given pose, if pose information is given; or the driving frame's expression/pose, if a driving frame is given). The network is trained in a self-supervised manner using pairs of source and driving frames. These frames are input to two subnetworks: the embedding network and the driving network (see Fig. 2). By controlling the information flow in the network architecture, the model learns to factorise the problem. The embedding network learns an embedded face representation for the source face -effectively face frontalisation; the driving network learns how to map from this embedded face representation to the generated frame via an embedding, named the driving vector. The X2Face network architecture is described in Section 3.1, and the self-supervised training framework in Section 3.2. In addition we make two further contributions. First, we propose a method for linearly regressing from a set of labels (e.g. for head pose) or features (e.g. from audio) to the driving vector; this is described in Section 4. The performance is evaluated in Section 5, where we show (i) the robustness of the generated results compared to state-of-the-art self-supervised [45] and supervised [1] methods; and (ii) the controllability of the network using other modalities, such as audio or pose. The second contribution, described in Section 6, shows how the embedded face representation can be used for video face editing, e.g. adding facial decorations in the manne...

show abstract

“…Although their goal is to robustly find the head pose, deocclusion in not within the scope of their work. Tran et al [26] is the first to address the problem of detailed face reconstruction from occluded images by filling in the corrupted region of the bump map using a similar patch in a reference dataset. Although this method can generate a complete representation of face details, the de-occluded face image is not reconstructed [26].…”

Section: Related Workmentioning

confidence: 99%

Face De-Occlusion Using 3D Morphable Model and Generative Adversarial Network

Yuan

Park

2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

View full text Add to dashboard Cite

In recent decades, 3D morphable model (3DMM) has been commonly used in image-based photorealistic 3D face reconstruction. However, face images are often corrupted by serious occlusion by non-face objects including eyeglasses, masks, and hands. Such objects block the correct capture of landmarks and shading information. Therefore, the reconstructed 3D face model is hardly reusable. In this paper, a novel method is proposed to restore de-occluded face images based on inverse use of 3DMM and generative adversarial network. We utilize the 3DMM prior to the proposed adversarial network and combine a global and local adversarial convolutional neural network to learn face deocclusion model. The 3DMM serves not only as geometric prior but also proposes the face region for the local discriminator. Experiment results confirm the effectiveness and robustness of the proposed algorithm in removing challenging types of occlusions with various head poses and illumination. Furthermore, the proposed method reconstructs the correct 3D face model with de-occluded textures.

show abstract

Extreme 3D Face Reconstruction: Seeing Through Occlusions

Cited by 172 publications

References 61 publications

Learning to Regress 3D Face Shape and Expression From an Image Without 3D Supervision

Learning to Regress 3D Face Shape and Expression From an Image Without 3D Supervision

X2Face: A Network for Controlling Face Generation Using Images, Audio, and Pose Codes

Face De-Occlusion Using 3D Morphable Model and Generative Adversarial Network

Contact Info

Product

Resources

About