“…To train the network end-to-end, recent methods use differentiable renderers [28,22,80,49] together with photo, perceptual, and landmark losses [21,19,28,67,71] to optimize the network in a self-supervised manner. Different from these coarse shape … Given a monocular image, we regress its shape and detail coefficients to synthesize a realistic 3D face, and use a differentiable renderer [28] to train the whole model end-to-end on synthetic [72,57] and real-world [40,52] images. (b).…”
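
To make the self-supervised objective mentioned in the excerpt concrete, the following is a minimal sketch of how a photo loss, a perceptual loss, and a landmark loss might be combined into one training signal. Everything specific here is an assumption made for illustration and is not taken from the quoted paper: the function names, the exact form of each loss, the face mask, the feature vectors standing in for a pretrained network's output, and the weights `w_photo`, `w_perc`, and `w_lmk`. NumPy is used only to show the arithmetic; real training would express the same terms in an autodiff framework so that gradients can flow back through the differentiable renderer.

```python
import numpy as np


def photo_loss(rendered, target, mask):
    """Mean per-pixel L2 color distance inside the face mask.

    rendered, target: (H, W, 3) images; mask: (H, W) with 1 on face pixels.
    """
    per_pixel = np.linalg.norm(rendered - target, axis=-1)  # (H, W)
    return float((per_pixel * mask).sum() / max(float(mask.sum()), 1.0))


def landmark_loss(pred_lmk, gt_lmk):
    """Mean squared distance between projected and detected 2D landmarks.

    pred_lmk, gt_lmk: (N, 2) arrays of pixel coordinates.
    """
    return float(np.mean(np.sum((pred_lmk - gt_lmk) ** 2, axis=-1)))


def perceptual_loss(feat_rendered, feat_target, eps=1e-8):
    """One minus cosine similarity between two 1-D feature vectors.

    The features are assumed to come from some pretrained network
    (hypothetical here; any fixed feature extractor would do).
    """
    num = float(np.dot(feat_rendered, feat_target))
    den = float(np.linalg.norm(feat_rendered) * np.linalg.norm(feat_target)) + eps
    return 1.0 - num / den


def total_loss(rendered, target, mask, pred_lmk, gt_lmk,
               feat_rendered, feat_target,
               w_photo=1.0, w_perc=1.0, w_lmk=1.0):
    """Weighted sum of the three terms; the weights are illustrative only."""
    return (w_photo * photo_loss(rendered, target, mask)
            + w_perc * perceptual_loss(feat_rendered, feat_target)
            + w_lmk * landmark_loss(pred_lmk, gt_lmk))
```

In this sketch no 3D ground truth appears anywhere: every term compares the rendered output against the input image or quantities derived from it, which is the sense in which such training is self-supervised.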