Figure 1: Face manipulation results on in-the-wild samples, obtained by transferring knowledge learned from the CelebA dataset. The first column (Input) shows the input images; the remaining columns (Face Manipulation Results of Our Model) show images generated by AF-VAE conditioned on target expression/rotation boundary maps. Note that the model is fine-tuned on 256 × 256 frames taken from YouTube movie clips. None of the generated poses were seen by the model beforehand.
1 Work done during an internship at SenseTime Research.

Abstract
Recent studies have shown remarkable success in face manipulation with the advance of the GAN and VAE paradigms, but the outputs are sometimes limited to low resolution and lack diversity. In this work, we propose the Additive Focal Variational Auto-encoder (AF-VAE), a novel approach that can arbitrarily manipulate high-resolution face images using a simple yet effective model and only the weak supervision of reconstruction and KL divergence losses. First, a novel additive Gaussian Mixture assumption is introduced with an unsupervised clustering mechanism in the structural latent space, which yields better disentanglement and boosts multi-modal representation with external memory. Second, to improve the perceptual quality of synthesized results, two simple strategies in architecture design are tailored and, for the first time, discussed with respect to the behavior of the Human Visual System (HVS), allowing fine control over model complexity and sample quality. Human opinion studies and new state-of-the-art Inception Score (IS) / Fréchet Inception Distance (FID) results demonstrate the superiority of our approach over existing algorithms, advancing both the fidelity and extremity of the face manipulation task.
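The abstract says the model is trained with only reconstruction and KL divergence losses. As a rough illustration of that kind of objective, here is a minimal sketch of a generic VAE loss in plain Python. It is not the AF-VAE formulation: the function names, the L1 reconstruction term, the `kl_weight` parameter, and the optional `prior_mu` (a shifted unit-variance prior, loosely standing in for one mixture component) are all assumptions made for illustration.

```python
import math

def kl_diag_gaussian(mu, log_var, prior_mu=None):
    """KL( N(mu, diag(exp(log_var))) || N(prior_mu, I) ), summed over dimensions.

    prior_mu is a hypothetical shifted prior mean (e.g. a cluster centre);
    it defaults to the zero vector, i.e. the standard normal prior.
    """
    if prior_mu is None:
        prior_mu = [0.0] * len(mu)
    return 0.5 * sum(
        math.exp(lv) + (m - p) ** 2 - 1.0 - lv
        for m, lv, p in zip(mu, log_var, prior_mu)
    )

def l1_reconstruction(x, x_hat):
    """Mean absolute error between a target and its reconstruction (assumed choice)."""
    return sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)

def vae_loss(x, x_hat, mu, log_var, prior_mu=None, kl_weight=1.0):
    """Reconstruction term plus weighted KL term; kl_weight is an assumed knob."""
    return l1_reconstruction(x, x_hat) + kl_weight * kl_diag_gaussian(mu, log_var, prior_mu)
```

When the posterior mean coincides with the prior mean and the variance is one, the KL term vanishes, so the loss reduces to the reconstruction error alone.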