Face manipulation has advanced remarkably with the flourishing of Generative Adversarial Networks. However, because structure and texture are difficult to control at high resolution, it remains challenging to model pose and expression simultaneously during manipulation. In this paper, we propose a novel framework that decomposes face manipulation under extreme pose and expression into two correlated stages: a boundary prediction stage and a disentangled face synthesis stage. In the first stage, we propose to use a boundary image for joint pose and expression modeling. An encoder-decoder network predicts the boundary image of the target face in a semi-supervised way, and pose and expression estimators are employed to improve the prediction accuracy. In the second stage, the predicted boundary image and the original face are encoded into the structure and texture latent spaces by two encoder networks, respectively. A proxy network and a feature threshold loss are further imposed to disentangle the latent space. Furthermore, given the lack of high-resolution face databases for verifying the effectiveness of our method, we collect a new high-quality Multi-View Face (MVF-HQ) database at 6000 × 4000 resolution. It contains 120,283 images of 479 identities with diverse pose, expression, and illumination variations, making it much larger in scale and much higher in resolution than current public high-resolution face manipulation databases. We expect it to push forward the advance of face manipulation. Qualitative and quantitative experiments on four databases show that our method dramatically improves the visual quality of face manipulation.