“…Driving a static portrait with audio is of great importance to a variety of applications in the field of entertainment, such as digital human animation, visual dubbing in movies, and fast creation of short videos. Armed with deep learning, previous researchers have taken two different paths towards audio-driven talking face generation: 1) pure latent feature learning and image reconstruction [14,77,9,72,50,55,44], and 2) relying on structural intermediate representations such as 2D landmarks [51,10,18] or 3D representations [1,52,49,8,75,46,28]. Though great progress has been made in generating accurate mouth movements, most previous methods fail to model head pose, one of the key factors in making talking faces look natural.…”