“…Therefore, facial landmarks can be used to represent face-related characteristics, e.g., face shape, head pose, and mouth shape, and it is easy to build a mapping between facial landmarks and the facial expression in a photo. Facial landmarks have recently become a popular intermediate representation for bridging the gap between the raw audio signal and photo-realistic video [42], [5], [43], [10], [44]. However, these methods suffer from several limitations: they can be used only for the person in the training data [42] and not for arbitrary persons, they depend on reference videos to provide pose information [42], [44], or they predict no head movement and can only present a static head pose [10].…”