Audio-Driven Emotional Video Portraits

Ji, Xinya; Zhou, Hang; Wang, Kaisiyuan; Wu, Wayne; Loy, Chen Change; Cao, Xun; Xu, Feng

doi:10.1109/cvpr46437.2021.01386

Cited by 159 publications

(76 citation statements)

References 27 publications


“…Driving a static portrait with audio is of great importance to a variety of applications in the field of entertainment, such as digital human animation, visual dubbing in movies, and fast creation of short videos. Armed with deep learning, previous researchers take two different paths towards analyzing audio-driven talking human faces: 1) through pure latent feature learning and image reconstruction [14,77,9,72,50,55,44], and 2) to borrow the help of structural intermediate representations such as 2D landmarks [51,10,18] or 3D representations [1,52,49,8,75,46,28]. Though great progress has been made in generating accurate mouth movements, most previous methods fail to model head pose, one of the key factors for talking faces to look natural.…”

Section: Introduction (mentioning)

confidence: 99%

Talking Face Generation by Adversarially Disentangled Audio-Visual Representation

Zhou

Liu

et al. 2019

AAAI

Self Cite



Talking face generation aims to synthesize a sequence of face images that correspond to a clip of speech. This is a challenging task because face appearance variation and semantics of speech are coupled together in the subtle movements of the talking face regions. Existing works either construct specific face appearance model on specific subjects or model the transformation between lip motion and speech. In this work, we integrate both aspects and enable arbitrary-subject talking face generation by learning disentangled audio-visual representation. We find that the talking face sequence is actually a composition of both subject-related information and speech-related information. These two spaces are then explicitly disentangled through a novel associative-and-adversarial training process. This disentangled representation has an advantage where both audio and video can serve as inputs for generation. Extensive experiments show that the proposed approach generates realistic talking face sequences on arbitrary subjects with much clearer lip motion patterns than previous work. We also demonstrate the learned audio-visual representation is extremely useful for the tasks of automatic lip reading and audio-video retrieval.


“…With the rapid growth of deep neural networks, end-to-end frameworks are proposed. One category of methods, namely image reconstruction-based methods, generate talking face by latent feature learning and image reconstruction [11,14,22,46,47,57,60,63,66,80,82,83,85]. For example, Chung et al. [11] propose the first end-to-end method with an encoder-decoder pipeline.…”

Section: Related Work (mentioning)

confidence: 99%

Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation

Liu

Xu

Wu

et al. 2022

Preprint

Self Cite


Animating high-fidelity video portrait with speech audio is crucial for virtual reality and digital entertainment. While most previous studies rely on accurate explicit structural information, recent works explore the implicit scene representation of Neural Radiance Fields (NeRF) for realistic generation. In order to capture the inconsistent motions as well as the semantic difference between human head and torso, some work models them via two individual sets of NeRF, leading to unnatural results. In this work, we propose Semantic-aware Speaking Portrait NeRF (SSP-NeRF), which creates delicate audio-driven portraits using one unified set of NeRF. The proposed model can handle the detailed local facial semantics and the global head-torso relationship through two semantic-aware modules. Specifically, we first propose a Semantic-Aware Dynamic Ray Sampling module with an additional parsing branch that facilitates audio-driven volume rendering. Moreover, to enable portrait rendering in one unified neural radiance field, a Torso Deformation module is designed to stabilize the large-scale non-rigid torso motions. Extensive evaluations demonstrate that our proposed approach renders more realistic video portraits compared to previous methods. Project page: https://alvinliu0.github.io/projects/SSP-NeRF.


“…Chen et al. [32] and Zhou et al. [33] proposed different methods to synthesize talking faces with controllable head poses to achieve natural head movements. Some studies have also attempted to generate specific emotional video portraits [34,35]. Eskimez et al. [34] introduced an emotion encoder to generate a talking face video with a specific emotion, whereas our method generates talking face videos with specific AUs.…”

Section: Audio-driven Talking Head Generation (mentioning)

confidence: 99%

Talking Head Generation Driven by Speech-Related Facial Action Units and Audio- Based on Multimodal Representation Fusion

Chen

Liu

Liu

et al. 2022

Preprint


Talking head generation is to synthesize a lip-synchronized talking head video by inputting an arbitrary face image and corresponding audio clips. Existing methods ignore not only the interaction and relationship of cross-modal information, but also the local driving information of the mouth muscles. In this study, we propose a novel generative framework that contains a dilated non-causal temporal convolutional self-attention network as a multimodal fusion module to promote the relationship learning of cross-modal features. In addition, our proposed method uses both audio- and speech-related facial action units (AUs) as driving information. Speech-related AU information can guide mouth movements more accurately. Because speech is highly correlated with speech-related AUs, we propose an audio-to-AU module to predict speech-related AU information. We utilize a pre-trained AU classifier to ensure that the generated images contain correct AU information. We verify the effectiveness of the proposed model on the GRID and TCD-TIMIT datasets. An ablation study is also conducted to verify the contribution of each component. The results of quantitative and qualitative experiments demonstrate that our method outperforms existing methods in terms of both image quality and lip-sync accuracy.
