2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.01386
Audio-Driven Emotional Video Portraits

Cited by 159 publications (76 citation statements)
References 27 publications
“…Driving a static portrait with audio is of great importance to a variety of applications in entertainment, such as digital human animation, visual dubbing in movies, and the fast creation of short videos. Armed with deep learning, previous researchers have taken two different paths toward audio-driven talking faces: 1) pure latent feature learning and image reconstruction [14,77,9,72,50,55,44], and 2) structural intermediate representations such as 2D landmarks [51,10,18] or 3D representations [1,52,49,8,75,46,28]. Though great progress has been made in generating accurate mouth movements, most previous methods fail to model head pose, one of the key factors that makes talking faces look natural.…”
Section: Introduction
confidence: 99%
“…With the rapid growth of deep neural networks, end-to-end frameworks have been proposed. One category of methods, namely image reconstruction-based methods, generates talking faces via latent feature learning and image reconstruction [11,14,22,46,47,57,60,63,66,80,82,83,85]. For example, Chung et al. [11] propose the first end-to-end method with an encoder-decoder pipeline.…”
Section: Related Work
confidence: 99%
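The encoder-decoder pipeline described above can be sketched at a high level. The following is a minimal, hypothetical illustration of the image reconstruction-based family of methods — not the actual architecture from Chung et al. [11] or the paper under discussion: an audio window and a reference identity frame are each encoded into a latent vector, and a decoder maps their concatenation back to a synthesized face frame. All layer sizes, weights, and function names here are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch of an encoder-decoder talking-face pipeline:
# audio latent + identity latent -> decoded face frame.
# Random weights stand in for trained parameters; dimensions are toy-sized.

rng = np.random.default_rng(0)

AUDIO_DIM, IMG_DIM, LATENT = 128, 64 * 64, 256  # toy sizes (assumptions)

def dense_relu(x, w, b):
    """Single fully connected layer followed by ReLU."""
    return np.maximum(0.0, x @ w + b)

# Untrained stand-in parameters for the two encoders and the decoder.
w_audio = rng.standard_normal((AUDIO_DIM, LATENT)) * 0.01
w_id = rng.standard_normal((IMG_DIM, LATENT)) * 0.01
w_dec = rng.standard_normal((2 * LATENT, IMG_DIM)) * 0.01
b_audio, b_id, b_dec = np.zeros(LATENT), np.zeros(LATENT), np.zeros(IMG_DIM)

def synthesize_frame(audio_feat, identity_frame):
    """One forward pass: encode audio and identity, decode a face frame."""
    z_audio = dense_relu(audio_feat, w_audio, b_audio)       # audio latent
    z_id = dense_relu(identity_frame, w_id, b_id)            # identity latent
    z = np.concatenate([z_audio, z_id])                      # joint latent
    return dense_relu(z, w_dec, b_dec)                       # flattened frame

frame = synthesize_frame(rng.standard_normal(AUDIO_DIM),
                         rng.standard_normal(IMG_DIM))
print(frame.shape)  # -> (4096,), i.e. a flattened 64x64 frame
```

In real systems of this category, the dense layers would be convolutional encoders/decoders trained end-to-end with a reconstruction loss, and the audio features would typically be mel-spectrogram windows rather than raw vectors.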
“…Chen et al. [32] and Zhou et al. [33] proposed different methods to synthesize talking faces with controllable head poses to achieve natural head movements. Some studies have also attempted to generate specific emotional video portraits [34,35]. Eskimez et al. [34] introduced an emotion encoder to generate a talking face video with a specific emotion, whereas our method generates talking face videos with specific AUs.…”
Section: Audio-driven Talking Head Generation
confidence: 99%