2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.00338
Expressive Talking Head Generation with Granular Audio-Visual Control

Cited by 81 publications (27 citation statements) · References 37 publications
“…As a branch, speech-driven facial animation aims to reenact a person in sync with input speech sequences. While extensive literature in this field works on 2D talking heads [1, 7-9, 11, 23, 28, 29, 38, 39, 48, 52, 62, 64, 68, 69], we focus on facial animation on 3D models in this work, which can be roughly categorized into linguistics-based and learning-based methods.…”
Section: Speech-Driven 3D Facial Animation (mentioning)
confidence: 99%
“…Audio-driven talking head generation [13, 19, 38, 14] is another popular direction on this topic: since audio sequences carry no face-identity information, it is relatively easier to disentangle the motion information from the input audio. Liang et al. [16] explicitly divide the driving audio into granular parts through delicate priors to control the lip shape, face pose, and facial expression.…”
Section: Related Work (mentioning)
confidence: 99%
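The granular control described in the statement above amounts to encoding the driving audio once and splitting it into separate latent codes, one per controllable factor. Below is a minimal PyTorch sketch of that idea; the module name GranularAudioEncoder, the GRU backbone, and all dimensions are illustrative assumptions, not the actual architecture of Liang et al. [16].

import torch
import torch.nn as nn

class GranularAudioEncoder(nn.Module):
    def __init__(self, n_mels=80, hidden=256, code=64):
        super().__init__()
        # Shared temporal encoder over mel-spectrogram frames.
        self.backbone = nn.GRU(n_mels, hidden, batch_first=True)
        # Separate heads emit one latent code per controllable factor.
        self.lip_head = nn.Linear(hidden, code)   # lip shape
        self.pose_head = nn.Linear(hidden, code)  # head pose
        self.expr_head = nn.Linear(hidden, code)  # facial expression

    def forward(self, mel):            # mel: (B, T, n_mels)
        feats, _ = self.backbone(mel)  # feats: (B, T, hidden)
        return (self.lip_head(feats),
                self.pose_head(feats),
                self.expr_head(feats))  # each: (B, T, code)

# Usage: a batch of 2 clips, 100 mel frames each.
enc = GranularAudioEncoder()
lip, pose, expr = enc(torch.randn(2, 100, 80))
print(lip.shape, pose.shape, expr.shape)  # 3 x torch.Size([2, 100, 64])

Feeding the three codes to a downstream motion decoder separately is what makes the control "granular": each factor can be swapped or held fixed independently of the others.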
“…The W-GAN learns the distribution of facial expression dynamics for different classes, from which new facial expression motions are synthesized and transformed into videos by the Texture-GAN. Other works have investigated guiding facial expression generation with speech audio data (Chen et al., 2020; Guo et al., 2021; Wang et al., 2022; Liang et al., 2022), or with a combination of audio and facial landmark information (Wang et al., 2021; Wu et al., 2021; Sinha et al., 2022). All of the aforementioned methods generate a single frame per time-step, which weakens the dependency between video frames and causes a lack of spatio-temporal consistency.…”
Section: Related Work (mentioning)
confidence: 99%
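The consistency issue raised in this last statement follows from generating each frame independently: without an explicit temporal term, no gradient ties frame t to frame t-1. A common generic mitigation is a smoothness penalty on consecutive generated frames; the sketch below is an illustrative regularizer under that assumption, not a loss used by any of the cited works.

import torch

def temporal_smoothness_loss(frames):
    """frames: (B, T, C, H, W) video produced frame-by-frame."""
    # Penalize large pixel changes between consecutive frames, coupling
    # adjacent time-steps that a per-frame generator otherwise ignores.
    diff = frames[:, 1:] - frames[:, :-1]
    return diff.abs().mean()

# Usage: a toy batch of 2 videos, 16 frames of 64x64 RGB.
video = torch.rand(2, 16, 3, 64, 64, requires_grad=True)
loss = temporal_smoothness_loss(video)
loss.backward()  # gradients now flow across adjacent frames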