2019
DOI: 10.1007/s11263-019-01251-8

Realistic Speech-Driven Facial Animation with GANs

Abstract: Speech-driven facial animation is the process that automatically synthesizes talking characters based on speech signals. The majority of work in this domain creates a mapping from audio features to visual features. This approach often requires post-processing using computer graphics techniques to produce realistic albeit subject dependent results. We present an end-to-end system that generates videos of a talking head, using only a still image of a person and an audio clip containing speech, without relying on…

Cited by 266 publications (313 citation statements)
References 38 publications
“…The proposed architecture is shown in Fig. 1 and is based on our prior work on speech-driven facial animation [17]. The model is a temporal encoder-decoder which takes a still image (frame from a 25 fps video) and an audio signal as input.…”
Section: Self Supervised Speech Representation Learning By Facial Animation
confidence: 99%
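The quoted citation describes a temporal encoder-decoder that maps a single still frame plus an audio waveform to a video at 25 fps. A minimal shape-level sketch of that data flow is below; the sample rate, embedding size, frame resolution, and all function names are assumptions for illustration, not the cited model's actual architecture:

```python
import numpy as np

FPS = 25                       # video frame rate stated in the quoted text
SR = 16000                     # assumed audio sample rate
SAMPLES_PER_FRAME = SR // FPS  # 640 audio samples drive each video frame

def encode_identity(still_image):
    """Stand-in identity encoder: a fixed-size embedding of the still frame."""
    return still_image.reshape(-1)[:128]  # assumed 128-dim identity code

def encode_audio(audio, num_frames):
    """Chunk the waveform into one window per output video frame."""
    windows = audio[: num_frames * SAMPLES_PER_FRAME]
    windows = windows.reshape(num_frames, SAMPLES_PER_FRAME)
    return windows.mean(axis=1, keepdims=True)  # toy per-frame content code

def decode(identity, content):
    """Stand-in decoder: emits one 96x128 RGB frame per audio window."""
    T = content.shape[0]
    return np.zeros((T, 96, 128, 3)) + content[:, :, None, None]

still = np.random.rand(96, 128, 3)   # the single input frame
audio = np.random.rand(SR * 3)       # 3 seconds of speech
T = 3 * FPS                          # 75 output frames
video = decode(encode_identity(still), encode_audio(audio, T))
print(video.shape)                   # (75, 96, 128, 3)
```

The key structural point the sketch captures is that the identity code is computed once while the audio contributes one conditioning vector per output frame.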
“…multi-task speech representations by leveraging the visual modality (inspired by our prior work [17]). Specifically, we make the following research contributions: (i) We animate a still image to generate speech video by conditioning on the corresponding audio.…”
Section: Introduction
confidence: 99%
“…We can perform further parameter sharing by assuming that the mode-2 and mode-3 factor matrices are equivalent to the matrices describing the row spaces, i.e.: B^{(2)}…”
Section: Polynomial Fusion Layer
confidence: 99%
“…We ran experiments on the above datasets using the following methodologies for the polynomial fusion layer: (3), (4), (5), (6b) (PF-CMF-SR) We set a = 256, d = 128, n = 10 and trained on video sequences of 3 seconds with frame size 128 × 96 as per [2]. For all models, c = a + d + n = 394, implying m = 384, given that c = m + n by construction.…”
Section: Training Protocol
confidence: 99%
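The dimension bookkeeping in the quoted training protocol can be checked directly: with the stated values a = 256, d = 128, n = 10, the constraint c = a + d + n gives c = 394, and c = m + n then implies m = 384. A one-line verification (variable names taken from the quote):

```python
# Values stated in the quoted training protocol.
a, d, n = 256, 128, 10
c = a + d + n   # total fused dimension
m = c - n       # since c = m + n by construction
print(c, m)     # 394 384
```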