ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054469
Speech-Driven Facial Animation Using Polynomial Fusion of Features

Abstract: Speech-driven facial animation involves using a speech signal to generate realistic videos of talking faces. Recent deep learning approaches to facial synthesis rely on extracting low-dimensional representations and concatenating them, followed by a decoding step of the concatenated vector. This accounts for only first-order interactions of the features and ignores higher-order interactions. In this paper we propose a polynomial fusion layer that models the joint representation of the encodings by a higher-ord…
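To make the contrast concrete, below is a minimal, hypothetical sketch (in PyTorch) of how a polynomial fusion of feature encodings differs from plain concatenation: each encoding is projected into a shared low-rank space, and an element-wise product term supplies the higher-order multiplicative interactions that concatenation alone cannot capture. The class name, rank-sharing scheme, and polynomial order are illustrative assumptions, not the exact layer proposed in the paper.

import torch
import torch.nn as nn

class PolynomialFusion(nn.Module):
    # Illustrative low-rank fusion of several encodings (hypothetical sketch).
    # Each encoding is projected to a shared rank-r space; the first-order term
    # is the sum of projections (roughly what concatenation plus a linear map
    # gives), and the element-wise product adds a higher-order multiplicative term.
    def __init__(self, dims, rank, out_dim):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, rank) for d in dims])
        self.out = nn.Linear(rank, out_dim)

    def forward(self, encodings):
        projected = [p(z) for p, z in zip(self.proj, encodings)]
        first_order = sum(projected)
        higher_order = projected[0]
        for z in projected[1:]:
            higher_order = higher_order * z          # multiplicative interactions
        return self.out(first_order + higher_order)

# Example: identity, audio, and expression encodings of different sizes
fusion = PolynomialFusion(dims=[128, 256, 64], rank=96, out_dim=128)
z_id, z_audio, z_expr = torch.randn(4, 128), torch.randn(4, 256), torch.randn(4, 64)
joint = fusion([z_id, z_audio, z_expr])              # (4, 128) joint representation

Explicitly forming the full outer product of the encodings would be prohibitively large, which is why practical polynomial fusion layers rely on low-rank factorisations such as the projections above.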

Cited by 6 publications (3 citation statements) | References 96 publications
“…More specifically, one needs to disentangle shape and motion factors and subsequently exploit them in reconstructing 3D visual objects. Face editing [178], [179] and talking faces synthesis [180] are further problems requiring the manipulation of multiple latent variables.…”
Section: A Multiple Factors Analysis In Computer Vision
mentioning confidence: 99%
“…A VGG-M network pretrained on the VGG Face dataset [293] is employed in [281] to learn face features, and a speech encoder modified from VGG-M is used to learn a speech embedding. Three temporal encoders are used to extract representations of the speaker's identity, the audio segment, and the facial expressions, and a polynomial fusion layer is designed to generate a joint representation of the three encodings [260]. Differently, Mittal et al. [266] develop a VAE to disentangle the phonetic content, emotional tone, and other factors into different representations solely from the input audio signal.…”
Section: Multi-modal Feature Extraction
mentioning confidence: 99%
“…The input of the decoder is the output of the encoder combined with 10-dimensional random noise. As described in [27], the random noise can introduce natural variability into the audio-visual generation task. We used multiple random noise samples to generate varied mapping gestures from a single speech audio clip.…”
Section: B GAN Model For Speech-driven Gesture Generation
mentioning confidence: 99%
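To illustrate the noise-injection step mentioned in that excerpt, here is a minimal, hypothetical sketch assuming the 10-dimensional noise vector is concatenated to the encoder output before decoding; whether the cited model concatenates or adds the noise, and the encoder dimensionality used below, are assumptions for illustration.

import torch

def decoder_input(encoder_output, noise_dim=10):
    # Hypothetical sketch: draw a fresh 10-D Gaussian noise vector and append it
    # to the encoder output, so repeated sampling yields varied decoder inputs
    # (and hence varied generated gestures) for the same speech segment.
    noise = torch.randn(encoder_output.size(0), noise_dim)
    return torch.cat([encoder_output, noise], dim=-1)

speech_enc = torch.randn(1, 256)                          # assumed encoder output size
variants = [decoder_input(speech_enc) for _ in range(3)]  # three noise draws, three inputs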