Realistic Speech-Driven Facial Animation with GANs

Vougioukas, Konstantinos; Petridis, Stavros; Pantić, Maja

doi:10.1007/s11263-019-01251-8

Cited by 266 publications

(313 citation statements)

References 38 publications

Citation statements by type: Supporting, Mentioning (309), Contrasting, Unclassified


“…The proposed architecture is shown in Fig. 1 and is based on our prior work on speech-driven facial animation [17]. The model is a temporal encoder-decoder which takes a still image (frame from a 25 fps video) and an audio signal as input.…”

Section: Self Supervised Speech Representation Learning By Facial Ani… (mentioning)

confidence: 99%


Visually Guided Self Supervised Learning of Speech Representations

Shukla, Vougioukas, et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite


Self supervised representation learning has recently attracted a lot of research interest for both the audio and visual modalities. However, most works typically focus on a particular modality or feature alone and there has been very limited work that studies the interaction between the two modalities for learning self supervised representations. We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech. We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment. Through this process, the audio encoder network learns useful speech representations that we evaluate on emotion recognition and speech recognition. We achieve state of the art results for emotion recognition and competitive results for speech recognition. This demonstrates the potential of visual supervision for learning audio representations as a novel way for self-supervised learning which has not been explored in the past. The proposed unsupervised audio features can leverage a virtually unlimited amount of training data of unlabelled audiovisual speech and have a large number of potentially promising applications.


“…multi-task speech representations by leveraging the visual modality (inspired by our prior work [17]). Specifically, we make the following research contributions: (i) We animate a still image to generate speech video by conditioning on the corresponding audio.…”

Section: Introduction (mentioning)

confidence: 99%

Visually Guided Self Supervised Learning of Speech Representations

Shukla, Vougioukas, et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite


“…We can perform further parameter sharing by assuming that the mode-2 and mode-3 factor matrices are equivalent to the matrices describing the row spaces, i.e. : B (2)…”

Section: Polynomial Fusion Layer (mentioning)

confidence: 99%

“…We ran experiments on the above datasets using the following methodologies for the polynomial fusion layer: (3), (4), (5), (6b) (PF-CMF-SR). We set a = 256, d = 128, n = 10 and trained on video sequences of 3 seconds with frame size 128 × 96 as per [2]. For all models, c = a + d + n = 394, implying m = 384, given that c = m + n by construction.…”

Section: Training Protocol (mentioning)

confidence: 99%

“…CGI traditionally employs face capture methods for facial synthesis [2], which are costly and require manual labor. To alleviate this, recent research has focused on automatic generation of video with machine learning [1,2,3,4,5]. MoCo-GAN [4] models motion and content as separate latent spaces, which are learned in an unsupervised way using both image and video discriminators.…”

Section: Introduction (mentioning)

confidence: 99%


Speech-Driven Facial Animation Using Polynomial Fusion of Features

Kefalas, Vougioukas, Panagakis, et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite


Speech-driven facial animation involves using a speech signal to generate realistic videos of talking faces. Recent deep learning approaches to facial synthesis rely on extracting low-dimensional representations and concatenating them, followed by a decoding step of the concatenated vector. This accounts for only first-order interactions of the features and ignores higher-order interactions. In this paper we propose a polynomial fusion layer that models the joint representation of the encodings by a higher-order polynomial, with the parameters modelled by a tensor decomposition. We demonstrate the suitability of this approach through experiments on generated videos evaluated on a range of metrics on video quality, audiovisual synchronisation and generation of blinks.
