Interspeech 2019
DOI: 10.21437/interspeech.2019-1445

Video-Driven Speech Reconstruction Using Generative Adversarial Networks

Abstract: Speech is a means of communication which relies on both audio and visual information. The absence of one modality can often lead to confusion or misinterpretation of information. In this paper we present an end-to-end temporal model capable of directly synthesising audio from silent video, without needing to transform to-and-from intermediate features. Our proposed approach, based on GANs, is capable of producing natural-sounding, intelligible speech which is synchronised with the video. The performance of our …
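As a rough illustration of the end-to-end setup the abstract describes, the following is a minimal sketch of a generator that maps silent video frames directly to a raw waveform: a per-frame visual encoder, a temporal GRU, and a decoder that emits one chunk of audio samples per video frame. It assumes PyTorch, 64x64 mouth crops, 25 fps video and 16 kHz audio; the layer sizes and module choices are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of a video-to-waveform generator (illustrative, not the paper's exact model).
import torch
import torch.nn as nn

class VideoToWaveformGenerator(nn.Module):
    def __init__(self, latent_dim=256, samples_per_frame=640):
        super().__init__()
        # Per-frame visual encoder (expects 64x64 grayscale mouth crops).
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),    # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, latent_dim), nn.ReLU(),
        )
        # Temporal model over the sequence of frame embeddings.
        self.rnn = nn.GRU(latent_dim, latent_dim, batch_first=True)
        # Decoder mapping each temporal state to a chunk of raw audio samples.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, samples_per_frame),
            nn.Tanh(),  # waveform samples in [-1, 1]
        )

    def forward(self, video):
        # video: (batch, time, 1, 64, 64) silent mouth-region frames
        b, t = video.shape[:2]
        feats = self.frame_encoder(video.reshape(b * t, *video.shape[2:]))
        feats = feats.reshape(b, t, -1)
        states, _ = self.rnn(feats)
        chunks = self.decoder(states)          # (batch, time, samples_per_frame)
        return chunks.reshape(b, -1)           # (batch, time * samples_per_frame)

# Example: 25 fps video with 16 kHz audio gives 640 samples per frame.
gen = VideoToWaveformGenerator()
waveform = gen(torch.randn(2, 25, 1, 64, 64))  # ~1 s of audio per clip
print(waveform.shape)                          # torch.Size([2, 16000])
```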

Cited by 41 publications (56 citation statements)
References 29 publications
“…Therefore, AP and F0 were not estimated from the silent video, but artificially produced without taking the visual information into account, while SP was estimated with a Gaussian mixture model (GMM) and FFNN within a regression-based framework. As input to the models, two different visual features were considered, 2-D DCT and AAM, while the explored SP representations were linear predictive coding (LPC) coefficients and mel-filterbank amplitudes. While the choice of visual features did not have a big impact on the results, the use of mel-filterbank amplitudes allowed the resulting systems to outperform those based on LPC coefficients.

Ref | Year | Visual features | Audio representation | Models | Region
[149] | 2017 | AAM | Codebook entries (mel-filterbank amplitudes) | FFNN / RNN | mouth
[57] | 2017 | Raw pixels | LSP of LPC | CNN, FFNN | face
[56] | 2017 | Raw pixels, optical flow | Mel-scale and linear-scale spectrograms | CNN, FFNN, BiGRU | face
[11] | 2018 | Raw pixels | AE features, spectrogram | CNN, LSTM, FFNN, AE | face
[145] | 2018 | Raw pixels | LSP of LPC | CNN, LSTM, FFNN | mouth
[147] | 2018 | Raw pixels | LSP of LPC | CNN, BiGRU, FFNN | mouth
[146] | 2019 | Raw pixels | LSP of LPC | CNN, BiGRU, FFNN | mouth
[243] | 2019 | Raw pixels | WORLD spectrum | CNN, FFNN | mouth
[256] | 2019 | Raw pixels | Raw waveform | GAN, CNN, GRU | mouth
[247] | 2019 | Raw pixels | AE features, spectrogram | CNN, LSTM, FFNN, AE | mouth
[177] | 2020 | Raw pixels | WORLD features | CNN, GRU, FFNN | mouth / face
[206] | 2020 | Raw pixels | mel-scale spectrogram | CNN, LSTM | face
…”
Section: A Speech Reconstruction From Silent Videos
confidence: 99%
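The excerpt above describes reconstructing speech by combining a visually estimated spectral envelope (SP) with artificially generated excitation parameters (F0 and AP). The sketch below illustrates that final resynthesis step with the WORLD vocoder via the pyworld package; the placeholder SP array, sample rate, frame period, FFT size and F0 value are illustrative assumptions, not values from the cited systems.

```python
# Minimal sketch of WORLD-based resynthesis from an estimated spectral envelope
# plus artificial excitation parameters (illustrative values throughout).
import numpy as np
import pyworld

fs = 16000            # sample rate (Hz)
frame_period = 5.0    # ms between vocoder frames
fft_size = 1024       # -> 513 spectral bins per frame
num_frames = 200      # roughly 1 s of speech

# Placeholder for the SP predicted from video (e.g. by a GMM or FFNN regressor).
sp = np.abs(np.random.randn(num_frames, fft_size // 2 + 1)) * 1e-4

# Artificial excitation: a flat 120 Hz F0 contour and mostly periodic frames
# (aperiodicity near 0 means mostly voiced excitation).
f0 = np.full(num_frames, 120.0)
ap = np.full((num_frames, fft_size // 2 + 1), 0.1)

# WORLD expects float64 arrays.
waveform = pyworld.synthesize(f0, sp.astype(np.float64),
                              ap.astype(np.float64), fs, frame_period)
print(waveform.shape)  # roughly 1 s of audio at 16 kHz
```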
“…The method proposed in [177] aimed to still reconstruct speech in a speaker-independent scenario while avoiding artefacts similar to those introduced by the model in [256]. Therefore, vocoder features were used as the training target instead of raw waveforms.…”
Section: A Speech Reconstruction From Silent Videos
confidence: 99%
“…The encoded features are then passed to a pre-trained autoencoder network to predict the audio features. Using an adversarial training approach, the authors of [9] proposed a discriminator network (called a "critic") that distinguishes real audio from generated audio. However, the model also relies on a pre-trained speech-to-video network [10] as a reference model to minimize the perceptual loss between the features obtained from real and generated audio.…”
Section: Related Work
confidence: 99%
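For concreteness, below is a minimal sketch (in PyTorch) of how an adversarial critic loss can be combined with a perceptual loss computed from a frozen pre-trained reference network, as outlined in the excerpt above. The network definitions, the hinge-style adversarial term, and the loss weight lam are assumptions for illustration, not the exact formulation used in [9] or [10].

```python
# Sketch of combining a critic (adversarial) loss with a perceptual
# feature-matching loss against a frozen reference network.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Critic(nn.Module):
    """1-D convolutional critic scoring raw waveforms."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, 31, stride=4, padding=15), nn.LeakyReLU(0.2),
            nn.Conv1d(32, 64, 31, stride=4, padding=15), nn.LeakyReLU(0.2),
            nn.Conv1d(64, 1, 31, stride=4, padding=15),
        )

    def forward(self, wav):                                   # wav: (batch, samples)
        return self.net(wav.unsqueeze(1)).mean(dim=(1, 2))    # one score per clip

def generator_loss(critic, perceptual_net, fake_wav, real_wav, lam=10.0):
    # Adversarial term: the generator tries to raise the critic's score on its output.
    adv = -critic(fake_wav).mean()
    # Perceptual term: match features of a frozen pre-trained reference network.
    with torch.no_grad():
        target_feats = perceptual_net(real_wav)
    perc = F.l1_loss(perceptual_net(fake_wav), target_feats)
    return adv + lam * perc

# Example with a stand-in "perceptual" feature extractor on 1 s clips at 16 kHz.
critic = Critic()
perceptual_net = nn.Sequential(nn.Linear(16000, 128), nn.ReLU(), nn.Linear(128, 64))
fake = torch.randn(2, 16000, requires_grad=True)
real = torch.randn(2, 16000)
loss = generator_loss(critic, perceptual_net, fake, real)
loss.backward()
```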
“…In addition, sound and images can be converted into each other in [9]. Speech is reconstructed from facial videos in [40]. Talking-face generation [23,42,45,49,51] is a typical example of audio-to-video generation.…”
Section: Multi-modal Generation
confidence: 99%