Interspeech 2019
DOI: 10.21437/interspeech.2019-3269

Hush-Hush Speak: Speech Reconstruction Using Silent Videos

Abstract: Speech reconstruction is the task of recreating speech using silent videos as input; in the literature, it is also referred to as lipreading. In this paper, we design an encoder-decoder architecture which takes silent videos as input and outputs an audio spectrogram of the reconstructed speech. Despite being speaker-independent, the model achieves results on speech reconstruction comparable to the current state-of-the-art speaker-dependent model. We also perform user studies to infer speech intelligibility. […]
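The abstract does not specify the layers involved, so as a rough illustration only, below is a minimal PyTorch sketch of a video-to-spectrogram encoder-decoder of the general kind described. The per-frame 2D-CNN encoder, bidirectional LSTM, layer sizes, and all names are assumptions for illustration, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class VideoToSpectrogram(nn.Module):
    """Hypothetical encoder-decoder: silent video frames -> audio spectrogram.

    Shapes and layers are illustrative assumptions, not the paper's model.
    """
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        # Encoder: a small 2D CNN applied to each grayscale frame.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
            nn.Flatten(),                     # -> (batch*time, 64*4*4)
            nn.Linear(64 * 4 * 4, hidden),
        )
        # Decoder: a recurrent layer over frame embeddings, then a linear
        # projection to one spectrogram frame per video frame.
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_mels)

    def forward(self, frames):
        # frames: (batch, time, 1, H, W) grayscale video clip
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        out, _ = self.rnn(feats)
        return self.head(out)                 # (batch, time, n_mels)

model = VideoToSpectrogram()
dummy = torch.randn(2, 25, 1, 64, 64)         # 2 clips, 25 frames, 64x64 crops
print(model(dummy).shape)                     # torch.Size([2, 25, 80])
```

Note that this shape-only sketch emits one spectrogram frame per video frame; real systems typically upsample in time, since audio frames are much denser than video frames.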

Cited by 14 publications (10 citation statements)
References: 12 publications
“…Therefore, AP and F0 were not estimated from the silent video, but artificially produced without taking the visual information into account, while SP was estimated with a Gaussian mixture model (GMM) and FFNN within a regression-based framework. As input to the models, two different visual features were considered, 2-D DCT and AAM, while the explored SP representations were linear predictive coding (LPC) coefficients and mel-filterbank amplitudes. While the choice of visual features did not have a big impact on the results, the use of mel-filterbank amplitudes allowed the systems to outperform those based on LPC coefficients.…”

Overview of video-to-speech reconstruction systems (reference, year, visual features, audio representation, models, facial region):

| Ref | Year | Visual features | Audio representation | Models | Region |
| --- | --- | --- | --- | --- | --- |
| [149] | 2017 | AAM | Codebook entries (mel-filterbank amplitudes) | FFNN / RNN | mouth |
| [57] | 2017 | Raw pixels | LSP of LPC | CNN, FFNN | face |
| [56] | 2017 | Raw pixels, optical flow | Mel-scale and linear-scale spectrograms | CNN, FFNN, BiGRU | face |
| [11] | 2018 | Raw pixels | AE features, spectrogram | CNN, LSTM, FFNN, AE | face |
| [145] | 2018 | Raw pixels | LSP of LPC | CNN, LSTM, FFNN | mouth |
| [147] | 2018 | Raw pixels | LSP of LPC | CNN, BiGRU, FFNN | mouth |
| [146] | 2019 | Raw pixels | LSP of LPC | CNN, BiGRU, FFNN | mouth |
| [243] | 2019 | Raw pixels | WORLD spectrum | CNN, FFNN | mouth |
| [256] | 2019 | Raw pixels | Raw waveform | GAN, CNN, GRU | mouth |
| [247] | 2019 | Raw pixels | AE features, spectrogram | CNN, LSTM, FFNN, AE | mouth |
| [177] | 2020 | Raw pixels | WORLD features | CNN, GRU, FFNN | mouth / face |
| [206] | 2020 | Raw pixels | Mel-scale spectrogram | CNN, LSTM | face |
Section: A. Speech Reconstruction From Silent Videos
Citation type: mentioning, confidence: 99%
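To make the regression-based framework in this excerpt concrete, here is a minimal sketch using scikit-learn and random stand-in data: an FFNN maps per-frame visual features (e.g., 2-D DCT coefficients) to mel-filterbank amplitudes, while F0 is fixed artificially rather than predicted. The feature dimensions, network sizes, and constant F0 value are all illustrative assumptions, and the GMM-based variant mentioned in the excerpt is omitted.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Stand-in data: per-frame visual features (e.g., 32 2-D DCT coefficients of
# the mouth region) and target spectral-envelope (SP) features (e.g., 25
# mel-filterbank amplitudes). Both arrays are random placeholders; a real
# system extracts them from the video and the reference audio.
rng = np.random.default_rng(0)
X_visual = rng.normal(size=(5000, 32))   # assumed visual feature dimension
Y_sp = rng.normal(size=(5000, 25))       # assumed SP dimension

# FFNN regression from visual features to SP, one frame at a time.
ffnn = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=200, random_state=0)
ffnn.fit(X_visual, Y_sp)
sp_hat = ffnn.predict(X_visual[:10])     # estimated spectral envelope

# AP and F0 are not estimated from video in this framework: they are set
# artificially before vocoder synthesis, e.g., a constant F0 contour.
f0_hat = np.full(len(sp_hat), 120.0)     # 120 Hz is an arbitrary assumption
print(sp_hat.shape, f0_hat.shape)        # (10, 25) (10,)
```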
“…With advancements in deep learning and the increasing availability of large lip reading datasets, approaches such as (Wand, Koutník, and Schmidhuber 2016; Zhou et al. 2019; Salik et al. 2019) have addressed lip reading using deep-learning-based algorithms such as CNNs and LSTMs. Further, works such as (Lee, Lee, and Kim 2016; Kumar et al. 2018b; Uttam et al. 2019; Kumar et al. 2019) have extended lip reading from single-view to multi-view settings by incorporating videos of mouth sections from multiple views together. Multi-view lip reading has been shown to improve performance significantly compared to single view.…”
Section: Visual Speech Recognition
Citation type: mentioning, confidence: 99%
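As a sketch of the multi-view idea referenced above (not any cited paper's exact design), the following PyTorch code runs one small CNN per camera view, concatenates the per-frame features, and classifies with a shared GRU; the concatenation fusion and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class MultiViewLipReader(nn.Module):
    """Illustrative multi-view fusion: one CNN per camera view, features
    concatenated per time step, then a shared GRU and word classifier."""
    def __init__(self, n_views=2, feat=128, n_classes=10):
        super().__init__()
        self.view_cnns = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, feat),
            ) for _ in range(n_views)
        ])
        self.gru = nn.GRU(n_views * feat, 256, batch_first=True)
        self.cls = nn.Linear(256, n_classes)

    def forward(self, views):
        # views: list of (batch, time, 1, H, W) tensors, one per camera view
        per_view = []
        for cnn, v in zip(self.view_cnns, views):
            b, t = v.shape[:2]
            per_view.append(cnn(v.flatten(0, 1)).view(b, t, -1))
        fused = torch.cat(per_view, dim=-1)   # concatenate views per frame
        out, _ = self.gru(fused)
        return self.cls(out[:, -1])           # word-level prediction

frontal = torch.randn(2, 20, 1, 48, 48)       # dummy frontal-view clips
profile = torch.randn(2, 20, 1, 48, 48)       # dummy profile-view clips
print(MultiViewLipReader()([frontal, profile]).shape)  # torch.Size([2, 10])
```

Concatenation is only one possible fusion strategy; averaging or attention over views are common alternatives.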
“…With the advancements in deep learning and the increasing availability of large-scale lip reading datasets, approaches have been proposed that address lip reading using deep-learning-based algorithms (e.g., CNNs and LSTMs) [35, 45, 49]. Moreover, researchers have extended lip reading from single-view to multi-view settings by incorporating videos of the mouth section from multiple views together [21, 22, 24, 37, 44]. Multi-view lip reading has been shown to improve performance significantly compared to single view.…”
Section: Related Work
Citation type: mentioning, confidence: 99%