“…Therefore, AP and F0 were not estimated from the silent video but artificially produced without taking the visual information into account, while SP was estimated with a Gaussian mixture model (GMM) and an FFNN within a regression-based framework. As input to the models, two different visual features were considered, 2-D DCT and AAM, while the explored SP representations were linear predictive coding (LPC) coefficients and mel-filterbank amplitudes. While the choice of visual features did not have a large impact on the results, the use of mel-filterbank amplitudes made it possible to outperform the systems based on LPC coefficients.

Ref    Year  Visual features            Speech representation                          Models               Region
[149]  2017  AAM                        Codebook entries (mel-filterbank amplitudes)   FFNN / RNN           mouth
[57]   2017  Raw pixels                 LSP of LPC                                     CNN, FFNN            face
[56]   2017  Raw pixels, optical flow   Mel-scale and linear-scale spectrograms        CNN, FFNN, BiGRU     face
[11]   2018  Raw pixels                 AE features, spectrogram                       CNN, LSTM, FFNN, AE  face
[145]  2018  Raw pixels                 LSP of LPC                                     CNN, LSTM, FFNN      mouth
[147]  2018  Raw pixels                 LSP of LPC                                     CNN, BiGRU, FFNN     mouth
[146]  2019  Raw pixels                 LSP of LPC                                     CNN, BiGRU, FFNN     mouth
[243]  2019  Raw pixels                 WORLD spectrum                                 CNN, FFNN            mouth
[256]  2019  Raw pixels                 Raw waveform                                   GAN, CNN, GRU        mouth
[247]  2019  Raw pixels                 AE features, spectrogram                       CNN, LSTM, FFNN, AE  mouth
[177]  2020  Raw pixels                 WORLD features                                 CNN, GRU, FFNN       mouth / face
[206]  2020  Raw pixels                 mel-scale spectrogram                          CNN, LSTM            face
…”
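To make the regression-based framework concrete, the sketch below shows one possible reading of it: 2-D DCT coefficients extracted from a mouth-region frame are mapped to mel-filterbank amplitudes by a small FFNN. This is a minimal illustration only, not the implementation of any surveyed system; the DCT patch size, the number of mel bands, the hidden-layer width, and the function and class names (dct2d_features, VisualToMelFFNN) are assumptions introduced here, and the GMM-based variant and the AAM features mentioned in the text are not shown.

```python
# Minimal sketch of a regression-based visual-to-speech mapping (illustrative only).
# Assumptions: 64x64 grayscale mouth frames, 8x8 retained DCT coefficients,
# 25 mel-filterbank amplitudes, and a 2-hidden-layer FFNN.
import numpy as np
import torch
import torch.nn as nn
from scipy.fftpack import dct


def dct2d_features(frame: np.ndarray, k: int = 8) -> np.ndarray:
    """Keep the k x k lowest-frequency 2-D DCT coefficients of a grayscale frame."""
    coeffs = dct(dct(frame, axis=0, norm="ortho"), axis=1, norm="ortho")
    return coeffs[:k, :k].ravel().astype(np.float32)


class VisualToMelFFNN(nn.Module):
    """FFNN regressing mel-filterbank amplitudes from per-frame visual features."""

    def __init__(self, in_dim: int = 64, n_mels: int = 25, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_mels),  # linear output layer for regression
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


# Toy usage: one synthetic mouth-region frame -> predicted mel amplitudes.
frame = np.random.rand(64, 64).astype(np.float32)        # stand-in for a video frame
features = torch.from_numpy(dct2d_features(frame)).unsqueeze(0)
model = VisualToMelFFNN()
pred_mel = model(features)                                # trained with an MSE loss in practice
print(pred_mel.shape)                                     # torch.Size([1, 25])
```

In such a setup the network is trained frame by frame with a mean-squared-error loss against mel-filterbank amplitudes computed from the reference audio; AP and F0 would still have to be supplied separately, as described above.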