ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053730
Robust Unsupervised Audio-Visual Speech Enhancement Using a Mixture of Variational Autoencoders

Abstract: Recently, an audio-visual speech generative model based on a variational autoencoder (VAE) has been proposed, which is combined with a nonnegative matrix factorization (NMF) model of the noise variance to perform unsupervised speech enhancement. When the visual data are clean, speech enhancement with the audio-visual VAE performs better than with an audio-only VAE trained on audio-only data. However, the audio-visual VAE is not robust against noisy visual data, e.g., when for some video frames the speaker's face is…
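The VAE-plus-NMF combination described in the abstract can be sketched numerically: a decoder maps a latent code to a per-frequency clean-speech variance, an NMF factorization models the noise variance, and the two are combined in a Wiener-like filter. The decoder below is a toy stand-in for the paper's trained network, and all data are synthetic; this is a minimal sketch of the structure, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained VAE decoder: maps a latent code z to a
# positive per-frequency speech variance (the paper uses a learned network).
def decoder_variance(z, n_freq=8):
    w = np.linspace(0.5, 2.0, n_freq)
    return np.exp(np.tanh(z) * w)

# NMF noise variance model: noise PSD approximated by W @ H (nonnegative).
n_freq, n_frames, rank = 8, 5, 2
W = rng.random((n_freq, rank)) + 0.1
H = rng.random((rank, n_frames)) + 0.1
noise_var = W @ H

# Observed noisy power spectrogram (synthetic toy data).
X = rng.random((n_freq, n_frames)) + 0.5

# Given per-frame latent codes z_t (inferred variationally in the paper),
# enhancement reduces to a Wiener-like gain: speech_var / (speech_var + noise_var).
Z = rng.standard_normal(n_frames)
speech_var = np.stack([decoder_variance(z) for z in Z], axis=1)
wiener_gain = speech_var / (speech_var + noise_var)
speech_est = wiener_gain * X  # estimated clean-speech power spectrogram
```

Because both variance models are strictly positive, the gain always lies strictly between 0 and 1, so the filter attenuates but never amplifies or zeroes a bin.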

Cited by 20 publications (49 citation statements) | References 51 publications
“…Estimators of speech quality based on energy ratios:
- SNR (Signal-to-Noise Ratio): does not provide a proper estimation of speech distortion [12], [65], [66], [109]
- SSNR / SSNRI (Segmental SNR / SSNR Improvement): assessment of short-time behaviour [100], [108], [239]
- SDI [31] (2006): provides a rough distortion measure [99], [100]
- SDR [252] (2006): specifically designed for blind audio source separation [7], [10], [17], [42], [55], [65], [85], [107]-[109], [136], [153], [154], [164], [165], [169], [183], [192], [195], [203], [208], [220]-[222]
- SIR [252] (2006): specifically designed for blind audio source separation [7], [65], [107], [136], [164], [165], [195]
- SAR [252] (2006): specifically designed for blind audio source separation [65], [107], [136], [164], [165], [195]
- SI-SDR …”
Section: IP Transmission
confidence: 99%
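Of the energy-ratio metrics listed above, SI-SDR is the most common in recent enhancement work. It is computed by projecting the estimate onto the reference so that rescaling the estimate does not change the score; a minimal sketch (toy signals, not any evaluation toolkit's implementation):

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-Invariant Signal-to-Distortion Ratio in dB.

    The estimate is projected onto the (mean-removed) reference, so the
    score is invariant to any rescaling of the estimate.
    """
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference   # scaled-reference component
    noise = estimate - target    # everything else counts as distortion
    return 10 * np.log10(np.sum(target**2) / np.sum(noise**2))

rng = np.random.default_rng(0)
s = rng.standard_normal(16000)                    # toy "clean" signal
noisy = s + 0.1 * rng.standard_normal(16000)      # additive distortion
score = si_sdr(noisy, s)                          # roughly 20 dB here
```

Doubling the amplitude of `noisy` leaves `score` unchanged, which is exactly the scale invariance that distinguishes SI-SDR from plain SNR.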
“…This means that in order to have good performance in a wide variety of settings, very large AV datasets for training and testing need to be collected. In practice, the systems are trained using a large number of complex acoustic [66], [76], [77], [85], [99], [122], [128], [164], [165], [176], [178], [179], [220]-[222], [244], [263], [274], […]
Feature types referenced:
- […] [17], [65], [154], [164], [165]
- Landmark-based features [100], [154], [183], [203]
- Multisensory features [195]
- Face recognition embedding [55], [109], [169], [192], [239]
- VSR embedding [7], [10], [107]-[109], [153], [222], [273]
- Facial appearance embedding [42], [208]
- Compressed mouth frames [37]
- Speaker direction [85], [244], [279]
- Acoustic Features…”
Section: Audio-Visual Speech Enhancement and Separation Systems
confidence: 99%
“…Recently, some unsupervised AVSE methods have been proposed that do not need noise signals for training [6][7][8], meaning that their training is agnostic to the noise type. This approach builds upon its audio-only speech enhancement counterpart [9, 10], which consists of two main steps.…”
Section: Introduction
confidence: 99%
“…In the VAE-based unsupervised setting, a different perspective is pursued owing to its probabilistic nature. In this regard, a robust generative model has been proposed in [7], which is a mixture of trained audio-only (A-VAE) and audio-visual (AV-VAE) models. Following a variational inference approach, the A-VAE model is chosen for noisy visual data, whereas the AV-VAE model is used for clean visual data, thus providing robustness.…”
Section: Introduction
confidence: 99%
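The mixture mechanism described in this quote can be sketched with per-frame soft responsibilities: each stand-in model proposes a speech variance, and frames whose observations fit the audio-visual model poorly (e.g., because the visual stream is unreliable) shift weight toward the audio-only model. This is an illustrative sketch with synthetic variances, not the variational inference procedure of the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n_freq, n_frames = 8, 6

# Per-frame speech variances proposed by two stand-in models:
# an audio-only model (A-VAE) and an audio-visual model (AV-VAE).
var_a = rng.random((n_freq, n_frames)) + 0.5
var_av = rng.random((n_freq, n_frames)) + 0.5

X = rng.random((n_freq, n_frames)) + 0.5  # noisy observation (toy)

# Complex-Gaussian log-likelihood of each frame under a variance model.
def frame_loglik(X, var):
    return -np.sum(np.log(np.pi * var) + X / var, axis=0)

# Soft responsibilities play the role of the mixture weights: the model
# that explains a frame better receives more of that frame's mass.
ll = np.stack([frame_loglik(X, var_a), frame_loglik(X, var_av)])
ll -= ll.max(axis=0)                         # numerical stability
resp = np.exp(ll) / np.exp(ll).sum(axis=0)   # columns sum to 1

# Per-frame convex combination of the two variance models.
mixed_var = resp[0] * var_a + resp[1] * var_av
```

Because the responsibilities form a convex combination per frame, the mixed variance always lies between the two models' proposals, degrading gracefully instead of switching hard.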