ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414097

Switching Variational Auto-Encoders for Noise-Agnostic Audio-Visual Speech Enhancement

Abstract: Recently, audio-visual speech enhancement has been tackled in an unsupervised setting based on variational autoencoders (VAEs), where during training only clean data is used to train a generative model for speech, which at test time is combined with a noise model, e.g., nonnegative matrix factorization (NMF), whose parameters are learned without supervision. Consequently, the proposed model is agnostic to the noise type. When visual data are clean, audio-visual VAE-based architectures usually outperform the a…
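To make the recipe in the abstract concrete, here is a minimal, self-contained sketch of the VAE-plus-NMF enhancement scheme this family of methods uses. It is not the authors' code: the decoder (W_dec, decoder_variance) is a random stand-in for a pretrained (audio-visual) VAE, the noisy spectrogram is toy data rather than a real STFT, the latent code z is held fixed instead of being inferred (the paper's method would also estimate it), and the noise parameters are fitted with standard Itakura-Saito NMF multiplicative updates before a Wiener filter recovers the speech estimate.

    import numpy as np

    rng = np.random.default_rng(0)

    F, T, L, K = 257, 100, 32, 8   # frequency bins, frames, latent dim, NMF rank

    # Hypothetical stand-in for the pretrained (audio-visual) VAE decoder:
    # fixed random weights mapping latent vectors to speech power spectra.
    W_dec = 0.1 * np.abs(rng.standard_normal((F, L)))

    def decoder_variance(z):
        """Map latent codes z of shape (L, T) to speech variances (F, T)."""
        return np.exp(W_dec @ np.tanh(z)) + 1e-8

    # Toy noisy power spectrogram standing in for the STFT of noisy speech.
    X_pow = np.abs(rng.standard_normal((F, T))) ** 2 + 1e-8

    # Unsupervised noise model: rank-K NMF, with W @ H as the noise variance.
    W = np.abs(rng.standard_normal((F, K))) + 1e-3
    H = np.abs(rng.standard_normal((K, T))) + 1e-3
    z = 0.01 * rng.standard_normal((L, T))   # latent speech code (held fixed here)

    for _ in range(50):
        s_var = decoder_variance(z)          # speech variance from the "VAE"
        v = s_var + W @ H                    # total variance of the noisy mixture
        # Itakura-Saito multiplicative updates for the noise parameters only;
        # the full method would also infer z (e.g., via the encoder or MCEM).
        W *= ((X_pow / v**2) @ H.T) / ((1.0 / v) @ H.T)
        H *= (W.T @ (X_pow / v**2)) / (W.T @ (1.0 / v))

    # Wiener filtering: keep the fraction of the mixture explained by speech.
    s_var = decoder_variance(z)
    gain = s_var / (s_var + W @ H)
    S_hat_pow = gain * X_pow
    print("mean Wiener gain:", float(gain.mean()))

Because the speech model is trained on clean data only and the noise model is fitted per utterance at test time, nothing in this alternation depends on the noise type, which is what makes the approach noise-agnostic.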

Cited by 6 publications (2 citation statements) | References 16 publications

“…In all the following, we do not consider SRNN-TF-Pred anymore, since this model was shown to be inefficient for speech enhancement. The circumflex-accent shape of the curves is often encountered when reporting improvement over the noisy input [28], [60]–[62]. Indeed, when the input is very noisy, it is difficult to improve the quality.…”
Section: We Report In | Citation type: mentioning | Confidence: 99%
“…(AV) multi-modal approaches have been applied widely in the speech community [6]–[12]. The visual information obtained by analyzing lip shapes or facial expressions is more robust than the audio information in complex acoustic scenarios.…”
Section: Introduction | Citation type: mentioning | Confidence: 99%