Interspeech 2019
DOI: 10.21437/interspeech.2019-1398

A Statistically Principled and Computationally Efficient Approach to Speech Enhancement Using Variational Autoencoders

Abstract: Recent studies have explored the use of deep generative models of speech spectra based on variational autoencoders (VAEs), combined with unsupervised noise models, to perform speech enhancement. These studies developed iterative algorithms involving either Gibbs sampling or gradient descent at each step, making them computationally expensive. This paper proposes a variational inference method to iteratively estimate the power spectrogram of the clean speech. Our main contribution is the analytical derivation o…
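To make the setting concrete, here is a minimal Python sketch of the generative modelling that these VAE-based enhancement methods share: a pretrained VAE decoder provides per-bin speech variances, an unsupervised NMF model provides noise variances, and the clean signal is recovered with a Wiener-like gain. All names (decoder, z, W, H) are hypothetical placeholders, and this is only an illustration of the setting under stated assumptions, not the algorithm derived in the paper.

import numpy as np

def enhance(X, decoder, z, W, H, n_iter=50):
    """X: noisy STFT (F x N, complex). decoder(z) -> speech variances (F x N).
    W (F x K) and H (K x N): nonnegative NMF factors of the noise variance."""
    V_x = np.abs(X) ** 2                       # noisy power spectrogram
    for _ in range(n_iter):
        V_s = decoder(z)                       # speech variance from the VAE prior p(s | z)
        V = V_s + W @ H                        # total variance: speech + NMF noise
        # Multiplicative updates keep W and H nonnegative (Itakura-Saito NMF form).
        W *= ((V_x / V ** 2) @ H.T) / ((1.0 / V) @ H.T)
        V = V_s + W @ H
        H *= (W.T @ (V_x / V ** 2)) / (W.T @ (1.0 / V))
        # Earlier works update z here with Gibbs sampling or gradient steps;
        # the paper instead derives a cheaper variational update.
    V_b = W @ H
    return V_s / (V_s + V_b) * X               # Wiener-like estimate of the clean STFT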

Cited by 21 publications (18 citation statements)
References 24 publications

“…To reduce the computational cost, we previously proposed to exploit the pretrained encoder of a CVAE as an approximate posterior estimator to infer the latent space variable z in [1]. With the same motivation, a fast algorithm for estimating the parameters of the VAE-NMF model was later derived based on the Bayesian inference in [38] for single-channel speech enhancement.…”
Section: VAE-based Methods
confidence: 99%
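For context on the statement above, the following PyTorch sketch illustrates the general idea of reusing a pretrained VAE encoder as an approximate posterior: the latent variables are inferred with a single forward pass on the noisy spectrogram rather than with per-iteration sampling or gradient descent. The encoder interface (mean and log-variance of q(z | x)) and the shapes are assumptions, not the cited papers' exact design.

import torch

@torch.no_grad()
def infer_latents(encoder, noisy_power):
    """noisy_power: (N, F) tensor of noisy power-spectrogram frames."""
    mean, log_var = encoder(noisy_power)       # parameters of q(z | x), each (N, L)
    std = torch.exp(0.5 * log_var)
    return mean + std * torch.randn_like(std)  # one sampled latent vector per frame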
“…where $\phi^{\text{FFNN}}_{\text{enc}}(\cdot\,;\theta_{\text{enc}}): \mathbb{C}^F \mapsto \mathbb{R}^L \times \mathbb{R}^L_+$ denotes the output of an FFNN. Such an architecture was used in [8,9,10,11,12,13,14]. This is the only case where, from the approximate posterior, we can sample all latent vectors in parallel for all time frames, without further approximation.…”
Section: Training
confidence: 99%
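The mapping from C^F to R^L x R^L_+ in the quote can be illustrated with the frame-wise feed-forward encoder below (PyTorch; the layer sizes, the power-spectrum input features, and the softplus parameterization of the variance are assumptions rather than the architecture used in the cited works). Because the network acts on each frame independently, the approximate posterior for all time frames can be evaluated and sampled in a single batched call.

import torch
import torch.nn as nn

class FFNNEncoder(nn.Module):
    def __init__(self, n_freq: int, latent_dim: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(n_freq, hidden), nn.Tanh())
        self.mean_head = nn.Linear(hidden, latent_dim)       # mean in R^L
        self.var_head = nn.Linear(hidden, latent_dim)        # variance in R^L_+ after softplus

    def forward(self, frames: torch.Tensor):
        """frames: (N, F) complex STFT frames, one row per time frame."""
        h = self.trunk(frames.abs() ** 2)                    # power-spectrum features in R^F
        return self.mean_head(h), nn.functional.softplus(self.var_head(h))

# All N frames are encoded and sampled in parallel (reparameterization trick):
# mean, var = FFNNEncoder(513, 16)(frames)
# z = mean + var.sqrt() * torch.randn_like(mean)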
“…for all TF bins (f, n). As done in previous works [5][6][7][8][9][10], we use an unsupervised NMF-based Gaussian noise model that assumes independence across TF bins:…”
Section: VAE-MM Inference and Learning
confidence: 99%
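The noise model mentioned in this last quote can be written down in a few lines. The NumPy sketch below (with an illustrative sampling helper, not code from the cited works) makes explicit that each noise coefficient b[f, n] is a zero-mean circular complex Gaussian with variance (WH)[f, n], independent across time-frequency bins.

import numpy as np

def sample_nmf_noise(W, H, rng=None):
    """W: (F, K) nonnegative, H: (K, N) nonnegative. Returns an (F, N) complex noise STFT."""
    rng = rng or np.random.default_rng()
    var = W @ H                                # per-bin noise variance (W H)[f, n]
    scale = np.sqrt(var / 2.0)                 # split variance between real and imaginary parts
    return rng.normal(scale=scale) + 1j * rng.normal(scale=scale)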