2020
DOI: 10.1109/taslp.2020.3025638

Speech Enhancement Based on Denoising Autoencoder With Multi-Branched Encoders

Abstract: Deep learning-based models have greatly advanced the performance of speech enhancement (SE) systems. However, two problems remain unsolved, which are closely related to model generalizability to noisy conditions: (1) mismatched noisy condition during testing, i.e., the performance is generally suboptimal when models are tested with unseen noise types that are not involved in the training data; (2) local focus on specific noisy conditions, i.e., models trained using multiple types of noises cannot optimally rem…

Cited by 28 publications (9 citation statements)
References 63 publications
“…Different types of mask-based methods have been used in the literature, such as ideal binary masks (IBM) and ideal ratio masks (IRM) [3]. Auto-encoder based approaches to speech enhancement favor compact features such as Mel-frequency power spectra [4] and short term Fourier transform (STFT) spectra computed across short utterances [5,6,7] or a small temporal context [8]. Deep networks predominantly use higher-dimension log-power spectra with a comparably long temporal context in an attempt to learn features best representing clean speech [9,10].…”
Section: Introduction (mentioning)
confidence: 99%
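
The two mask types named in the statement above can be illustrated with a minimal sketch. It assumes per-bin speech and noise power values are already available (for example from an STFT) and uses the commonly cited definitions: the IBM is 1 where the local SNR exceeds a threshold, and the IRM is the speech-to-mixture power ratio raised to an exponent. The function names, the 0 dB threshold, and the exponent 0.5 are illustrative assumptions, not taken from the cited papers.

```python
import math

EPS = 1e-12  # guards against division by zero and log of zero


def ideal_binary_mask(speech_power, noise_power, threshold_db=0.0):
    """Return 1.0 for bins whose local SNR exceeds threshold_db, else 0.0."""
    mask = []
    for s, n in zip(speech_power, noise_power):
        local_snr_db = 10.0 * math.log10((s + EPS) / (n + EPS))
        mask.append(1.0 if local_snr_db > threshold_db else 0.0)
    return mask


def ideal_ratio_mask(speech_power, noise_power, beta=0.5):
    """Return (S / (S + N)) ** beta per bin; 0.0 where both powers are zero."""
    mask = []
    for s, n in zip(speech_power, noise_power):
        total = s + n
        mask.append((s / total) ** beta if total > 0 else 0.0)
    return mask


# Hypothetical per-bin powers for three time-frequency bins.
speech = [4.0, 1.0, 0.25]
noise = [1.0, 1.0, 1.0]
ibm = ideal_binary_mask(speech, noise)
irm = ideal_ratio_mask(speech, noise)
```

The binary mask makes a hard keep-or-discard decision per bin, while the ratio mask yields a soft gain between 0 and 1.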
“…So far, autoencoders have been used in many audio applications as an analysis-synthesis scheme where the input signal's dimension is reduced to a latent vector (encoding), and the signal is regenerated from it (decoding). In [18], the authors used a denoising AE to reduce noise and enhance the quality of synthesized speech. In addition, a deep autoencoder is used to extract significant features from the spectral envelope, which improves the text-to-speech synthesis procedure [19].…”
Section: Introduction (mentioning)
confidence: 99%
“…Current automatic speech recognition (ASR) systems have shown satisfactory performance in acoustic scenarios with controlled noise levels; however, in environments with a low signal-to-noise ratio (SNR), the operation of these systems is severely impaired [6]. In this context, although noise robustness remains a critical problem in real-world applications, most state-of-the-art research on KWS has not (effectively) taken the effects of noise into account [7], [8].…”
Section: Introduction (unclassified)
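
As a small illustration of the SNR quantity mentioned in the statement above, the sketch below computes SNR in decibels as ten times the base-10 logarithm of the ratio of signal power to noise power. It assumes the clean signal and its noisy version are available as equal-length sample lists; the function name `snr_db` and the sample values are hypothetical.

```python
import math


def snr_db(clean, noisy):
    """SNR in dB, treating (noisy - clean) as the additive noise."""
    noise = [y - x for x, y in zip(clean, noisy)]
    signal_power = sum(x * x for x in clean)
    noise_power = sum(e * e for e in noise)
    if noise_power == 0:
        return math.inf  # no noise at all
    return 10.0 * math.log10(signal_power / noise_power)


# Hypothetical samples: noise amplitude is one tenth of the signal amplitude.
clean = [1.0, -1.0, 1.0, -1.0]
noisy = [1.1, -1.1, 1.1, -1.1]
value = snr_db(clean, noisy)
```

With noise at one tenth of the signal amplitude, the power ratio is about 100, so the result is close to 20 dB; lower values correspond to the harder conditions the statement describes.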
“…In [8] and [9], several strategies for noise reduction and speech enhancement are discussed. Recently, with the development of deep learning techniques, great advances have been achieved in these application areas.…”
Section: Introduction (unclassified)