Mapping and Masking Targets Comparison using Different Deep Learning based Speech Enhancement Architectures

Nossier, Soha A.; Wall, Julie; Moniri, M.; Glackin, Cornelius; Cannings, Nigel

doi:10.1109/ijcnn48605.2020.9206623

Cited by 15 publications

(5 citation statements)

References 46 publications

“…where, [30]. The noisy phase was stored to be added to the final estimated clean speech, assuming that the phase component is not highly affected by noise, compared to the magnitude [20].…”

Section: The Proposed Speech Enhancement Approach a Problem Definition (mentioning)

confidence: 99%

Two-Stage Deep Learning Approach for Speech Enhancement and Reconstruction in The Frequency and Time Domains

Nossier, Wall, Moniri, et al. 2022

2022 International Joint Conference on Neural Networks (IJCNN)

Self Cite

Deep learning has recently shown promising improvement in the speech enhancement field, due to its effectiveness in eliminating noise. However, a drawback of the denoising process is the introduction of speech distortion, which negatively affects speech quality and intelligibility. In this work, we propose a deep convolutional denoising autoencoder-based speech enhancement network that is designed to have an encoder deeper than the decoder, to improve performance and decrease complexity. Furthermore, we present a two-stage learning approach, in which denoising is performed in the first frequency domain stage using magnitude spectrum as a training target; while, in the second stage, further denoising and speech reconstruction are performed in the time domain. Results show that our architecture achieves 0.22 improvement in the overall predicted mean opinion score (Covl) over state of the art speech enhancement architectures, using the Valentini dataset benchmark. Moreover, the architecture was trained using a larger dataset and tested using a mismatched test corpus, to achieve 0.7 and 6.35% improvement in Perceptual Evaluation of Speech Quality (PESQ) and Short Time Objective Intelligibility (STOI) scores, respectively, compared to the noisy speech.
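The citation statement in this entry describes storing the noisy phase and attaching it to the estimated clean magnitude. A minimal sketch of that recombination for a single frame, in plain Python, could look as follows. The function name `recombine_with_noisy_phase` and the toy values are illustrative assumptions and are not taken from the cited papers.

```python
import cmath

def recombine_with_noisy_phase(est_magnitude, noisy_spectrum):
    """Attach the phase of each noisy STFT bin to an estimated clean magnitude.

    est_magnitude  : non-negative floats (stand-in for a network's output)
    noisy_spectrum : complex STFT bins of the noisy frame
    """
    if len(est_magnitude) != len(noisy_spectrum):
        raise ValueError("magnitude and spectrum must have the same number of bins")
    enhanced = []
    for mag, noisy_bin in zip(est_magnitude, noisy_spectrum):
        phase = cmath.phase(noisy_bin)            # noisy phase, kept unchanged
        enhanced.append(cmath.rect(mag, phase))   # mag * exp(j * phase)
    return enhanced

# Toy frame: three complex bins and a hypothetical magnitude estimate.
noisy = [complex(3, 4), complex(0, -2), complex(-1, 0)]
estimate = [2.5, 1.0, 0.5]
enhanced = recombine_with_noisy_phase(estimate, noisy)
```

Each enhanced bin has the estimated magnitude and points in the same direction as the corresponding noisy bin, which is the assumption the statement makes about phase being less affected by noise than magnitude.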

“…While it is referred to as classification problem if the target is to estimate a matrix, known as a mask. The mask is applied as filter to the output to produce the enhanced clean speech signal [39].…”

Section: Speech Enhancement (mentioning)

confidence: 99%

Speech Enhancement Using Deep Learning Methods: A Review

Yuliani, Amri, Suryawati, et al. 2021

J. Elektron. dan Telekomun.

Speech enhancement, which aims to recover the clean speech of the corrupted signal, plays an important role in the digital speech signal processing. According to the type of degradation and noise in the speech signal, approaches to speech enhancement vary. Thus, the research topic remains challenging in practice, specifically when dealing with highly non-stationary noise and reverberation. Recent advance of deep learning technologies has provided great support for the progress in speech enhancement research field. Deep learning has been known to outperform the statistical model used in the conventional speech enhancement. Hence, it deserves a dedicated survey. In this review, we described the advantages and disadvantages of recent deep learning approaches. We also discussed challenges and trends of this field. From the reviewed works, we concluded that the trend of the deep learning architecture has shifted from the standard deep neural network (DNN) to convolutional neural network (CNN), which can efficiently learn temporal information of speech signal, and generative adversarial network (GAN), that utilize two networks training.
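The citation statement in this entry contrasts estimating a mask with estimating the clean spectrum directly (mapping). The sketch below illustrates the masking side under stated assumptions: it uses the ideal ratio mask, one common mask definition that the source does not name, and applies it element-wise to the noisy magnitude. The function names and toy values are hypothetical.

```python
def ideal_ratio_mask(clean_mag, noise_mag, eps=1e-12):
    """Per-bin mask in [0, 1]: sqrt(clean energy / (clean energy + noise energy))."""
    return [
        (s * s / (s * s + n * n + eps)) ** 0.5
        for s, n in zip(clean_mag, noise_mag)
    ]

def apply_mask(mask, noisy_mag):
    """Filter the noisy magnitude by element-wise multiplication with the mask."""
    return [m * y for m, y in zip(mask, noisy_mag)]

# Toy magnitudes chosen so that noisy^2 = clean^2 + noise^2 in every bin.
clean_mag = [1.0, 0.0, 3.0]
noise_mag = [0.0, 2.0, 4.0]
noisy_mag = [1.0, 2.0, 5.0]

mask = ideal_ratio_mask(clean_mag, noise_mag)   # masking target
masked = apply_mask(mask, noisy_mag)            # enhanced magnitude
```

With a mapping target, a network would be trained to output `clean_mag` itself; with a masking target it would be trained to output `mask`, and the enhanced magnitude is obtained only after the multiplication step.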

“…In Reference [30], an investigation is presented on the two speech enhancement learning domains, time, and frequency; while, the work in [31] explains how CNNs learn features from raw audio time series. In Reference [22], the effect of the speech enhancement training targets used for the MLP architecture was studied; and recently, this study was extended to include different architectures [32]. The use of different loss functions for the time domain approach for speech enhancement was also recently evaluated in [33].…”

Section: Problem Definition and Research Contribution (mentioning)

confidence: 99%

“…The magnitude power spectrum of the signal was then extracted with 256 FFT size, and the noisy phase was kept to be added to the estimated clean speech, while assuming that the phase is less affected by the noise [94]. Magnitude spectrogram mapping is the training target used in all evaluations in order to ensure the good generalization for all architecture types [32].…”

Section: Training Setup (mentioning)

confidence: 99%

An Experimental Analysis of Deep Learning Architectures for Supervised Speech Enhancement

et al. 2020

Self Cite

Recent speech enhancement research has shown that deep learning techniques are very effective in removing background noise. Many deep neural networks are being proposed, showing promising results for improving overall speech perception. The Deep Multilayer Perceptron, Convolutional Neural Networks, and the Denoising Autoencoder are well-established architectures for speech enhancement; however, choosing between different deep learning models has been mainly empirical. Consequently, a comparative analysis is needed between these three architecture types in order to show the factors affecting their performance. In this paper, this analysis is presented by comparing seven deep learning models that belong to these three categories. The comparison includes evaluating the performance in terms of the overall quality of the output speech using five objective evaluation metrics and a subjective evaluation with 23 listeners; the ability to deal with challenging noise conditions; generalization ability; complexity; and, processing time. Further analysis is then provided while using two different approaches. The first approach investigates how the performance is affected by changing network hyperparameters and the structure of the data, including the Lombard effect. While the second approach interprets the results by visualizing the spectrogram of the output layer of all the investigated models, and the spectrograms of the hidden layers of the convolutional neural network architecture. Finally, a general evaluation is performed for supervised deep learning-based speech enhancement while using SWOC analysis, to discuss the technique’s Strengths, Weaknesses, Opportunities, and Challenges. The results of this paper contribute to the understanding of how different deep neural networks perform the speech enhancement task, highlight the strengths and weaknesses of each architecture, and provide recommendations for achieving better performance. 
This work facilitates the development of better deep neural networks for speech enhancement in the future.
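The Training Setup statement in this entry mentions extracting the magnitude spectrum with an FFT size of 256 while keeping the noisy phase. A rough sketch of that feature extraction for one frame is given below. The Hann window, the direct DFT sum, and the name `frame_spectrum` are assumptions made for illustration; the source specifies only the FFT size.

```python
import cmath
import math

N_FFT = 256  # FFT size quoted in the citation statement

def frame_spectrum(frame):
    """Return (magnitude, phase) for the N_FFT // 2 + 1 non-redundant bins of a real frame."""
    if len(frame) != N_FFT:
        raise ValueError("expected a frame of N_FFT samples")
    # Hann window: an assumption, the window type is not given in the source.
    windowed = [
        x * (0.5 - 0.5 * math.cos(2 * math.pi * n / N_FFT))
        for n, x in enumerate(frame)
    ]
    magnitude, phase = [], []
    for k in range(N_FFT // 2 + 1):
        # Direct DFT sum for clarity; a real pipeline would call an FFT routine.
        bin_k = sum(
            x * cmath.exp(-2j * math.pi * k * n / N_FFT)
            for n, x in enumerate(windowed)
        )
        magnitude.append(abs(bin_k))      # training feature / target
        phase.append(cmath.phase(bin_k))  # kept aside for reconstruction
    return magnitude, phase

# Toy input: a sinusoid centred exactly on bin 16.
frame = [math.sin(2 * math.pi * 16 * n / N_FFT) for n in range(N_FFT)]
magnitude, phase = frame_spectrum(frame)
peak_bin = max(range(len(magnitude)), key=magnitude.__getitem__)
```

A 256-point transform of a real frame yields 129 non-redundant bins, so each frame contributes a 129-element magnitude vector, with the matching phase vector stored for the final resynthesis.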
