DeepMMSE: A Deep Learning Approach to MMSE-Based Noise Power Spectral Density Estimation

Zhang, Qiquan; Nicolson, Aaron; Wang, Mingjiang; Paliwal, Kuldip K.; Wang, Chenxu

doi:10.1109/taslp.2020.2987441

Cited by 115 publications

(99 citation statements)

References 40 publications

“…Deep neural network: ResNet-TCN A modified version of the residual network (ResNet) TCN from Zhang et al (2020) is used to evaluate each training target. The set of hyperparameters for ResNet-TCN used in this work are derived from Zhang et al (2020). It is shown from input to output in Figure 2.…”

Section: Experiments Setup (mentioning)

confidence: 99%

“…unit Each block contains three one-dimensional causal dilated convolutional units. Here, we modify the preactivation of the convolutional units in Zhang et al (2020) by using the rectifier linear activation function followed by layer normalisation without the scale and shift operations (again following Xu et al (2019)). The kernel size, output size, and dilation rate for each convolutional unit is denoted in Figure 2 as (kernel size, output size, dilation rate).…”

Section: Conv1d (mentioning)

confidence: 99%

On Training Targets for Deep Learning Approaches to Clean Speech Magnitude Spectrum Estimation

Nicolson; Paliwal

2021

Preprint

Self Cite

Estimation of the clean speech short-time magnitude spectrum (MS) is key for speech enhancement and separation. Moreover, an automatic speech recognition (ASR) system that employs a front-end relies on clean speech MS estimation to remain robust. Training targets for deep learning approaches to clean speech MS estimation fall into three categories: computational auditory scene analysis (CASA), MS, and minimum mean-square error (MMSE) estimator training targets. The choice of training target can have a significant impact on speech enhancement/separation and robust ASR performance. Motivated by this, we find which training target produces enhanced/separated speech at the highest quality and intelligibility, and which is best for an ASR front-end. Three different deep neural network (DNN) types and two datasets that include real-world non-stationary and coloured noise sources at multiple SNR levels were used for evaluation. Ten objective measures were employed, including the word error rate (WER) of the Deep Speech ASR system. We find that training targets that estimate the a priori signal-to-noise ratio (SNR) for MMSE estimators produce the highest objective quality scores. Moreover, we find that the gain of MMSE estimators and the ideal amplitude mask (IAM) produce the highest objective intelligibility scores and are most suitable for an ASR front-end.

“…A modified version of the residual network (ResNet) TCN from (Zhang et al, 2020) is used to evaluate each training target. It is shown from input to output in Figure 2.…”

Section: A Deep Neural Network: ResNet-TCN (mentioning)

confidence: 99%

“…The input is first transformed by FC, a fully-connected layer of size d_model = 256. Instead of applying layer normalisation (Ba et al, 2016) followed by the rectifier linear function to FC, as in (Zhang et al, 2020), we apply the rectifier linear activation function followed by layer normalisation without the scale and shift operations. This reduces overfitting, as demonstrated in (Xu et al, 2019).…”

Section: A Deep Neural Network: ResNet-TCN (mentioning)

confidence: 99%
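
The ordering described in the statement above (rectified linear activation first, then layer normalisation with no learned scale or shift) can be sketched in plain Python. This is an illustrative sketch only, not the cited authors' implementation. The function names `relu`, `layer_norm_no_affine`, and `fc_relu_norm`, the epsilon value, the input size, and the random toy weights are all assumptions; only the layer size d_model = 256 and the order of operations come from the quoted text.

```python
import math
import random

D_MODEL = 256  # size of the fully-connected layer, taken from the quoted text
EPS = 1e-6     # assumed small constant for numerical stability


def relu(x):
    """Rectified linear activation applied element-wise."""
    return [max(0.0, v) for v in x]


def layer_norm_no_affine(x, eps=EPS):
    """Normalise one frame to zero mean and unit variance over its features.

    No learned scale (gamma) or shift (beta) is applied afterwards.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in x]


def fc_relu_norm(frame, weights, bias):
    """Fully-connected layer, then ReLU, then layer norm without scale/shift."""
    pre = [sum(w * f for w, f in zip(row, frame)) + b
           for row, b in zip(weights, bias)]
    return layer_norm_no_affine(relu(pre))


# Toy usage with hypothetical sizes and random weights.
random.seed(0)
n_in = 8  # hypothetical input feature size
frame = [random.gauss(0, 1) for _ in range(n_in)]
weights = [[random.gauss(0, 0.1) for _ in range(n_in)] for _ in range(D_MODEL)]
bias = [0.0] * D_MODEL
out = fc_relu_norm(frame, weights, bias)
```

Because the normalisation comes last and has no affine parameters, every output frame has approximately zero mean and unit variance across its 256 features, whatever the weights are.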

“…Each block contains three one-dimensional causal dilated convolutional units. Here, we modify the preactivation of the convolutional units in (Zhang et al, 2020) by using the rectifier linear activation function followed by layer normalisation without the scale and shift operations (again following (Xu et al, 2019)). The kernel size, output size, and dilation rate for each convolutional unit is denoted in Figure 2 as (kernel size, output size, dilation rate).…”

Section: A Deep Neural Network: ResNet-TCN (mentioning)

confidence: 99%
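
A single one-dimensional causal dilated convolutional unit of the kind mentioned in the statement above can be sketched as follows. This is a minimal illustration under stated assumptions, not the architecture from the cited papers: the function name `causal_dilated_conv1d`, the nested-list weight layout, and the toy input are hypothetical, and the sketch covers one unit only (no residual connection, activation, or normalisation). The three parameters it exposes correspond to the (kernel size, output size, dilation rate) triple named in the quote.

```python
def causal_dilated_conv1d(x, kernel, dilation):
    """Causal dilated 1-D convolution over a sequence of frames.

    x:       list of T frames, each a list of C_in floats.
    kernel:  kernel[k][o][i] is the weight for tap k, output channel o,
             input channel i (kernel size K = len(kernel)).
    Output frame t depends only on x[t], x[t - d], x[t - 2d], ...,
    so no future frame is ever used.
    """
    k_size = len(kernel)
    c_out = len(kernel[0])
    out = []
    for t in range(len(x)):
        y = [0.0] * c_out
        for k in range(k_size):
            # The last tap aligns with the current frame; earlier taps look back.
            src = t - (k_size - 1 - k) * dilation
            if src < 0:
                continue  # equivalent to zero padding on the left
            for o in range(c_out):
                y[o] += sum(w * v for w, v in zip(kernel[k][o], x[src]))
        out.append(y)
    return out


# Toy usage: one input channel, one output channel, kernel size 3, dilation 2.
x = [[1.0], [2.0], [3.0], [4.0], [5.0]]
ones_kernel = [[[1.0]], [[1.0]], [[1.0]]]
y = causal_dilated_conv1d(x, ones_kernel, dilation=2)
# y == [[1.0], [2.0], [4.0], [6.0], [9.0]]
```

With kernel size 3 and dilation 2, the output at frame t sums frames t, t - 2, and t - 4, so the receptive field spans five frames while the unit still uses only three weights per channel pair.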

On Training Targets for Deep Learning Approaches to Clean Speech Magnitude Spectrum Estimation

Nicolson; Paliwal

2020

Preprint

Self Cite

The estimation of the clean speech short-time magnitude spectrum (MS) is key for speech enhancement and separation. Moreover, an automatic speech recognition (ASR) system that employs a front-end relies on clean speech MS estimation to remain robust. Training targets for deep learning approaches to clean speech MS estimation fall into three main categories: computational auditory scene analysis (CASA), MS, and minimum mean-square error (MMSE) training targets. In this study, we aim to determine which training target produces enhanced/separated speech at the highest quality and intelligibility, and which is most suitable as a front-end for robust ASR. The training targets were evaluated using a temporal convolutional network (TCN) on the DEMAND Voice Bank and Deep Xi datasets, which include real-world non-stationary and coloured noise sources at multiple SNR levels. Seven objective measures were used, including the word error rate (WER) of the Deep Speech ASR system. We find that MMSE training targets produce the highest objective quality scores. We also find that CASA training targets, in particular the ideal ratio mask (IRM), produce the highest intelligibility scores and perform best as a front-end for robust ASR.

Single‐Channel Noise Reduction

2023

Digital Speech Transmission and Enhancement
