ICASSP 2021 Deep Noise Suppression Challenge: Decoupling Magnitude and Phase Optimization with a Two-Stage Deep Network

Li, Andong; Liu, Wenzhe; Luo, Xiaoxue; Zheng, Chengshi; Li, Xiaodong

doi:10.1109/icassp39728.2021.9414062

Cited by 45 publications

(22 citation statements)

References 29 publications

“…2, where more noise is suppressed when CP is applied. Finally, when post-processing is applied, PESQ is decreased due to some spectrum information lost [33]. However the use of post-processing is beneficial to subjective listening as shown in previous works [27,33] because unnatural residual noise is further suppressed.…”

Section: Results (mentioning)

confidence: 94%

DCCRN+: Channel-Wise Subband DCCRN with SNR Estimation for Speech Enhancement

Lv, Zhang, et al. 2021

Interspeech 2021

Deep complex convolution recurrent network (DCCRN), which extends CRN with complex structure, has achieved superior performance in MOS evaluation in Interspeech 2020 deep noise suppression challenge (DNS2020). This paper further extends DCCRN with the following significant revisions. We first extend the model to sub-band processing where the bands are split and merged by learnable neural network filters instead of engineered FIR filters, leading to a faster noise suppressor trained in an end-to-end manner. Then the LSTM is further substituted with a complex TF-LSTM to better model temporal dependencies along both time and frequency axes. Moreover, instead of simply concatenating the output of each encoder layer to the input of the corresponding decoder layer, we use convolution blocks to first aggregate essential information from the encoder output before feeding it to the decoder layers. We specifically formulate the decoder with an extra a priori SNR estimation module to maintain good speech quality while removing noise. Finally a post-processing module is adopted to further suppress the unnatural residual noise. The new model, named DCCRN+, has surpassed the original DCCRN as well as several competitive models in terms of PESQ and DNSMOS, and has achieved superior performance in the new Interspeech 2021 DNS challenge.

“…To ensure consistency in the optimization of the RI and magnitude spectrum, we adopt the loss function form of combined mean square error (cMSE) in [2], as follows:…”

Section: Loss Function (mentioning)

confidence: 99%

“…However, most of the previous studies on speech enhancement are for narrow-band (8 kHz) or wide-band (16 kHz) audio, and there are few methods for 48 kHz full-band audio. Deep learning-based speech enhancement methods [1,2,3] have achieved impressive performance on wide-band audio, but the lack of sufficient training data has become a major limitation for full-band deep learning speech enhancement methods. The recent 4th Microsoft Deep Noise Suppression (DNS-4) Challenge extends efforts to full-band single-channel speech enhancement tasks with a massive training dataset and real-scenario test set.…”

Section: Introduction (mentioning)

confidence: 99%

FB-MSTCN: A Full-Band Single-Channel Speech Enhancement Method Based on Multi-Scale Temporal Convolutional Network

Zhang, Zhang, Zhuang, et al. 2022

Preprint

In recent years, deep learning-based approaches have significantly improved the performance of single-channel speech enhancement. However, due to the limitation of training data and computational complexity, real-time enhancement of full-band (48 kHz) speech signals is still very challenging. Because of the low energy of spectral information in the high-frequency part, it is more difficult to directly model and enhance the full-band spectrum using neural networks. To solve this problem, this paper proposes a two-stage real-time speech enhancement model with extraction-interpolation mechanism for a full-band signal. The 48 kHz full-band time-domain signal is divided into three sub-channels by extracting, and a two-stage processing scheme of 'masking + compensation' is proposed to enhance the signal in the complex domain. After the two-stage enhancement, the enhanced full-band speech signal is restored by interval interpolation. In the subjective listening and word accuracy test, our proposed model achieves superior performance and outperforms the baseline model overall by 0.59 MOS and 4.0% WAcc for the non-personalized speech denoising task.
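One plausible reading of the extraction and interval-interpolation steps in the abstract above is an interleaved (polyphase-style) split: every third sample of the 48 kHz signal goes to the same sub-channel, and the sub-channels are interleaved again after enhancement. The sketch below illustrates only that reading. The function names `extract_subchannels` and `interleave_subchannels` are hypothetical, the enhancement stages are omitted, and nothing here is taken from the authors' implementation.

```python
def extract_subchannels(signal, num_channels=3):
    """Split a signal into interleaved sub-channels.

    Sub-channel k holds samples k, k + num_channels, k + 2 * num_channels, ...
    (assumed reading of "extracting"; not confirmed by the source).
    """
    return [signal[k::num_channels] for k in range(num_channels)]


def interleave_subchannels(subchannels):
    """Rebuild the full-rate signal by interleaving sub-channel samples."""
    num_channels = len(subchannels)
    total = sum(len(channel) for channel in subchannels)
    restored = [0.0] * total
    for k, channel in enumerate(subchannels):
        # Extended slice assignment writes every num_channels-th slot.
        restored[k::num_channels] = channel
    return restored


# Toy stand-in for a 48 kHz signal: 12 samples with values 0.0 .. 11.0.
full_band = [float(n) for n in range(12)]
subs = extract_subchannels(full_band)       # three sub-channels of 4 samples
restored = interleave_subchannels(subs)     # same samples, original order
```

Under this assumption each sub-channel of a 48 kHz signal has an effective rate of 16 kHz, and the split followed by interleaving returns the original samples when the sub-channels are left unmodified.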

“…However, they are targeted at teleconferencing scenarios, where a processing latency as large as 40 ms is allowed. For example, DCCRN [14] has an algorithmic latency of 62.5 ms and TSCN-PP [53] 20 ms. In addition, these models share many similarities with our complex T-F domain DNN models and can straightforwardly leverage our proposed techniques to reduce their algorithmic latency.…”

Section: B Benchmark Systems (mentioning)

confidence: 99%

STFT-Domain Neural Speech Enhancement with Very Low Algorithmic Latency

Wang, Wichern, Watanabe, et al. 2022

Preprint

Deep learning based speech enhancement in the short-time Fourier transform (STFT) domain typically uses a large window length such as 32 ms. A larger window contains more samples and the frequency resolution can be higher for potentially better enhancement. This however incurs an algorithmic latency of 32 ms in an online setup, because the overlap-add algorithm used in the inverse STFT (iSTFT) is also performed based on the same 32 ms window size. To reduce this inherent latency, we adapt a conventional dual window size approach, where a regular input window size is used for STFT but a shorter output window is used for the overlap-add in the iSTFT, for STFT-domain deep learning based frame-online speech enhancement. Based on this STFT and iSTFT configuration, we employ single- or multi-microphone complex spectral mapping for frame-online enhancement, where a deep neural network (DNN) is trained to predict the real and imaginary (RI) components of target speech from the mixture RI components. In addition, we use the RI components predicted by the DNN to conduct frame-online beamforming, the results of which are then used as extra features for a second DNN to perform frame-online post-filtering. The frequency-domain beamforming in between the two DNNs can be easily integrated with complex spectral mapping and is designed to not incur any algorithmic latency. Additionally, we propose a future-frame prediction technique to further reduce the algorithmic latency. Evaluation results on a noisy-reverberant speech enhancement task demonstrate the effectiveness of the proposed algorithms. Compared with Conv-TasNet, our STFT-domain system can achieve better enhancement performance for a comparable amount of computation, or comparable performance with less computation, maintaining strong performance at an algorithmic latency as low as 2 ms.
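The relationship between window size and algorithmic latency described in the abstract above can be illustrated with a little arithmetic. The sketch below assumes a 16 kHz sampling rate and a hypothetical 4 ms output window; neither value comes from the source, and the helper `algorithmic_latency_ms` is an illustrative name, not part of any library. It counts only the latency imposed by the overlap-add window and ignores computation time and any look-ahead frames.

```python
def algorithmic_latency_ms(output_window_samples, sample_rate_hz):
    """Latency implied by the overlap-add output window, in milliseconds.

    Simplified model: an output sample is final only once the whole
    synthesis window covering it is available, so the latency equals the
    output window length.
    """
    return 1000.0 * output_window_samples / sample_rate_hz


SAMPLE_RATE_HZ = 16000  # assumption: wide-band audio

# Same 32 ms window (512 samples) for STFT analysis and iSTFT overlap-add.
regular_latency = algorithmic_latency_ms(512, SAMPLE_RATE_HZ)

# Dual window size: keep the 32 ms input window, but overlap-add with a
# shorter output window (hypothetical 64 samples, i.e. 4 ms).
dual_window_latency = algorithmic_latency_ms(64, SAMPLE_RATE_HZ)

print(f"regular window: {regular_latency} ms")
print(f"shorter output window: {dual_window_latency} ms")
```

In this simplified model the input window can stay long for frequency resolution while the latency follows the output window alone, which is the trade-off the dual window size approach exploits.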
