The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Testing Framework, and Challenge Results

Reddy, Chandan K.; Gopal, Vishak; Cutler, Ross; Beyrami, Ebrahim; Cheng, Roger; Dubey, Harishchandra; Matusevych, Sergiy; Aichner, Robert; Aazami, Ashkan; Braun, Sebastian; Rana, Puneet; Srinivasan, Sriram; Gehrke, Johannes

doi:10.21437/interspeech.2020-3038

Cited by 218 publications

(94 citation statements)

References 0 publications

“…When training Tiny DCU-Net-16, we use the clean speech and noise speech dataset released from the INTERSPEECH 2020 DNS challenge [24]. The number of its trainable parameters is 108162 ≈ 108K.…”

Section: Methods (mentioning)

confidence: 99%

ICASSP 2021 Acoustic Echo Cancellation Challenge: Integrated Adaptive Echo Cancellation with Time Alignment and Deep Learning-Based Residual Echo Plus Noise Suppression

Peng, Cheng, Zheng, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

This paper describes a three-stage acoustic echo cancellation (AEC) and suppression framework for the ICASSP 2021 AEC Challenge. In the first stage, a partitioned block frequency domain adaptive filtering is implemented to cancel the linear echo components without introducing the near-end speech distortion, where we compensate the time delay between the far-end reference signal and the microphone signal beforehand. In the second stage, a deep complex U-Net integrated with gated recurrent unit is proposed to further suppress the residual echo components. In the last stage, an extremely tiny deep complex U-Net is trained to suppress non-speech residual components that have not been suppressed completely in the second stage, which can also further increase the echo return loss enhancement (ERLE) without increasing the computational complexity dramatically. Experimental results show that the proposed three-stage framework can get the ERLE higher than 50 dB in both single-talk and double-talk scenarios, and perceptual evaluation of speech quality can be improved about 0.75 in double-talk scenarios. The proposed framework outperforms the AEC-Challenge baseline ResRNN by 0.12 points in terms of the MOS.
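The abstract above reports echo return loss enhancement (ERLE) in dB. As a minimal sketch, assuming the common textbook definition (ratio of microphone-signal power to residual power after cancellation, measured while only far-end echo is present), the metric could be computed as below. The function name `erle_db` and the plain-list signal representation are illustrative assumptions, not taken from the cited paper.

```python
import math


def erle_db(mic, residual):
    """Return ERLE in dB: 10*log10(mic power / residual power).

    `mic` is the microphone signal containing echo, `residual` is the
    signal after echo cancellation. Both are equal-length sample lists.
    """
    if len(mic) != len(residual) or not mic:
        raise ValueError("signals must be non-empty and equally long")
    mic_power = sum(x * x for x in mic) / len(mic)
    res_power = sum(e * e for e in residual) / len(residual)
    if res_power == 0.0:
        return math.inf  # echo removed completely
    if mic_power == 0.0:
        return -math.inf  # no echo energy at the microphone
    return 10.0 * math.log10(mic_power / res_power)
```

Under this definition, an ERLE above 50 dB means the residual power is less than 1/100000 of the microphone-signal power.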

“…For example, the recordings from denoising algorithms may contain residuals of noise and reverberation, and the synthesized speech from vocoders may contain robotic sound. Therefore, to match the test-time conditions, we add 15-25dB noise randomly drawn from the DNS Challenge Dataset [36] to the input narrowband signal during training.…”

Section: Noise Augmentation (mentioning)

confidence: 99%

Bandwidth Extension is All You Need

Wang, Finkelstein, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Speech generation and enhancement have seen recent breakthroughs in quality thanks to deep learning. These methods typically operate at a limited sampling rate of 16-22kHz due to computational complexity and available datasets. This limitation imposes a gap between the output of such methods and that of high-fidelity (≥44kHz) real-world audio applications. This paper proposes a new bandwidth extension (BWE) method that expands 8-16kHz speech signals to 48kHz. The method is based on a feed-forward WaveNet architecture trained with a GAN-based deep feature loss. A mean-opinion-score (MOS) experiment shows significant improvement in quality over state-of-the-art BWE methods. An AB test reveals that our 16-to-48kHz BWE is able to achieve fidelity that is typically indistinguishable from real high-fidelity recordings. We use our method to enhance the output of recent speech generation and denoising methods, and experiments demonstrate significant improvement in sound quality over these baselines. We propose this as a general approach to narrow the gap between generated speech and recorded speech, without the need to adapt such methods to higher sampling rates.
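The citation statement for this paper describes adding noise at 15-25 dB to the input signal during training. A minimal sketch of that kind of augmentation follows, assuming the figure refers to signal-to-noise ratio, that the SNR is drawn uniformly from the range, and that speech and noise are equal-length sample lists. The names `mix_at_snr` and `augment` are hypothetical and do not come from the cited paper.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio is `snr_db`, then add it."""
    if len(speech) != len(noise) or not speech:
        raise ValueError("signals must be non-empty and equally long")
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0.0:
        return list(speech)  # silent noise clip: nothing to add
    # Required noise power is speech_power / 10^(snr_db / 10).
    gain = math.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]


def augment(speech, noise, rng, low_db=15.0, high_db=25.0):
    """Draw an SNR uniformly from [low_db, high_db] (an assumption) and mix."""
    snr_db = rng.uniform(low_db, high_db)
    return mix_at_snr(speech, noise, snr_db), snr_db


# Usage: a synthetic tone stands in for speech, Gaussian samples for noise.
rng = random.Random(0)
speech = [math.sin(0.1 * i) for i in range(1000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]
mixed, snr_db = augment(speech, noise, rng)
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible across training runs.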

“…All considered algorithms were trained and evaluated on the DNS Challenge dataset [22]. In total, this dataset contains more than 500 h of speech from 2150 speakers and 180 h of noise from 150 different noise classes at a sampling frequency of 16 kHz.…”

Section: Dataset (mentioning)

confidence: 99%

“…More in particular, we propose to train temporal convolutional networks [11,20] to map the noisy speech STFT coefficients to the required quantities, i.e., the noise correlation matrix and the a-priori SNR, by minimizing the scale-invariant signal-to-distortion ratio loss function [21] at the MFMVDR filter output. Experimental results using the INTERSPEECH 2020 Deep Noise Suppression (DNS) Challenge dataset [22] show that the proposed deep MFMVDR filter outperforms complex-valued masking as well as directly estimating the multi-frame filter without exploiting the MFMVDR structure and Conv-TasNet [11].…”

Section: Introduction (mentioning)

confidence: 99%

Deep Multi-Frame MVDR Filtering for Single-Microphone Speech Enhancement

Tammen, Doclo 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Multi-frame algorithms for single-microphone speech enhancement, e.g., the multi-frame minimum variance distortionless response (MFMVDR) filter, are able to exploit speech correlation across adjacent time frames in the short-time Fourier transform (STFT) domain. Provided that accurate estimates of the required speech interframe correlation vector and the noise correlation matrix are available, it has been shown that the MFMVDR filter yields a substantial noise reduction while hardly introducing any speech distortion. Aiming at merging the speech enhancement potential of the MFMVDR filter and the estimation capability of temporal convolutional networks (TCNs), in this paper we propose to embed the MFMVDR filter within a deep learning framework. The TCNs are trained to map the noisy speech STFT coefficients to the required quantities by minimizing the scale-invariant signal-to-distortion ratio loss function at the MFMVDR filter output. Experimental results show that the proposed deep MFMVDR filter achieves a competitive speech enhancement performance on the Deep Noise Suppression Challenge dataset. In particular, the results show that estimating the parameters of an MFMVDR filter yields a higher performance in terms of PESQ and STOI than directly estimating the multi-frame filter or single-frame masks and than Conv-TasNet.
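The abstract above trains its networks by minimizing a scale-invariant signal-to-distortion ratio (SI-SDR) loss. A minimal sketch of that quantity is given below, assuming the widely used definition in which the estimate is projected onto the target and the loss is the negated SI-SDR in dB. The function names and the plain-list inputs are illustrative assumptions, and the cited paper's exact loss formulation may differ.

```python
import math


def si_sdr_db(estimate, target):
    """Scale-invariant SDR in dB between an estimate and a reference signal."""
    if len(estimate) != len(target) or not target:
        raise ValueError("signals must be non-empty and equally long")
    target_energy = sum(t * t for t in target)
    if target_energy == 0.0:
        raise ValueError("target signal has zero energy")
    # Project the estimate onto the target to find the optimal scaling.
    alpha = sum(e * t for e, t in zip(estimate, target)) / target_energy
    scaled_target = [alpha * t for t in target]
    distortion = [e - s for e, s in zip(estimate, scaled_target)]
    signal_energy = sum(s * s for s in scaled_target)
    distortion_energy = sum(d * d for d in distortion)
    if distortion_energy == 0.0:
        return math.inf  # estimate is an exact scaled copy of the target
    if signal_energy == 0.0:
        return -math.inf  # estimate is orthogonal to the target
    return 10.0 * math.log10(signal_energy / distortion_energy)


def si_sdr_loss(estimate, target):
    """Negated SI-SDR, so that minimizing the loss maximizes SI-SDR."""
    return -si_sdr_db(estimate, target)
```

Because of the projection step, rescaling the estimate by any non-zero constant leaves the value unchanged, which is what makes the measure scale-invariant.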
