Components loss for neural networks in mask-based speech enhancement

Xu, Ziyi; Elshamy, Samy; Zhao, Ziyue; Fingscheidt, Tim

doi:10.1186/s13636-021-00207-6

Cited by 12 publications

(10 citation statements)

References 43 publications


“…Our previous work denoted as "FCRN/PESQNet, [24]" achieves two 1st-ranked metrics and one 2nd rank and significantly outperforms the DNS3 baseline [49] in speech quality measured by PESQ. Under both reverberation conditions, the components loss baseline "FCRN/CL [50]" offers around 0.1 points higher PESQ scores compared to the DNS3 baseline [49], but does not perform so well on DNSMOS. Furthermore, the CL baseline "FCRN/CL [50]" offers the worst dereverberation effects reflected by the lowest SRMR scores among all the baseline methods.…”

Section: Hyperparameter Optimization and Analysis (mentioning)

confidence: 91%

“…In [50], a components loss (CL) was proposed for training a mask-based speech enhancement neural network, which offers separate controls over preservation of the speech component quality, suppression of the noise component, and preservation of a natural sounding residual noise component. The experimental results of [50] show improved and balanced performance compared to the conventional MSE loss, the approximated differentiable PESQ loss proposed in [28], and the perceptual weighting filter loss proposed in [30], which is based on code-excited linear predictive (CELP) speech coding. We fine-tune the pre-trained DNS model employing CL on $\mathcal{D}^{\mathrm{train}}_{\mathrm{DNS3}}$.…”

Section: Components Loss Baseline (mentioning)

confidence: 99%
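
The three controls named in the statement above (speech preservation, noise suppression, natural-sounding residual noise) can be illustrated with a minimal NumPy sketch. It assumes a real-valued mask applied separately to clean-speech and noise magnitude spectra. The squared-error terms, the weights `alpha` and `beta`, and the function name `components_loss` are illustrative assumptions, not the exact formulation of [50].

```python
import numpy as np

def components_loss(mask, speech_mag, noise_mag, alpha=0.5, beta=0.0):
    """Illustrative components-style loss (assumed form, not the formula of [50]).

    mask, speech_mag, noise_mag: arrays of identical shape (frames x bins).
    alpha weights noise suppression, beta weights residual-noise naturalness,
    and the remaining weight (1 - alpha - beta) goes to speech preservation.
    """
    eps = 1e-12
    # Apply the same mask to each component separately.
    filtered_speech = mask * speech_mag
    filtered_noise = mask * noise_mag

    # Term 1: the filtered speech component should match the clean speech.
    j_speech = np.mean((filtered_speech - speech_mag) ** 2)

    # Term 2: the filtered noise component should carry little power.
    j_noise = np.mean(filtered_noise ** 2)

    # Term 3 (assumed proxy for "naturalness"): the residual noise should keep
    # the spectral shape of the input noise, compared after normalization.
    shape_residual = filtered_noise / (np.linalg.norm(filtered_noise) + eps)
    shape_input = noise_mag / (np.linalg.norm(noise_mag) + eps)
    j_natural = np.mean((shape_residual - shape_input) ** 2)

    return (1.0 - alpha - beta) * j_speech + alpha * j_noise + beta * j_natural


if __name__ == "__main__":
    speech = np.array([[2.0, 0.0], [0.0, 2.0]])
    noise = np.ones((2, 2))
    # An all-pass mask keeps speech intact but lets all noise through.
    print(components_loss(np.ones((2, 2)), speech, noise))
    # An all-zero mask removes the noise but destroys the speech.
    print(components_loss(np.zeros((2, 2)), speech, noise))
```

Because each term sees only one component, changing `alpha` or `beta` trades speech distortion against noise attenuation and residual-noise shape without the terms being entangled, which is the kind of separate control the statement describes.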

“…We fine-tune the pre-trained DNS model employing CL on $\mathcal{D}^{\mathrm{train}}_{\mathrm{DNS3}}$. This baseline method is called "FCRN/CL [50]" in the following analysis of the results.…”

Section: Components Loss Baseline (mentioning)

confidence: 99%

“…5) Differentiable PESQ Loss Baseline: In [50], the CL has already shown better performance than the differentiable PESQ loss proposed in [28]. However, we also report on the differentiable PESQ loss [28] as an additional baseline.…”

Section: Components Loss Baseline (mentioning)

confidence: 99%


Deep Noise Suppression Maximizing Non-Differentiable PESQ Mediated by a Non-Intrusive PESQNet

Strake

Fingscheidt

2022

IEEE/ACM Trans. Audio Speech Lang. Process.

Self Cite


Speech enhancement employing deep neural networks (DNNs) for denoising is called deep noise suppression (DNS). A DNS trained with mean squared error (MSE) losses cannot guarantee good perceptual quality. Perceptual evaluation of speech quality (PESQ) is a widely used metric for evaluating speech quality. However, the original PESQ algorithm is non-differentiable and therefore cannot be used directly as an optimization criterion for gradient-based learning. In this work, we propose an end-to-end non-intrusive PESQNet DNN to estimate the PESQ scores of the enhanced speech signal. By providing a reference-free perceptual loss, it serves as a mediator for the DNS training, allowing the PESQ score of the enhanced speech signal to be maximized. We illustrate the potential of our proposed PESQNet-mediated training on a strong baseline DNS. As a further novelty, we propose to train the DNS and the PESQNet alternatingly, to keep the PESQNet up to date and performing well specifically for the DNS under training. Detailed analysis shows that the PESQNet mediation further increases the DNS performance by about 0.1 PESQ points on synthetic test data and by 0.03 DNSMOS points on real test data, compared to training with the MSE-based loss. Our proposed method outperforms the Interspeech 2021 DNS Challenge baseline by 0.2 PESQ points on synthetic test data and 0.1 DNSMOS points on real test data. Furthermore, it improves on the same DNS trained with an approximated differentiable PESQ loss by about 0.4 PESQ points on synthetic test data and 0.2 DNSMOS points on real test data.


“…The network is trained using either ideal binary mask (IBM) or ideal ratio mask (IRM) as training targets [20, 21]. Typically, the networks are trained using the mean squared error (MSE) either on the masks or on the reconstructed signal [22, 23]. Despite the promising performance achievable in terms of SDR and intelligibility, the presence of artifacts in the reconstructed signals compromises the performance of further processing stages.…”

Section: Introduction (mentioning)

confidence: 99%
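
The two training targets named in the statement above can be written down in a few lines. The sketch below uses common textbook definitions (a 0 dB local-SNR threshold for the IBM, a square-root power ratio for the IRM) and an MSE computed on the mask; the function names, the threshold value, and the `eps` constant are assumptions for illustration and are not taken from the cited works.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero and log of zero

def ideal_binary_mask(speech_mag, noise_mag, threshold_db=0.0):
    """1 where the local SNR exceeds the threshold, else 0 (assumed definition)."""
    snr_db = 20.0 * np.log10((speech_mag + EPS) / (noise_mag + EPS))
    return (snr_db > threshold_db).astype(np.float64)

def ideal_ratio_mask(speech_mag, noise_mag):
    """Square root of the speech-to-mixture power ratio (assumed definition)."""
    s2 = speech_mag ** 2
    n2 = noise_mag ** 2
    return np.sqrt(s2 / (s2 + n2 + EPS))

def mask_mse(estimated_mask, target_mask):
    """MSE computed on the mask itself rather than on the reconstructed signal."""
    return np.mean((estimated_mask - target_mask) ** 2)


if __name__ == "__main__":
    speech = np.array([3.0, 4.0, 0.0])
    noise = np.array([4.0, 3.0, 2.0])
    print(ideal_binary_mask(speech, noise))
    print(ideal_ratio_mask(speech, noise))
    print(mask_mse(np.full(3, 0.5), ideal_ratio_mask(speech, noise)))
```

Computing the MSE on the reconstructed signal instead would multiply both masks by the noisy magnitude before taking the squared difference.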

Time-Domain Joint Training Strategies of Speech Enhancement and Intent Classification Neural Models

Ali

Falavigna

Brutti

2022

Sensors


Robustness against background noise and reverberation is essential for many real-world speech-based applications. One way to achieve this robustness is to employ a speech enhancement front-end that, independently of the back-end, removes the environmental perturbations from the target speech signal. However, although the enhancement front-end typically increases the speech quality from an intelligibility perspective, it tends to introduce distortions which deteriorate the performance of subsequent processing modules. In this paper, we investigate strategies for jointly training neural models for both speech enhancement and the back-end, which optimize a combined loss function. In this way, the enhancement front-end is guided by the back-end to provide more effective enhancement. Unlike typical state-of-the-art approaches employing spectral features or neural embeddings, we operate in the time domain, processing raw waveforms in both components. As an application scenario, we consider intent classification in noisy environments. In particular, the front-end speech enhancement module is based on Wave-U-Net while the intent classifier is implemented as a temporal convolutional network. Exhaustive experiments are reported on versions of the Fluent Speech Commands corpus contaminated with noises from the Microsoft Scalable Noisy Speech Dataset, providing insight into the most promising training approaches.


SIGANEO: Similarity network with GAN enhancement for immunogenic neoepitope prediction

Ye

Shen

Wang et al.

2023

Computational and Structural Biotechnology Journal
