2020
DOI: 10.1109/taslp.2020.2968738

On Loss Functions for Supervised Monaural Time-Domain Speech Enhancement

Abstract: Many deep learning-based speech enhancement algorithms are designed to minimize the mean-square error (MSE) in some transform domain between a predicted and a target speech signal. However, optimizing for MSE does not necessarily guarantee high speech quality or intelligibility, which is the ultimate goal of many speech enhancement algorithms. Additionally, little is known about the impact of the loss function on the emerging class of time-domain deep learning-based speech enhancement systems. We study h…
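The abstract's starting point is the standard MSE objective between a predicted and a target signal. A minimal sketch of that objective in the time domain (the function name and NumPy implementation are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def mse_loss(pred, target):
    """Mean-square error between a predicted and a target waveform.

    Hypothetical sketch of the MSE objective described in the abstract;
    in practice the error may be computed in a transform domain
    (e.g. on STFT coefficients) rather than on raw samples.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    # Average squared sample-wise deviation over the whole signal
    return np.mean((pred - target) ** 2)
```

The paper's point is that driving this quantity to zero does not by itself guarantee perceptual quality or intelligibility, which motivates the comparison of alternative loss functions.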

Cited by 116 publications (51 citation statements)
References 57 publications (79 reference statements)
“…U-Net is chosen as the network in this letter, which has been widely adopted for the speech separation task [30]. As shown in Fig.…”
Section: B Network Architecture
confidence: 99%
“…This letter chooses three loss functions as baselines: MSE in (4), the Time-MSE-based loss (TMSE) [30], and the recently proposed SI-SDR-based loss [30]. As a T-F domain-based network is used, an additional fixed iSTFT-like layer is needed to transform the estimated T-F spectrum back into the time domain for the TMSE- and SI-SDR-based losses [31].…”
Section: Loss Functions and Training Models
confidence: 99%
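The excerpt above compares MSE-based losses with an SI-SDR-based loss. A minimal sketch of a scale-invariant SDR training loss (the function name, NumPy implementation, and eps stabilizer are my own assumptions; this is the generic SI-SDR formulation, not necessarily the exact variant used in the cited work):

```python
import numpy as np

def si_sdr_loss(estimate, target, eps=1e-8):
    """Negative scale-invariant SDR, usable as a training loss.

    Hypothetical sketch: project the estimate onto the target to get a
    scaled reference, then compare its energy to the residual's energy.
    """
    estimate = np.asarray(estimate, dtype=float)
    target = np.asarray(target, dtype=float)
    # Zero-mean both signals so the scaling projection is well-defined
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Optimal scaling of the target toward the estimate
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = alpha * target          # scaled target component
    e_noise = estimate - s_target      # everything not explained by the target
    si_sdr = 10.0 * np.log10(
        np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps)
    )
    # Minimizing the negative SI-SDR maximizes the SI-SDR
    return -si_sdr
```

Because of the projection step, the loss is invariant to rescaling the estimate, which is one reason SI-SDR-motivated objectives are popular for time-domain enhancement and separation.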
“…Another class of SE methods proposes to directly perform enhancement on the raw waveform [27]- [31], which are generally called waveform-mapping-based approaches. Among the deep learning models, fully convolutional networks (FCNs) have been widely used to directly perform waveform mapping [28], [32]- [34].…”
Section: Introduction
confidence: 99%
“…With the current rapid strides in neural networks (NNs) and deep learning, several sophisticated architectures have been proposed and successfully used for single-channel source separation [1,2,3,4]. More recently, we have started to operate directly on the waveforms with several end-to-end approaches available [2,5,6], and use better cost-functions motivated by the Source-to-Distortion ratio (SDR) [7,8,9,10,11,2]. Using deep-clustering [1] and permutation-invariant training [12], we can train the networks to perform speaker-independent source separation.…”
Section: Introduction
confidence: 99%