2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC) 2018
DOI: 10.1109/iwaenc.2018.8521379

Deep Neural Network Based Speech Separation Optimizing an Objective Estimator of Intelligibility for Low Latency Applications

Abstract: Mean square error (MSE) has been the preferred choice as loss function in the current deep neural network (DNN) based speech separation techniques. In this paper, we propose a new cost function with the aim of optimizing the extended short time objective intelligibility (ESTOI) measure. We focus on applications where low algorithmic latency (≤ 10 ms) is important. We use long short-term memory networks (LSTM) and evaluate our proposed approach on four sets of two-speaker mixtures from extended Danish hearing i…
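To illustrate why a correlation-based objective can behave differently from MSE, here is a minimal sketch in plain Python. It is not the cost function proposed in the paper: ESTOI itself operates on short-time one-third octave band envelopes and involves further processing that is omitted here. The function names, the toy envelope values, and the segment length `seg_len` are all hypothetical and chosen only for illustration.

```python
import math

def mse_loss(est, ref):
    """Mean square error between two equal-length sequences."""
    return sum((e - r) ** 2 for e, r in zip(est, ref)) / len(ref)

def correlation(x, y, eps=1e-12):
    """Sample correlation coefficient of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    xc = [v - mx for v in x]
    yc = [v - my for v in y]
    num = sum(a * b for a, b in zip(xc, yc))
    den = math.sqrt(sum(a * a for a in xc)) * math.sqrt(sum(b * b for b in yc))
    # eps keeps the division defined when a segment has no variation
    return num / (den + eps)

def neg_correlation_loss(est, ref, seg_len=4):
    """Negative mean correlation over sliding segments.

    seg_len is a hypothetical segment length (in frames), not a value
    taken from the paper.
    """
    starts = range(len(ref) - seg_len + 1)
    scores = [correlation(est[s:s + seg_len], ref[s:s + seg_len]) for s in starts]
    return -sum(scores) / len(scores)

# Toy temporal envelope of a single frequency band (made-up values)
ref = [0.2, 0.9, 0.4, 0.7, 0.1, 0.8, 0.3, 0.6]
scaled = [0.5 * v for v in ref]  # same modulation pattern, half the level
flat = [0.5] * len(ref)          # plausible average level, no modulation

for name, est in (("scaled", scaled), ("flat", flat)):
    print(name, mse_loss(est, ref), neg_correlation_loss(est, ref))
```

On this toy input, MSE rates the unmodulated `flat` estimate as closer to the reference than the `scaled` one, while the correlation-based loss ranks them the other way round, because it ignores overall level and rewards matching temporal modulation.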

Cited by 18 publications (12 citation statements)
References 26 publications
“…The maximization of STOI [40] during training is also the target in several publications [31][32][33][34][35][36]. In [33], Kolbæk et al. derive a differentiable approximation of STOI, which considers the frequency selectivity of the human ear, for the training of a mask-based speech enhancement DNN.…”
Section: Baseline Pw-stoi
Citation type: mentioning
confidence: 99%
“…The parameters of the deep learning architectures are then optimized by minimizing the MSE between the inferred results and their corresponding targets. In reality, optimization of the MSE loss in training does not guarantee any perceptual quality of the speech component and of the residual noise component, respectively, which leads to limited performance [27][28][29][30][31][32][33][34][35][36]. This effect is even more evident when the level of the noise component is significantly higher than that of the speech component in some regions of the noisy speech spectrum, which explains the bad performance at lower SNR conditions when training with MSE.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…Venkataramani and Smaragdis [31] use an objective function optimisation model that is equivalent to the signal-to-distortion ratio (SDR) measure, which can significantly improve the SDR scores of SE. Later, Kolbaek et al. [32] and Naithani et al. [33] propose loss functions for optimising the short-time objective intelligibility (STOI) and the extended STOI, respectively, and Kim et al. [34] present a denoising framework with the goal of joint SDR and perceptual evaluation of speech quality (PESQ) optimisation.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…[32] and Naithani et al. [33] propose loss functions for optimising the short-time objective intelligibility (STOI) and the extended STOI, respectively, and Kim et al. [34] present a denoising framework with the goal of joint SDR and perceptual evaluation of speech quality (PESQ) optimisation.…”
Section: Introduction
Citation type: mentioning
confidence: 99%