2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC)
DOI: 10.1109/iwaenc.2018.8521347

Exploring Tradeoffs in Models for Low-Latency Speech Enhancement

Abstract: We explore a variety of neural network configurations for one- and two-channel spectrogram-mask-based speech enhancement. Our best model improves on previous state-of-the-art performance on the CHiME2 speech enhancement task by 0.4 decibels in signal-to-distortion ratio (SDR). We examine trade-offs such as non-causal look-ahead, computation, and parameter count versus enhancement performance and find that zero-look-ahead models can achieve, on average, within 0.03 dB SDR of our best bidirectional model. Further…
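
The abstract reports gains measured in signal-to-distortion ratio (SDR). As a point of reference, here is a minimal Python sketch of a projection-based SDR computation; it is an illustrative simplification (close to SI-SDR) rather than the full BSS-Eval decomposition typically reported on CHiME2, and the function name and epsilon are assumptions.

```python
# Illustrative, simplified SDR in dB: project the estimate onto the reference
# and compare target energy to distortion energy. Not the full BSS-Eval metric.
import numpy as np

def sdr_db(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-12) -> float:
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference       # component of the estimate explained by the reference
    distortion = estimate - target   # everything else counts as distortion
    return 10.0 * np.log10((np.sum(target**2) + eps) / (np.sum(distortion**2) + eps))
```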

Cited by 44 publications (38 citation statements)
References 11 publications
“…In this section, we describe the construction of a dataset for universal sound separation and apply the described combinations of masking networks and analysis-synthesis bases to this task. Systems * and ** come from [14] and [25,26], respectively. "Oracle BM" corresponds to an oracle binary STFT mask, a theoretical upper bound on our systems' performance.…”
Section: Methods (mentioning)
confidence: 99%
“…The first masking network we use consists of 14 dilated 2D convolutional layers, a bidirectional LSTM, and two dense layers, which we will refer to as a convolutional-LSTM-dense neural network (CLDNN). The CLDNN is based on a network which achieves state-of-the-art performance on CHiME2 WSJ0 speech enhancement [25] and strong performance on a large internal dataset [26].…”
Section: Masking Network Architectures (mentioning)
confidence: 99%
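
The excerpt above gives layer counts but not the full configuration. Below is a hedged PyTorch sketch of such a CLDNN-style masking network (dilated 2D convolutions, a bidirectional LSTM, and two dense layers); channel widths, the dilation schedule, kernel sizes, and the sigmoid mask output are illustrative assumptions, not the authors' exact architecture.

```python
# Hypothetical CLDNN-style masking network: 14 dilated 2D conv layers,
# a bidirectional LSTM, and two dense layers producing a per-bin mask.
import torch
import torch.nn as nn

class CLDNNMasker(nn.Module):
    def __init__(self, n_freq: int = 257, conv_channels: int = 32, lstm_units: int = 256):
        super().__init__()
        layers, in_ch = [], 1
        for i in range(14):                      # 14 dilated conv layers, per the citation
            dilation = 2 ** (i % 4)              # assumed dilation schedule
            layers += [nn.Conv2d(in_ch, conv_channels, kernel_size=3,
                                 padding=dilation, dilation=dilation),
                       nn.ReLU()]
            in_ch = conv_channels
        self.conv = nn.Sequential(*layers)
        self.blstm = nn.LSTM(conv_channels * n_freq, lstm_units,
                             batch_first=True, bidirectional=True)
        self.dense = nn.Sequential(
            nn.Linear(2 * lstm_units, lstm_units), nn.ReLU(),
            nn.Linear(lstm_units, n_freq), nn.Sigmoid(),   # mask values in [0, 1]
        )

    def forward(self, spec_mag: torch.Tensor) -> torch.Tensor:
        # spec_mag: (batch, time, freq) magnitude spectrogram
        x = self.conv(spec_mag.unsqueeze(1))               # (B, C, T, F)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)     # (B, T, C*F)
        x, _ = self.blstm(x)
        return self.dense(x)                               # (B, T, F) mask
```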
“…where S and S⁻¹ are the forward and inverse STFT operators. Such masking-based DNN approaches have been very successful [1,2,3,4]. However, existing approaches have two deficiencies.…”
Section: Introduction (mentioning)
confidence: 99%
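
The masking relation quoted above, x̂ = S⁻¹(M ⊙ S(y)), can be sketched directly. Here is a minimal Python/PyTorch example, assuming a real-valued mask applied to the complex STFT with the noisy phase retained; the window, FFT size, and hop length are illustrative choices, not the cited systems' settings.

```python
# Minimal mask-and-resynthesize step: x_hat = S^{-1}(M ⊙ S(y)).
import torch

def apply_mask(noisy: torch.Tensor, mask: torch.Tensor,
               n_fft: int = 512, hop: int = 128) -> torch.Tensor:
    window = torch.hann_window(n_fft)
    spec = torch.stft(noisy, n_fft, hop_length=hop, window=window,
                      return_complex=True)                 # S(y): (freq, time)
    masked = mask * spec                                   # M ⊙ S(y), noisy phase kept
    return torch.istft(masked, n_fft, hop_length=hop, window=window,
                       length=noisy.shape[-1])             # S^{-1}(...)
```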
“…The performance of speech enhancement has leaped significantly with the introduction of deep neural networks (DNNs) in recent years, e.g., [1][2][3][4][5][6][7][8][9][10][11][12]. This advance can be attributed to DNNs being unencumbered by explicit or implicit constraints on relevant data probability distributions, replacing these distributions with empirical observations, and by the ability of DNNs to capture complex relationships that cannot be expressed analytically.…”
Section: Introduction (mentioning)
confidence: 99%
“…It follows from (1) that a natural objective for enhancement is to find a good approximation of the clean speech waveform x based on the noisy observations y and available prior knowledge. Optimizing a network to minimize a measure of error between a clean signal estimate x̂ and the ground-truth x is a common approach that has yielded state-of-the-art results based on DNN approaches operating in the time-frequency domain [2][3][4][5][6][7][8] or time-only domain [10].…”
Section: Introduction (mentioning)
confidence: 99%
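
As a concrete illustration of the objective described above, here is a short Python sketch of two error measures between a clean-speech estimate x̂ and the ground truth x: a time-domain (waveform) L1 loss and a time-frequency (STFT-magnitude) MSE loss. The specific loss types and STFT parameters are assumptions for illustration, not the losses used in the cited works.

```python
# Two illustrative training objectives for comparing x_hat against clean x.
import torch

def waveform_loss(x_hat: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    # Time-only domain: mean absolute error on waveform samples.
    return torch.mean(torch.abs(x_hat - x))

def spectral_loss(x_hat: torch.Tensor, x: torch.Tensor,
                  n_fft: int = 512, hop: int = 128) -> torch.Tensor:
    # Time-frequency domain: MSE on STFT magnitudes.
    window = torch.hann_window(n_fft)
    mag = lambda s: torch.stft(s, n_fft, hop_length=hop, window=window,
                               return_complex=True).abs()
    return torch.mean((mag(x_hat) - mag(x)) ** 2)
```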