Exploring Tradeoffs in Models for Low-Latency Speech Enhancement

Wilson, Kevin; Chinen, Michael; Thorpe, Jeremy; Patton, Brian; Hershey, John R.; Saurous, Rif A.; Skoglund, Jan; Lyon, Richard F.

doi:10.1109/iwaenc.2018.8521347

Cited by 44 publications

(38 citation statements)

References 11 publications


“…In this section, we describe the construction of a dataset for universal sound separation and apply the described combinations of masking networks and analysis-synthesis bases to this task. Systems * and ** come from [14] and [25,26], respectively. "Oracle BM" corresponds to an oracle binary STFT mask, a theoretical upper bound on our systems' performance.…”

Section: Methods (mentioning)

confidence: 99%

“…The first masking network we use consists of 14 dilated 2D convolutional layers, a bidirectional LSTM, and two dense layers, which we will refer to as a convolutional-LSTM-dense neural network (CLDNN). The CLDNN is based on a network which achieves state-of-the-art performance on CHiME2 WSJ0 speech enhancement [25] and strong performance on a large internal dataset [26].…”

Section: Masking Network Architectures (mentioning)

confidence: 99%

Universal Sound Separation

Kavalerov, Wisdom, Erdoğan, et al. 2019

2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)

Self Cite

Recent deep learning approaches have achieved impressive performance on speech enhancement and separation tasks. However, these approaches have not been investigated for separating mixtures of arbitrary sounds of different types, a task we refer to as universal sound separation, and it is unknown how performance on speech tasks carries over to non-speech tasks. To study this question, we develop a dataset of mixtures containing arbitrary sounds, and use it to investigate the space of mask-based separation architectures, varying both the overall network architecture and the framewise analysis-synthesis basis for signal transformations. These network architectures include convolutional long short-term memory networks and time-dilated convolution stacks inspired by the recent success of time-domain enhancement networks like ConvTasNet. For the latter architecture, we also propose novel modifications that further improve separation performance. In terms of the framewise analysis-synthesis basis, we explore both a short-time Fourier transform (STFT) and a learnable basis, as used in ConvTasNet. For both of these bases, we also examine the effect of window size. In particular, for STFTs, we find that longer windows (25-50 ms) work best for speech/non-speech separation, while shorter windows (2.5 ms) work best for arbitrary sounds. For learnable bases, shorter windows (2.5 ms) work best on all tasks. Surprisingly, for universal sound separation, STFTs outperform learnable bases. Our best methods produce an improvement in scale-invariant signal-to-distortion ratio of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation.
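The abstract above reports results as an improvement in scale-invariant signal-to-distortion ratio (SI-SDR). As a rough illustration of what that number measures, here is a minimal sketch using the commonly cited definition: the estimate is projected onto the reference, and the energy of that projection is compared with the energy of the remainder. The function names `si_sdr` and `si_sdr_improvement` are hypothetical and are not taken from the paper, which may compute the metric differently in detail.

```python
import math


def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB for two equal-length sample sequences.

    Illustrative sketch only; names and edge-case handling are assumptions.
    """
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    ref_energy = sum(r * r for r in reference)
    if ref_energy == 0:
        raise ValueError("reference signal must not be all zeros")

    # Optimal scaling of the reference so that it best matches the estimate.
    alpha = sum(e * r for e, r in zip(estimate, reference)) / ref_energy
    target = [alpha * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]

    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    if residual_energy == 0:
        return math.inf  # estimate is an exact scaled copy of the reference
    if target_energy == 0:
        return -math.inf  # estimate is orthogonal to the reference
    return 10 * math.log10(target_energy / residual_energy)


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDR of the separated estimate minus SI-SDR of the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)
```

Because the reference is rescaled before the comparison, multiplying the estimate by any nonzero constant leaves the score unchanged, which is what makes the metric scale-invariant.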

“…where S and S^{-1} are forward and inverse STFT operators. Such masking-based DNN approaches have been very successful [1,2,3,4]. However, existing approaches have two deficiencies.…”

Section: Introduction (mentioning)

confidence: 99%

Differentiable Consistency Constraints for Improved Deep Speech Enhancement

Wisdom, Hershey, Wilson, et al. 2019

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

In recent years, deep networks have led to dramatic improvements in speech enhancement by framing it as a data-driven pattern recognition problem. In many modern enhancement systems, large amounts of data are used to train a deep network to estimate masks for complex-valued short-time Fourier transforms (STFTs) to suppress noise and preserve speech. However, current masking approaches often neglect two important constraints: STFT consistency and mixture consistency. Without STFT consistency, the system's output is not necessarily the STFT of a time-domain signal, and without mixture consistency, the sum of the estimated sources does not necessarily equal the input mixture. Furthermore, the only previous approaches that apply mixture consistency use real-valued masks; mixture consistency has been ignored for complex-valued masks. In this paper, we show that STFT consistency and mixture consistency can be jointly imposed by adding simple differentiable projection layers to the enhancement network. These layers are compatible with real or complex-valued masks. Using both of these constraints with complex-valued masks provides a 0.7 dB increase in scale-invariant signal-to-distortion ratio (SI-SDR) on a large dataset of speech corrupted by a wide variety of nonstationary noise across a range of input SNRs.
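The mixture-consistency constraint described in the abstract above (the estimated sources should sum to the input mixture) can be illustrated with a small projection step. The sketch below assumes the simplest, unweighted form, in which the leftover residual is split equally among the sources, and it works on plain lists of real-valued samples for readability. The function name `project_mixture_consistency` is hypothetical, and the projection layers in the paper may be weighted or applied to STFT coefficients instead.

```python
def project_mixture_consistency(estimates, mixture):
    """Adjust source estimates so that they sum exactly to the mixture.

    estimates: list of equal-length sample lists, one per source.
    mixture:   sample list of the same length.
    Assumption: the residual is shared equally among the sources.
    """
    if not estimates:
        raise ValueError("at least one source estimate is required")
    if any(len(src) != len(mixture) for src in estimates):
        raise ValueError("every estimate must match the mixture length")

    n_sources = len(estimates)
    # Part of the mixture that the current estimates fail to explain.
    residual = [
        m - sum(src[t] for src in estimates) for t, m in enumerate(mixture)
    ]
    # Give each source an equal share of the residual.
    return [
        [s + r / n_sources for s, r in zip(src, residual)]
        for src in estimates
    ]
```

The operation is linear in the estimates, so in a network it would be differentiable, and the same arithmetic carries over to complex-valued coefficients.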

“…The performance of speech enhancement has leaped significantly with the introduction of deep neural networks (DNNs) in recent years, e.g., [1][2][3][4][5][6][7][8][9][10][11][12]. This advance can be attributed to DNNs being unencumbered by explicit or implicit constraints on relevant data probability distributions, replacing these distributions with empirical observations, and by the ability of DNNs to capture complex relationships that cannot be expressed analytically.…”

Section: Introduction (mentioning)

confidence: 99%

“…It follows from (1) that a natural objective for enhancement is to find a good approximation of the clean speech waveform x based on the noisy observations y and available prior knowledge. Optimizing a network to minimize a measure of error between a clean signal estimate x̂ and the ground-truth x is a common approach that has yielded state of the art results based on DNN approaches operating in the time-frequency domain [2][3][4][5][6][7][8] or time-only domain [10].…”

Section: Introduction (mentioning)

confidence: 99%

Generative Speech Enhancement Based on Cloned Networks

Chinen, Kleijn, Lim, et al. 2019

2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)

Self Cite

We propose to implement speech enhancement by the regeneration of clean speech from a 'salient' representation extracted from the noisy signal. The network that extracts salient features is trained using a set of weight-sharing clones of the extractor network. The clones receive mel-frequency spectra of different noisy versions of the same speech signal as input. By encouraging the outputs of the clones to be similar for these different input signals, we train a feature extractor network that is robust to noise. At inference, the salient features form the input to a WaveNet network that generates a natural and clean speech signal with the same attributes as the ground-truth clean signal. As the signal becomes noisier, our system produces natural sounding errors that stay on the speech manifold, in place of traditional artifacts found in other systems. Our experiments confirm that our generative enhancement system provides state-of-the-art enhancement performance within the generative class of enhancers according to a MUSHRA-like test. The clones based system matches or outperforms the other systems at each input signal-to-noise (SNR) range with statistical significance.
