Monaural Source Separation: From Anechoic To Reverberant Environments

Cord-Landwehr, Tobias; Böddeker, Christoph; Neumann, Thilo von; Zorilă, Cătălin; Doddipatla, Rama; Haeb‐Umbach, Reinhold

doi:10.1109/iwaenc53105.2022.9914794

Cited by 24 publications

(8 citation statements)

References 24 publications


“…This approach allows direct optimization of the features for the given task. However, the choice of input features may depend on the acoustic conditions, and some have reported superior performance using STFT under challenging reverberant conditions [48] or using handcrafted filterbanks [49].…”

Section: E Considerations When Designing a TSE System (mentioning)

confidence: 99%

Neural Target Speech Extraction: An Overview

Žmolíková, Delcroix, Ochiai et al. 2023

Preprint


Humans can listen to a target speaker even in challenging acoustic conditions that have noise, reverberation, and interfering speakers. This phenomenon is known as the cocktail-party effect. For decades, researchers have focused on approaching the listening ability of humans. One critical issue is handling interfering speakers because the target and non-target speech signals share similar characteristics, complicating their discrimination. Target speech/speaker extraction (TSE) isolates the speech signal of a target speaker from a mixture of several speakers with or without noises and reverberations using clues that identify the speaker in the mixture. Such clues might be a spatial clue indicating the direction of the target speaker, a video of the speaker's lips, or a pre-recorded enrollment utterance from which their voice characteristics can be derived. TSE is an emerging field of research that has received increased attention in recent years because it offers a practical approach to the cocktail-party problem and involves such aspects of signal processing as audio, visual, array processing, and deep learning. This paper focuses on recent neural-based approaches and presents an in-depth overview of TSE. We guide readers through the different major approaches, emphasizing the similarities among frameworks and discussing potential future directions.



“…There are several choices for the training objective of the TS-SEP system. We opted for a time-domain reconstruction loss, since it implicitly accounts for phase information (see [30], [31] for a discussion of time- vs. frequency-domain reconstruction losses). To be specific, a time-domain signal reconstruction loss is computed by applying an inverse STFT to $\hat{X}_{t,f,k}$ to obtain $\hat{x}_{\ell,k}$ and measuring the logarithmic mean absolute error (LogMAE) from the ground truth $x_{\ell,k}$:…”

Section: B Training Schedule and Objective (mentioning)

confidence: 99%
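
The LogMAE objective named in the statement above can be illustrated with a minimal NumPy sketch. It assumes the inverse STFT has already been applied, so both inputs are time-domain arrays of shape (samples, speakers). The function name `log_mae`, the base-10 logarithm, the averaging over all samples and speakers, and the `eps` floor are assumptions made for illustration; the quoted passage ends before the formula, so the exact normalization used by the authors is not confirmed here.

```python
import numpy as np


def log_mae(estimate, reference, eps=1e-8):
    """Logarithmic mean absolute error between time-domain signals.

    Both inputs have shape (num_samples, num_speakers). The base-10
    logarithm, the mean over all entries and the `eps` floor are
    assumptions of this sketch, not details taken from the cited paper.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    if estimate.shape != reference.shape:
        raise ValueError("estimate and reference must have the same shape")
    # Mean absolute error over every sample and every speaker.
    mae = np.mean(np.abs(reference - estimate))
    # The logarithm compresses the loss range; eps avoids log(0).
    return float(np.log10(mae + eps))
```

Because the error is measured on waveforms rather than on STFT magnitudes, a phase error in the estimate raises the loss, which is the property the statement attributes to time-domain reconstruction losses.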

An Initialization Scheme for Meeting Separation with Spatial Mixture Models

Boeddeker, Cord-Landwehr, Neumann et al. 2022

Interspeech 2022


Since diarization and source separation of meeting data are closely related tasks, we here propose an approach to perform the two objectives jointly. It builds upon the target-speaker voice activity detection (TS-VAD) diarization approach, which assumes that initial speaker embeddings are available. We replace the final combined speaker activity estimation network of TS-VAD with a network that produces speaker activity estimates at a time-frequency resolution. Those act as masks for source extraction, either via masking or via beamforming. The technique can be applied both for single-channel and multi-channel input and, in both cases, achieves a new state-of-the-art word error rate (WER) on the LibriCSS meeting data recognition task. We further compute speaker-aware and speaker-agnostic WERs to isolate the contribution of diarization errors to the overall WER performance.


“…However, this performance still degrades heavily under noisy reverberant conditions [7]. This performance loss can be alleviated somewhat with careful hyper-parameter optimization but a significant performance gap still exists [8].…”

Section: Introduction (mentioning)

confidence: 99%

Deformable Temporal Convolutional Networks for Monaural Noisy Reverberant Speech Separation

Ravenscroft, Goetze, Hain 2023

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Speech separation models are used for isolating individual speakers in many speech processing applications. Deep learning models have been shown to lead to state-of-the-art (SOTA) results on a number of speech separation benchmarks. One such class of models known as temporal convolutional networks (TCNs) has shown promising results for speech separation tasks. A limitation of these models is that they have a fixed receptive field (RF). Recent research in speech dereverberation has shown that the optimal RF of a TCN varies with the reverberation characteristics of the speech signal. In this work deformable convolution is proposed as a solution to allow TCN models to have dynamic RFs that can adapt to various reverberation times for reverberant speech separation. The proposed models are capable of achieving an 11.1 dB average scale-invariant signal-to-distortion ratio (SISDR) improvement over the input signal on the WHAMR benchmark. A relatively small deformable TCN model of 1.3M parameters is proposed which gives comparable separation performance to larger and more computationally complex models.
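
For readers unfamiliar with the metric reported in this abstract, the following is a minimal sketch of how an SI-SDR improvement "over the input signal" is commonly computed. It uses the widely used definition of SI-SDR (project the estimate onto the reference, then compare the energy of the projection with the energy of the residual). The function names `si_sdr` and `si_sdr_improvement`, the zero-mean normalization, and the `eps` floor are assumptions of this sketch and are not confirmed by the abstract.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB for 1-D signals."""
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    # Removing the mean is a common convention (assumed here).
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scaling of the reference that best explains the estimate.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    residual = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return float(10.0 * np.log10(ratio))


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDR of the separated estimate minus SI-SDR of the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)
```

Rescaling the estimate by any nonzero constant leaves `si_sdr` unchanged, which is why the metric is called scale-invariant; the improvement figure simply subtracts the score of the unprocessed mixture from the score of the separated output.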

