In a multi-channel separation task with multiple speakers, we aim to recover all individual speech signals from the mixture. In contrast to single-channel approaches, which rely on the different spectro-temporal characteristics of the speech signals, multi-channel approaches can additionally exploit the different spatial locations of the sources for more powerful separation, especially when the number of sources increases. To enhance spatial processing in a multi-channel source separation scenario, in this work we propose a deep neural network (DNN)-based spatially selective filter (SSF) that can be spatially steered to extract the speaker of interest by initializing a recurrent neural network layer with the target direction. We compare the proposed SSF with a common end-to-end direct separation (DS) approach trained using utterance-wise permutation invariant training (PIT), which only implicitly learns to perform spatial filtering. We show that the SSF has a clear advantage over a DS approach with the same underlying network architecture when there are more than two speakers in the mixture, which can be attributed to better use of the spatial information. Furthermore, we find that the SSF generalizes much better to additional noise sources that were not seen during training.
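To make the steering mechanism concrete, the following is a minimal sketch (not the authors' implementation) of how a recurrent layer can be conditioned on a target direction by deriving its initial hidden state from a direction encoding; the module names, feature dimensions, and the one-hot direction grid are illustrative assumptions.

```python
# Hedged sketch: steering a separation network by initializing a GRU's hidden
# state with an encoding of the target direction of arrival (DOA).
# Shapes, layer sizes, and the one-hot DOA grid are assumptions for illustration.
import torch
import torch.nn as nn

class SpatiallySteeredRNN(nn.Module):
    def __init__(self, n_feats=514, hidden=256, n_directions=36, n_bins=257):
        super().__init__()
        # Map the target-direction encoding to the initial recurrent state.
        self.dir_to_h0 = nn.Linear(n_directions, hidden)
        self.gru = nn.GRU(n_feats, hidden, batch_first=True)
        # Example output head: a time-frequency mask for one reference channel.
        self.mask = nn.Linear(hidden, n_bins)

    def forward(self, feats, direction):
        # feats:     (batch, frames, n_feats)  spectral + spatial input features
        # direction: (batch, n_directions)     encoding of the target DOA
        h0 = torch.tanh(self.dir_to_h0(direction)).unsqueeze(0)  # (1, batch, hidden)
        out, _ = self.gru(feats, h0)          # recurrence starts from the steered state
        return torch.sigmoid(self.mask(out))  # mask extracting the target speaker

# Usage: steer toward one bin of a hypothetical 10°-resolution direction grid.
net = SpatiallySteeredRNN()
feats = torch.randn(1, 100, 514)              # one utterance, 100 frames
direction = torch.zeros(1, 36)
direction[0, 4] = 1.0                         # target DOA around 40°
mask = net(feats, direction)                  # (1, 100, 257)
```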