ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053092
Enhancing End-to-End Multi-Channel Speech Separation Via Spatial Feature Learning

Abstract: Hand-crafted spatial features (e.g., inter-channel phase difference, IPD) play a fundamental role in recent deep learning based multi-channel speech separation (MCSS) methods. However, these manually designed spatial features are hard to incorporate into the end-to-end optimized MCSS framework. In this work, we propose an integrated architecture for learning spatial features directly from the multi-channel speech waveforms within an end-to-end speech separation framework. In this architecture, time-domain filt…
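The hand-crafted spatial features the abstract refers to can be made concrete with a small sketch. Below is a minimal, illustrative computation of inter-channel phase differences (IPD) in NumPy; the function name, frame sizes, and the choice of channel 0 as reference are assumptions for illustration, not details from the paper.

```python
import numpy as np

def ipd_features(x, frame_len=256, hop=128):
    """Inter-channel phase difference (IPD) between channel 0 (reference)
    and every other channel of a multi-channel waveform.

    x: (num_channels, num_samples) array
    returns: (num_channels - 1, num_frames, frame_len // 2 + 1) array
    """
    C, T = x.shape
    F = 1 + (T - frame_len) // hop
    win = np.hanning(frame_len)
    # Per-channel short-time Fourier transform
    spec = np.empty((C, F, frame_len // 2 + 1), dtype=complex)
    for c in range(C):
        for t in range(F):
            spec[c, t] = np.fft.rfft(x[c, t * hop:t * hop + frame_len] * win)
    # IPD: phase of each non-reference channel minus the reference phase
    return np.angle(spec[1:]) - np.angle(spec[0])[None]
```

In practice, deep-learning MCSS systems usually feed cos/sin of the IPD to the network to avoid phase-wrapping discontinuities.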

Cited by 56 publications (26 citation statements)
References 21 publications
“…Considerable progress has been made towards solving the talker-independent speaker separation problem, since deep clustering (DC) [1] and permutation invariant training (PIT) [2] were proposed to address the label permutation problem. To further improve separation, subsequent studies leverage microphone array processing [3]-[6], magnitude- and complex-domain phase estimation [7], [8], time-domain processing [9], and extra information such as speaker embeddings [10] and visual cues [11]. On wsj0-2mix and 3mix [1], a popular benchmark dataset containing monaural anechoic two- and three-speaker mixtures, current state-of-the-art approaches produce separation results that sound almost indistinguishable from clean speech, and the performance improvement measured by scale-invariant signal-to-distortion ratio is more than 20 dB over no processing [12].…”
Section: Introduction
confidence: 99%
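The scale-invariant signal-to-distortion ratio (SI-SDR) cited above has a compact closed form: project the estimate onto the reference, then compare the energy of that projection against the residual. A minimal NumPy sketch follows; the mean removal and the function name are common conventions assumed here, not prescribed by the quoted work.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scaling of the reference toward the estimate
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference   # component explained by the reference
    noise = estimate - target    # everything else counts as distortion
    return 10 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))
```

Because of the projection step, rescaling the estimate leaves the score unchanged, which is why a >20 dB improvement is meaningful regardless of output gain.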
“…Recently, neural network based multi-channel speech separation approaches have achieved state-of-the-art performance by directly processing time-domain speech signals [12,14]. These systems incorporate a spectral encoder, a spatial encoder, a separator, and a decoder.…”
Section: Multi-channel End-to-end Extraction
confidence: 99%
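The four-component pipeline this snippet describes (spectral encoder, spatial encoder, separator, decoder) can be sketched with random stand-in weights just to show how the tensors flow. All layer shapes, the ReLU/sigmoid choices, and the simple cross-channel feature below are assumptions for illustration only, not the cited systems' actual learned designs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; real systems learn all filters end-to-end.
C, T = 4, 4000           # microphones, samples
L, hop, N = 40, 20, 64   # frame length, hop, encoder basis size
S = 2                    # sources to separate

x = rng.standard_normal((C, T))
F = (T - L) // hop + 1   # number of frames

def frame(sig):
    return np.stack([sig[t * hop:t * hop + L] for t in range(F)])  # (F, L)

# 1. Spectral encoder: learned 1-D conv basis over the reference channel
B_enc = rng.standard_normal((L, N))
spec = np.maximum(frame(x[0]) @ B_enc, 0.0)                 # (F, N)

# 2. Spatial encoder: cross-channel features, here a per-frame inner
#    product between the reference and each other channel
spat = np.stack([np.sum(frame(x[0]) * frame(x[c]), axis=1)
                 for c in range(1, C)], axis=1)              # (F, C-1)

# 3. Separator: maps concatenated features to one mask per source
feats = np.concatenate([spec, spat], axis=1)                 # (F, N + C - 1)
W_sep = rng.standard_normal((feats.shape[1], S * N))
masks = 1.0 / (1.0 + np.exp(-(feats @ W_sep)))               # sigmoid
masks = masks.reshape(F, S, N)

# 4. Decoder: masked encoder output back to waveforms via overlap-add
B_dec = rng.standard_normal((N, L))
out = np.zeros((S, T))
for s in range(S):
    rec_frames = (spec * masks[:, s]) @ B_dec                # (F, L)
    for t in range(F):
        out[s, t * hop:t * hop + L] += rec_frames[t]
```

The point of the end-to-end view is that `B_enc`, `W_sep`, and `B_dec` would all be trained jointly from waveforms, replacing hand-crafted spatial features with learned ones.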
“…In fact, one of the keys to TSE is still speech separation. To further enhance separation ability, many strategies for exploiting multi-channel information have recently been proposed, such as normalized cross-correlation (NCC) [15], transform-average-concatenate (TAC) [16], and inter-channel convolution difference (ICD) [17]. Therefore, how to effectively exploit multi-channel spatial information is crucial for TSE.…”
Section: Introduction
confidence: 99%
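As a concrete illustration of one strategy named above, here is a toy normalized cross-correlation (NCC) between two channel waveforms over a small lag window; the peak lag reflects the relative inter-channel delay. The lag grid, normalization, and function name are assumptions, and real systems compute such features per frame inside the network rather than globally.

```python
import numpy as np

def ncc(a, b, max_lag=16):
    """Normalized cross-correlation of two channels over integer lags
    in [-max_lag, max_lag]; the arg-max lag indicates the relative delay."""
    a = (a - a.mean()) / (a.std() + 1e-12)
    b = (b - b.mean()) / (b.std() + 1e-12)
    lags = np.arange(-max_lag, max_lag + 1)
    scores = []
    for lag in lags:
        if lag >= 0:
            x, y = a[lag:], b[:len(b) - lag]
        else:
            x, y = a[:lag], b[-lag:]
        scores.append(np.dot(x, y) / len(x))
    return lags, np.array(scores)
```

Since the delay between microphones encodes a source's direction of arrival, the location of the NCC peak is what makes this a useful spatial cue for separation.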