2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2018.8461639

Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation

Cited by 232 publications (172 citation statements). References 30 publications.
“…In the line of research on masking-based beamforming, earlier efforts [8], [10]- [14] only use DNN on spectral features to compute a mask for each microphone, and the estimated masks at different microphones are then pooled together to identify T-F units dominated by the same source across all the microphones for covariance matrix computation. Subsequent studies incorporate spatial features such as inter-channel phase differences (IPD) [15], [16], cosine and sine IPD, target direction compensated IPD [17], beamforming results [18], [19], and stacked phases and magnitudes [20], [21] as a way of leveraging spatial information to further improve mask estimation for beamforming. However, these studies aim at improving mask or magnitude estimation, and do not address phase estimation.…”
Section: Introduction
confidence: 99%
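
The mask-based beamforming pipeline described in this excerpt hinges on using a pooled time-frequency mask to estimate a source's spatial covariance matrix. The sketch below illustrates one common form of that computation; the array shapes, the median pooling of per-microphone masks, and the function name are illustrative assumptions here, not details taken from the cited works.

```python
import numpy as np

def masked_spatial_covariance(stft, mask):
    """Estimate a source's spatial covariance matrix from a pooled T-F mask.

    stft: complex STFT of the multi-channel mixture, shape (M, T, F)
          (M microphones, T frames, F frequency bins) -- assumed layout.
    mask: real-valued mask in [0, 1], shape (T, F), e.g. obtained by
          pooling per-microphone DNN masks (assumed already pooled).
    Returns: per-frequency covariance matrices, shape (F, M, M).
    """
    M, T, F = stft.shape
    cov = np.zeros((F, M, M), dtype=complex)
    for f in range(F):
        X = stft[:, :, f]                  # (M, T) channels at bin f
        w = mask[:, f]                     # (T,) mask weights
        weighted = X * w[np.newaxis, :]    # weight each frame by the mask
        cov[f] = weighted @ X.conj().T / max(w.sum(), 1e-8)
    return cov

# One possible pooling of per-microphone masks (shape (M, T, F)) into a
# single mask, as a placeholder for the pooling step mentioned above:
# pooled = np.median(masks, axis=0)
```
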
“…The conv2d kernel height is set as 2 to span a microphone pair. Note that different configurations of dilation d and stride s on the kernel height axis can extract ICDs from different pairs of signal channels, i.e., m_1 = 1 + (m − 1)s, m_2 = 2 + d + (m − 1)s. For example, for a 6-channel signal, setting dilation as 3 and stride as 1, we can obtain the three pairs of channels: (1, 4), (2, 5) and (3, 6).…”
Section: Spatial Feature Learning
confidence: 99%
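
To make the dilation/stride pair-selection rule concrete, here is a small sketch that enumerates the channel pairs covered by a height-2 kernel, assuming PyTorch-style dilation in which the two kernel taps sit `dilation` channels apart (this convention is an assumption and may differ from the indexing formula in the quote). With 6 channels, dilation 3 and stride 1 it reproduces the pairs (1, 4), (2, 5), (3, 6) given above.

```python
def channel_pairs(num_channels, dilation, stride):
    """Channel pairs spanned by a height-2 conv2d kernel sliding over the
    microphone (height) axis. Assumes the two kernel taps are `dilation`
    channels apart; channel indices are 1-based as in the quoted text.
    """
    pairs = []
    m = 1
    while True:
        m1 = 1 + (m - 1) * stride   # first kernel tap
        m2 = m1 + dilation          # second kernel tap
        if m2 > num_channels:
            break
        pairs.append((m1, m2))
        m += 1
    return pairs

# Reproduces the quoted example: [(1, 4), (2, 5), (3, 6)]
print(channel_pairs(6, dilation=3, stride=1))
```
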
“…Batch normalization (BN) is used in all the experiments to speed up the separation process. The microphone pairs for extracting IPDs and ICDs are (1, 4), (2, 5), (3, 6), (1, 2), (3, 4) and (5, 6) in all experiments. These pairs are selected because the distance between the microphones in each pair is either the largest or the smallest.…”
Section: Network and Training Details
confidence: 99%
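
For context, a minimal sketch of how cosine/sine IPD features might be extracted for the fixed microphone pairs listed in this excerpt; the STFT layout, the cos/sin encoding, and the omission of ICDs are assumptions made here for illustration, not details of the cited system.

```python
import numpy as np

# Microphone pairs from the quoted setup: the furthest and nearest pairs
# of a 6-element array (1-based indices as in the text).
PAIRS = [(1, 4), (2, 5), (3, 6), (1, 2), (3, 4), (5, 6)]

def ipd_features(stft, pairs=PAIRS):
    """Compute cosine/sine inter-channel phase differences (IPDs).

    stft: complex STFT of the mixture, shape (M, T, F) -- assumed layout.
    Returns: real-valued features of shape (2 * len(pairs), T, F).
    """
    feats = []
    for m1, m2 in pairs:
        phase_diff = np.angle(stft[m1 - 1]) - np.angle(stft[m2 - 1])
        feats.append(np.cos(phase_diff))   # cosine IPD
        feats.append(np.sin(phase_diff))   # sine IPD
    return np.stack(feats, axis=0)
```
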
“…Several methods have been proposed for multi-channel speech separation, including DPCL-based methods using integrated beamforming [10] or inter-channel spatial features [11], and a PIT-based method using a multi-speaker mask-based beamformer [12]. For multi-channel multi-speaker speech recognition, an end-to-end system was proposed in [13], called MIMO-Speech because of the multi-channel input (MI) and multi-speaker output (MO).…”
Section: Introduction
confidence: 99%