Combining Spectral and Spatial Features for Deep Learning Based Blind Speaker Separation
2019 · DOI: 10.1109/taslp.2018.2881912

Cited by 122 publications (55 citation statements)
References 41 publications

“…We simulated a spatialized reverberant dataset derived from the Wall Street Journal 0 (WSJ0) 2-mix corpus, an open and well-studied dataset used for speech separation [9, 16-18]. There are 20,000, 5,000 and 3,000 multi-channel, reverberant, two-speaker mixtures in the training, development and test sets respectively.…”
Section: Dataset (mentioning)
confidence: 99%
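As a rough sketch of how such a spatialized reverberant dataset can be simulated, the snippet below renders two dry utterances into a reverberant two-channel mixture with pyroomacoustics. The room geometry, absorption value and microphone layout are illustrative assumptions, not the citing paper's actual configuration.

```python
# A minimal sketch, assuming pyroomacoustics is available; the room size,
# absorption and microphone positions below are illustrative, not the
# citing paper's actual simulation setup.
import numpy as np
import pyroomacoustics as pra

def spatialize_pair(s1, s2, fs=16000):
    """Render two dry utterances as a reverberant two-channel mixture."""
    room = pra.ShoeBox([6.0, 5.0, 3.0], fs=fs,
                       materials=pra.Material(0.35), max_order=17)
    room.add_source([1.5, 2.0, 1.5], signal=s1)  # speaker 1
    room.add_source([4.5, 3.0, 1.5], signal=s2)  # speaker 2
    mics = np.c_[[3.0, 2.5, 1.5], [3.1, 2.5, 1.5]]  # two mics, 10 cm apart
    room.add_microphone_array(pra.MicrophoneArray(mics, fs))
    room.simulate()  # convolve each source with its RIRs and sum
    return room.mic_array.signals  # shape: (n_mics, n_samples)
```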
“…Step 1, Estimating the first DOA: In the first step we estimate the DOA of a first speaker using a neural network. The cosines and sines of the phase differences between all pairs of microphones [6,14], called cosine-sine interchannel phase difference (CSIPD) features, and the short-term magnitude spectrum of one of the channels (in the following, channel 1) are used as input features (see Section 3.2):…”
Section: Estimating the Sources (mentioning)
confidence: 99%
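The CSIPD features described in this excerpt can be computed directly from the multichannel STFT. The sketch below follows that description: cosines and sines of all pairwise inter-channel phase differences, stacked with the magnitude spectrum of channel 1. The STFT settings and the function name csipd_features are assumptions for illustration.

```python
# Sketch of the CSIPD-plus-magnitude input features described above; the
# STFT settings (512-point frames, 50% overlap) are assumptions.
import itertools
import numpy as np
from scipy.signal import stft

def csipd_features(x, fs=16000, nfft=512):
    """x: (n_mics, n_samples) -> (n_frames, n_features)."""
    _, _, X = stft(x, fs=fs, nperseg=nfft, noverlap=nfft // 2)
    phase = np.angle(X)                      # (n_mics, n_freq, n_frames)
    feats = [np.abs(X[0])]                   # magnitude spectrum of channel 1
    for i, j in itertools.combinations(range(x.shape[0]), 2):
        ipd = phase[i] - phase[j]            # phase difference of pair (i, j)
        feats += [np.cos(ipd), np.sin(ipd)]  # cosine-sine encoding (CSIPD)
    return np.concatenate(feats, axis=0).T   # time-major: (n_frames, n_features)
```

Encoding each phase difference by its cosine and sine avoids the 2π wrap-around discontinuity that raw phase differences would introduce at the network input.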
“…Computational auditory scene analysis [5] based systems cluster time-frequency bins dominated by the same source using cues such as pitch and interaural time and level differences. In [6], a neural network is trained using the phase differences of the multichannel short-time Fourier transform (STFT) as input features to learn such cues. Deep clustering [1] is then used to associate the cues to the right source.…”
Section: Introduction (mentioning)
confidence: 99%
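A minimal sketch of the deep clustering assignment step mentioned here: given embeddings predicted per time-frequency bin by such a network, k-means groups the bins by source and yields one binary mask per source. The embedding network itself is omitted; emb stands in for its output, and the shapes are assumptions.

```python
# Minimal sketch of the deep clustering assignment step, assuming `emb`
# holds the per-bin embeddings produced by the trained network.
import numpy as np
from sklearn.cluster import KMeans

def dc_masks(emb, n_src=2):
    """emb: (n_frames, n_freq, emb_dim) -> (n_src, n_frames, n_freq) binary masks."""
    T, F, D = emb.shape
    # Cluster all time-frequency bins in the embedding space.
    labels = KMeans(n_clusters=n_src, n_init=10).fit_predict(emb.reshape(-1, D))
    labels = labels.reshape(T, F)
    # One binary mask per cluster, i.e. per estimated source.
    return np.stack([(labels == k).astype(np.float32) for k in range(n_src)])
```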
“…There exist several unsupervised approaches for multichannel speech source separation, including independent component analysis based methods [6-8] and the local Gaussian model (LGM) based method [9]. Meanwhile, motivated by the strong capability of a deep neural network (DNN) to model the spectrogram of a speech signal, supervised approaches have received increasing attention [10-13].…”
Section: Introduction (mentioning)
confidence: 99%
“…Recently, speaker-independent multi-talker separation by mask-based beamforming using PIT was presented [10-12]. In these studies, loss functions designed for monaural speech enhancement/separation, such as the phase-sensitive approximation (PSA) [27], are employed during training.…”
Section: Introduction (mentioning)
confidence: 99%
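For concreteness, the sketch below combines utterance-level PIT with a PSA loss as mentioned in this excerpt: the PSA target is the source magnitude scaled by the cosine of the source-mixture phase gap, and the loss is minimized over all speaker permutations. Variable names and shapes are assumptions, not the cited papers' exact formulation.

```python
# Sketch of utterance-level PIT with a PSA loss; S and Y are complex STFTs
# of the reference sources and the mixture, M the estimated masks.
import itertools
import numpy as np

def pit_psa_loss(M, S, Y):
    """M, S: (n_src, n_freq, n_frames); Y: (n_freq, n_frames) complex."""
    # Phase-sensitive target: source magnitude scaled by the cosine of the
    # phase gap between source and mixture.
    target = np.abs(S) * np.cos(np.angle(S) - np.angle(Y))
    est = M * np.abs(Y)  # masked mixture magnitude
    # Evaluate the error under every speaker permutation and keep the best,
    # which resolves the output-to-speaker label ambiguity (PIT).
    losses = [np.mean((est[list(p)] - target) ** 2)
              for p in itertools.permutations(range(M.shape[0]))]
    return min(losses)
```

Enumerating all permutations stays cheap for the two-speaker case considered here (2! = 2 candidates).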