“…Considerable progress has been made towards solving the talker-independent speaker separation problem, since deep clustering (DC) [1] and permutation invariant training (PIT) [2] were proposed to address the label permutation problem. To further improve separation, subsequent studies leverage microphone array processing [3]-[6], magnitude- and complex-domain phase estimation [7], [8], time-domain processing [9], and extra information such as speaker embeddings [10] and visual cues [11]. On wsj0-2mix and 3mix [1], a popular benchmark dataset containing monaural anechoic two- and three-speaker mixtures, current state-of-the-art approaches produce separation results that sound almost indistinguishable from clean speech, and the performance improvement measured by scale-invariant signal-to-distortion ratio is more than 20 dB over no processing [12].…”