Interspeech 2019
DOI: 10.21437/interspeech.2019-1827
Probabilistic Permutation Invariant Training for Speech Separation

Abstract: Single-microphone, speaker-independent speech separation is normally performed in two steps: (i) separating the individual speech sources, and (ii) determining the best output-label assignment to compute the separation error. The second step is the main obstacle in training neural networks for speech separation. The recently proposed Permutation Invariant Training (PIT) addresses this problem by determining the output-label assignment that minimizes the separation error. In this study, we show that a major drawba…
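The two-step procedure the abstract describes can be sketched as follows. This is a toy illustration using a plain MSE on short sequences, not the paper's actual objective; `pit_assignment` is a hypothetical name for the exhaustive permutation search in step (ii):

```python
from itertools import permutations

def mse(a, b):
    """Mean squared error between two equally long signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def pit_assignment(estimates, targets):
    """Step (ii): search all output-label permutations and return the
    one with the smallest average separation error (toy MSE)."""
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(len(targets))):
        loss = sum(mse(estimates[p], targets[i])
                   for i, p in enumerate(perm)) / len(targets)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_perm, best_loss
```

With the estimates emitted in swapped order, the search recovers the correct assignment: `pit_assignment([[1, 1], [0, 0]], [[0, 0], [1, 1]])` returns `((1, 0), 0.0)`. Note the factorial cost in the number of speakers, which is why PIT is typically used with only two or three sources.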

Cited by 15 publications (12 citation statements). References 37 publications.
“…Row (a) is the baseline separation model TasNet-v2 [4] used throughout this work. Rows (b) and (c) are, respectively, the latest version of TasNet and our implementation of Prob-PIT [14] with TasNet-v2. Row (d) is our previous work on cross-domain joint clustering, which was not used here.…”
Section: Summary of the Results
confidence: 99%
“…Prob-PIT [7] considers the probabilities of all utterance level permutations, rather than just the best one, improving the initial training stage when wrong alignments are likely to happen. A similar idea is employed by Yang et al [8], who trained a Conv-TasNet with uPIT and fixed alignments in turns, reporting 17.5 dB SI-SNRi.…”
Section: Additional Discussion: Previous Results on WSJ0-2mix
confidence: 99%
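The idea attributed to Prob-PIT above, weighing all utterance-level permutations rather than only the best one, can be sketched as a soft minimum over permutation losses. The log-sum-exp form, the `gamma` temperature, and the plain MSE here are illustrative assumptions, not the paper's exact likelihood formulation:

```python
import math
from itertools import permutations

def mse(a, b):
    """Mean squared error between two equally long signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def soft_pit_loss(estimates, targets, gamma=1.0):
    """Soft minimum over all permutation losses: every alignment
    contributes gradient, which helps early training when the best
    permutation is still ambiguous. As gamma -> 0 this approaches
    the hard PIT minimum over permutations."""
    losses = []
    for perm in permutations(range(len(targets))):
        losses.append(sum(mse(estimates[p], targets[i])
                          for i, p in enumerate(perm)) / len(targets))
    m = min(losses)  # shift by the minimum to stabilize the log-sum-exp
    return m - gamma * math.log(
        sum(math.exp(-(l - m) / gamma) for l in losses) / len(losses))
```

With a large `gamma` the wrong alignments pull the objective up, while `gamma` close to zero recovers the hard-min PIT loss, so one plausible use is annealing `gamma` down over training.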
“…However, the permutation frequently changes across frames at inference time. Improvements to PIT roughly fall into two categories: (i) designing a permutation (or speaker) tracking algorithm for tPIT [2,4,5]; and (ii) designing better uPIT objectives to further strengthen permutation consistency [6][7][8][9]. Along these two lines, our work takes a close look at tPIT+clustering, a recent idea introduced by Deep CASA [2], which targets accurate frame-level separation (tPIT) and speaker tracking (clustering) in two stages.…”
Section: Introduction
confidence: 99%
“…In this work we use a multi-speaker, sentence-based corpus called GRID, which has been used in the monaural speech separation and recognition challenge [18]. This dataset has also been used in several studies [12,19] for overlapping speech detection and separation. The corpus contains 34 speakers, 16 female and 18 male, each narrating 1000 sentences.…”
Section: Problem Formulation
confidence: 99%