Probabilistic Permutation Invariant Training for Speech Separation

Yousefi, Midia; Khorram, Soheil; Hansen, John H. L.

doi:10.21437/interspeech.2019-1827

Cited by 15 publications

(12 citation statements)

References 37 publications

“…Row (a) is for the baseline separation model TasNet-v2 [4] we used throughout this work. Rows (b)(c) are respectively the latest version of TasNet and our implementation using Prob-PIT [14] with TasNet-v2. Row (d) is for our previous work of cross-domain joint clustering which was not used here at all.…”

Section: Summary Of the Results (mentioning)

confidence: 99%

Interrupted and Cascaded Permutation Invariant Training for Speech Separation

Yang, Wu, Mao et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Permutation Invariant Training (PIT) has long been a stepping stone method for training speech separation models in handling the label ambiguity problem. With PIT selecting the minimum cost label assignments dynamically, very few studies considered the separation problem to be optimizing both the model parameters and the label assignments, but focused on searching for good model architecture and parameters. In this paper, we investigate instead for a given model architecture the various flexible label assignment strategies for training the model, rather than directly using PIT. Surprisingly, we discover a significant performance boost compared to PIT is possible if the model is trained with fixed label assignments and a good set of labels is chosen. With fixed label training cascaded between two sections of PIT, we achieved the state-of-the-art performance on WSJ0-2mix without changing the model architecture at all.
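The schedule described in this abstract (a section of PIT, then training with fixed label assignments, then PIT again) can be illustrated with a small sketch. It is not the authors' implementation: the epoch counts, the names `stage_for_epoch`, `choose_assignment`, and `frozen`, and the choice to freeze each utterance's last assignment from the first PIT section are all assumptions made for illustration. The abstract does not say how the "good set of labels" is actually chosen.

```python
import itertools


def best_permutation(cost):
    """Return the output-to-target assignment with the lowest total cost.

    cost[i][j] is the cost of matching model output i to target j.
    """
    n = len(cost)
    return min(
        itertools.permutations(range(n)),
        key=lambda p: sum(cost[i][p[i]] for i in range(n)),
    )


def stage_for_epoch(epoch, pit_epochs=2, fixed_epochs=3):
    """Hypothetical schedule: PIT first, then fixed labels, then PIT again."""
    if epoch < pit_epochs:
        return "pit"
    if epoch < pit_epochs + fixed_epochs:
        return "fixed"
    return "pit"


def choose_assignment(epoch, utt_id, cost, frozen, pit_epochs=2, fixed_epochs=3):
    """Pick the label assignment used to compute the loss for one utterance."""
    stage = stage_for_epoch(epoch, pit_epochs, fixed_epochs)
    if stage == "fixed" and utt_id in frozen:
        # Fixed-label section: reuse the stored assignment, ignore current costs.
        return frozen[utt_id]
    perm = best_permutation(cost)
    if epoch < pit_epochs:
        # Assumption: remember the latest choice made in the first PIT section.
        frozen[utt_id] = perm
    return perm
```

In this sketch the stored assignment stays in force during the fixed section even when the current costs would favour a different permutation, which is the behavioural difference from plain PIT.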

“…Prob-PIT [7] considers the probabilities of all utterance level permutations, rather than just the best one, improving the initial training stage when wrong alignments are likely to happen. A similar idea is employed by Yang et al [8], who trained a Conv-TasNet with uPIT and fixed alignments in turns, reporting 17.5 dB SI-SNRi.…”

Section: Additional Discussion: Previous Results On WSJ0-2mix (mentioning)

confidence: 99%
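The contrast drawn in the statement above, between keeping only the best utterance-level permutation and weighing all of them, can be sketched as a hard minimum versus a soft minimum over permutation costs. This is a minimal illustration, not the Prob-PIT paper's exact objective: the mean-squared-error cost, the function names, and the smoothing parameter `gamma` are assumptions chosen for the example.

```python
import itertools
import math


def mse(a, b):
    """Mean squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def permutation_costs(estimates, targets):
    """Total utterance-level cost for every assignment of estimates to targets."""
    n = len(targets)
    return [
        sum(mse(estimates[i], targets[perm[i]]) for i in range(n))
        for perm in itertools.permutations(range(n))
    ]


def pit_loss(estimates, targets):
    """Standard PIT: only the cheapest permutation contributes (hard minimum)."""
    return min(permutation_costs(estimates, targets))


def soft_pit_loss(estimates, targets, gamma=1.0):
    """Soft minimum over all permutations: -gamma * log(sum(exp(-cost / gamma))).

    `gamma` is a hypothetical smoothing parameter; small values approach pit_loss.
    """
    costs = permutation_costs(estimates, targets)
    lowest = min(costs)
    # Subtract the lowest cost before exponentiating for numerical stability.
    return lowest - gamma * math.log(
        sum(math.exp(-(c - lowest) / gamma) for c in costs)
    )


# Usage: the estimates are the targets in swapped order.
targets = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
estimates = [targets[1], targets[0]]
hard = pit_loss(estimates, targets)
soft = soft_pit_loss(estimates, targets, gamma=1.0)
```

Because every permutation contributes to the soft version, a wrong early alignment does not fully determine the training signal, which matches the motivation given in the statement. The soft value is never larger than the hard minimum.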

“…However, the permutation frequently changes over frames at inference time. Improvements to PIT roughly fall into two categories: (i) designing a permutation (or speaker) tracking algorithm for tPIT [2,4,5]; and (ii) designing better uPIT objectives to further strengthen permutation consistency [6][7][8][9]. Along these two lines, our work takes a close look at tPIT+clustering, a recent idea introduced by Deep CASA [2], that targets at accurate frame level separation (tPIT) and speaker tracking (clustering) in two stages.…”

Section: Introduction (mentioning)

confidence: 99%

On Permutation Invariant Training For Speech Source Separation

Liu, Pons 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

We study permutation invariant training (PIT), which targets at the permutation ambiguity problem for speaker independent source separation models. We extend two state-of-the-art PIT strategies. First, we look at the two-stage speaker separation and tracking algorithm based on frame level PIT (tPIT) and clustering, which was originally proposed for the STFT domain, and we adapt it to work with waveforms and over a learned latent space. Further, we propose an efficient clustering loss scalable to waveform models. Second, we extend a recently proposed auxiliary speaker-ID loss with a deep feature loss based on "problem agnostic speech features", to reduce the local permutation errors made by the utterance level PIT (uPIT). Our results show that the proposed extensions help reducing permutation ambiguity. However, we also note that the studied STFT-based models are more effective at reducing permutation errors than waveform-based models, a perspective overlooked in recent studies.

“…In this work we use a multi-speaker, sentence-based corpus called GRID, which has been used in monaural speech separation and recognition challenge [18]. Also, this dataset has been used in several studies [12,19] for overlapping speech detection and separation. This corpus contains 34 speakers, 16 female and 18 male speakers, each narrating 1000 sentence.…”

Section: Problem Formulation (mentioning)

confidence: 99%

Frame-Based Overlapping Speech Detection Using Convolutional Neural Networks

Yousefi, Hansen 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

Naturalistic speech recordings usually contain speech signals from multiple speakers. This phenomenon can degrade the performance of speech technologies due to the complexity of tracing and recognizing individual speakers. In this study, we investigate the detection of overlapping speech on segments as short as 25 ms using Convolutional Neural Networks. We evaluate the detection performance using different spectral features, and show that pyknogram features outperform other commonly used speech features. The proposed system can predict overlapping speech with an accuracy of 84% and F-score of 88% on a dataset of mixed speech generated based on the GRID dataset.
