An Empirical Study of Conv-TasNet

Kadıoğlu, Berkan; Horgan, Michael A.; Liu, Xiaoyu; Pons, Jordi; Darcy, Dan; Kumar, Vivek

doi:10.1109/icassp40776.2020.9054721

“…The training set was created by randomly mixing utterances from 100 speakers at randomly selected SNRs between 0 and 5 dB. Previous works found that models trained on WSJ0-2mix might not generalize well to other datasets [13,29]. To also evaluate model generalization, we tested our models not only on the WSJ0-2mix test set (16 unseen speakers), but also on the recently released Libri-2mix (40 speakers) and VCTK-2mix (108 speakers) test sets [29].…”

Section: Methods (mentioning)

confidence: 99%
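The mixing step described in the statement above can be illustrated with a minimal sketch. It assumes the SNR is drawn uniformly from 0 to 5 dB (the quoted text only says "randomly selected") and uses short toy lists in place of real utterances; the helper names `mean_power` and `mix_at_snr` are hypothetical and not taken from the cited work.

```python
import math
import random


def mean_power(signal):
    """Average power of a sequence of samples."""
    return sum(v * v for v in signal) / len(signal)


def mix_at_snr(target, interferer, snr_db):
    """Scale `interferer` so the target-to-interferer power ratio equals
    `snr_db` (in dB), then add the two signals sample by sample."""
    gain = math.sqrt(
        mean_power(target) / (mean_power(interferer) * 10 ** (snr_db / 10))
    )
    scaled = [gain * v for v in interferer]
    mixture = [a + b for a, b in zip(target, scaled)]
    return mixture, scaled


# Toy signals standing in for two utterances (hypothetical data).
s1 = [1.0, -1.0, 1.0, -1.0]
s2 = [2.0, 2.0, -2.0, -2.0]

# Assumption: a uniform draw in [0, 5] dB.
snr_db = random.uniform(0.0, 5.0)
mixture, s2_scaled = mix_at_snr(s1, s2, snr_db)

# Check the SNR that was actually achieved after scaling.
achieved = 10 * math.log10(mean_power(s1) / mean_power(s2_scaled))
```

Here `achieved` should match `snr_db` up to floating-point error, since only the interfering signal is rescaled before summation.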

“…We also explore another promising method based on uPIT+speaker-ID loss [9], that introduces an additional deep feature loss term (speaker-ID) to help uPIT reducing local speaker swaps. In this paper, we extend these two training strategies for Conv-TasNet [10], a fully convolutional version of TasNet [10][11][12][13] that models speaker separation in the waveform domain.…”

Section: Introduction (mentioning)

confidence: 99%

On Permutation Invariant Training For Speech Source Separation

Liu¹, Pons²

2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

We study permutation invariant training (PIT), which targets the permutation ambiguity problem for speaker-independent source separation models. We extend two state-of-the-art PIT strategies. First, we look at the two-stage speaker separation and tracking algorithm based on frame-level PIT (tPIT) and clustering, which was originally proposed for the STFT domain, and we adapt it to work with waveforms and over a learned latent space. Further, we propose an efficient clustering loss scalable to waveform models. Second, we extend a recently proposed auxiliary speaker-ID loss with a deep feature loss based on "problem agnostic speech features", to reduce the local permutation errors made by the utterance-level PIT (uPIT). Our results show that the proposed extensions help reduce permutation ambiguity. However, we also note that the studied STFT-based models are more effective at reducing permutation errors than waveform-based models, a perspective overlooked in recent studies.
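As a rough illustration of the utterance-level PIT idea mentioned in the abstract above, the sketch below scores every assignment of estimated sources to reference sources over the whole utterance and keeps the cheapest one. The pairwise loss is assumed to be mean squared error purely for simplicity (the abstract does not state the objective), and the names `mse` and `upit_loss` are hypothetical.

```python
import itertools
import math


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def upit_loss(estimates, targets, pair_loss=mse):
    """Utterance-level PIT: try every permutation of the targets and
    return the lowest average loss with the permutation achieving it.
    perm[i] is the index of the target assigned to estimates[i]."""
    n = len(targets)
    best_loss, best_perm = math.inf, None
    for perm in itertools.permutations(range(n)):
        loss = sum(pair_loss(estimates[i], targets[p])
                   for i, p in enumerate(perm)) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example: the model outputs the two sources in swapped order.
targets = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
estimates = [targets[1], targets[0]]
loss, perm = upit_loss(estimates, targets)
```

Because one permutation is chosen for the entire utterance, the output order cannot change between frames during training. A frame-level variant (tPIT) would instead apply the same search to each frame separately, which is why it needs a later tracking or clustering stage. The exhaustive search grows factorially with the number of sources, so it is only practical for a small number of speakers.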

“…It is challenging to define the reference objects for object-based supervised learning. The problem of permutation ambiguity, described in the speech [7][8][9] and universal [3,10] source separation literature, also arises here. The output to ground…”

Section: Introduction (mentioning)

confidence: 99%

Multichannel-based Learning for Audio Object Extraction

Arteaga¹, Pons²

2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

The current paradigm for creating and deploying immersive audio content is based on audio objects, which are composed of an audio track and position metadata. While rendering an object-based production into a multichannel mix is straightforward, the reverse process involves sound source separation and estimating the spatial trajectories of the extracted sources. Besides, cinematic object-based productions are often composed of dozens of simultaneous audio objects, which poses a scalability challenge for audio object extraction. Here, we propose a novel deep learning approach to object extraction that learns from the multichannel renders of object-based productions, instead of directly learning from the audio objects themselves. This approach allows tackling the object scalability challenge and also offers the possibility to formulate the problem in a supervised or an unsupervised fashion. Since, to our knowledge, no other works have previously addressed this topic, we first define the task and propose an evaluation methodology, and then discuss under what circumstances our methods outperform the proposed baselines.

An Empirical Study of Conv-TasNet

Cited by 31 publications

References 26 publications
