ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054721
An Empirical Study of Conv-Tasnet

Abstract: Conv-TasNet is a recently proposed waveform-based deep neural network that achieves state-of-the-art performance in speech source separation. Its architecture consists of a learnable encoder/decoder and a separator that operates on top of this learned space. Various improvements have been proposed to Conv-TasNet. However, they mostly focus on the separator, leaving its encoder/decoder as a (shallow) linear operator. In this paper, we conduct an empirical study of Conv-TasNet and propose an enhancement to the e…
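The abstract describes Conv-TasNet's encoder as a shallow linear operator: a strided 1-D convolution that projects short overlapping waveform frames onto a set of learned basis filters. A minimal pure-Python sketch of that operation, with random numbers standing in for the learned weights and all names and sizes purely illustrative (not taken from the paper):

```python
# Illustrative sketch of a linear (1-D conv) encoder over a waveform.
# Random values stand in for learned basis filters; sizes are toy values.
import random

def linear_encoder(waveform, basis, win, stride):
    """Encode a waveform as frame-by-filter inner products (a strided 1-D conv)."""
    frames = [waveform[i:i + win]
              for i in range(0, len(waveform) - win + 1, stride)]
    # Each latent vector holds the frame's dot product with every basis filter.
    return [[sum(f * b for f, b in zip(frame, filt)) for filt in basis]
            for frame in frames]

random.seed(0)
win, stride, n_filters = 16, 8, 4   # toy sizes for illustration only
basis = [[random.gauss(0, 1) for _ in range(win)] for _ in range(n_filters)]
wave = [random.gauss(0, 1) for _ in range(64)]
latent = linear_encoder(wave, basis, win, stride)
print(len(latent), len(latent[0]))  # → 7 4  (frames × filters)
```

The separator then works in this latent space, and a matching (transposed-convolution) decoder maps masked latents back to waveforms.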

Cited by 31 publications (43 citation statements)
References 26 publications
“…The training set was created by randomly mixing utterances from 100 speakers at randomly selected SNRs between 0 and 5 dB. Previous works found that models trained on WSJ0-2mix might not generalize well to other datasets [13,29]. To also evaluate model generalization, we tested our models not only on the WSJ0-2mix test set (16 unseen speakers), but also on the recently released Libri-2mix (40 speakers) and VCTK-2mix (108 speakers) test sets [29].…”
Section: Methods
confidence: 99%
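The mixing procedure quoted above (pairs of utterances summed at an SNR drawn uniformly from 0 to 5 dB) can be sketched as follows; this is a hedged illustration, not the datasets' own scripts, and all names are made up:

```python
# Sketch: mix two utterances at a target SNR drawn uniformly from [0, 5] dB.
# s1 is kept fixed; s2 is rescaled so the pair meets the target SNR, then summed.
import math
import random

def mix_at_snr(s1, s2, snr_db):
    """Scale s2 so that 10*log10(P1 / P2') equals snr_db, then return s1 + s2'."""
    p1 = sum(x * x for x in s1) / len(s1)
    p2 = sum(x * x for x in s2) / len(s2)
    gain = math.sqrt(p1 / (p2 * 10 ** (snr_db / 10)))
    return [a + gain * b for a, b in zip(s1, s2)]

random.seed(0)
s1 = [random.gauss(0, 1) for _ in range(1000)]
s2 = [random.gauss(0, 1) for _ in range(1000)]
snr = random.uniform(0, 5)          # random SNR between 0 and 5 dB
mixture = mix_at_snr(s1, s2, snr)
```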
“…We also explore another promising method based on uPIT+speaker-ID loss [9], which introduces an additional deep feature loss term (speaker-ID) to help uPIT reduce local speaker swaps. In this paper, we extend these two training strategies for Conv-TasNet [10], a fully convolutional version of TasNet [10][11][12][13] that models speaker separation in the waveform domain.…”
Section: Introduction
confidence: 99%
“…It is challenging to define the reference objects for object-based supervised learning. The problem of permutation ambiguity, described in the speech [7][8][9] and universal [3,10] source separation literature, also arises here. The output to ground…”
Section: Introduction
confidence: 99%
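The permutation ambiguity mentioned above is commonly handled with permutation-invariant training (PIT): score every assignment of model outputs to reference sources and train on the best one. A minimal sketch, using MSE as a stand-in loss (Conv-TasNet itself optimizes SI-SNR):

```python
# Minimal permutation-invariant training (PIT) loss sketch.
# MSE is a stand-in objective; real separation systems use e.g. SI-SNR.
from itertools import permutations

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def pit_loss(estimates, references):
    """Return the lowest mean pairwise loss over all output-to-source
    permutations, together with the permutation that achieved it."""
    return min(
        ((sum(mse(estimates[i], ref) for i, ref in zip(perm, references))
          / len(references), perm)
         for perm in permutations(range(len(estimates)))),
        key=lambda t: t[0],
    )

refs = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
ests = [[0.1, 0.9, 0.0], [0.9, 0.1, 1.0]]   # outputs arrived swapped
loss, perm = pit_loss(ests, refs)
print(perm)  # → (1, 0): the swapped assignment scores best
```

Because the loss is minimized over assignments, the network is never penalized for emitting the sources in an arbitrary order.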