Interspeech 2021
DOI: 10.21437/interspeech.2021-1243
Teacher-Student MixIT for Unsupervised and Semi-Supervised Speech Separation

Abstract: In this paper, we introduce a novel semi-supervised learning framework for end-to-end speech separation. The proposed method first uses mixtures of unseparated sources and the mixture invariant training (MixIT) criterion to train a teacher model. The teacher model then estimates separated sources that are used to train a student model with standard permutation invariant training (PIT). The student model can be fine-tuned with supervised data, i.e., paired artificial mixtures and clean speech sources, and furth…
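The abstract's student stage relies on permutation invariant training: because source separation has no canonical output ordering, the loss is taken as the minimum over all assignments of estimated sources to targets. The following is a minimal illustrative sketch of a PIT loss in NumPy; the function name and the use of MSE (rather than the SNR-based losses common in this literature) are assumptions for clarity, not the paper's exact objective.

```python
import itertools
import numpy as np

def pit_mse(estimates, references):
    """Permutation invariant MSE: minimum mean-squared error over all
    permutations assigning estimated sources to reference sources.

    estimates, references: arrays of shape (n_sources, n_samples).
    """
    n = len(references)
    best = float("inf")
    for perm in itertools.permutations(range(n)):
        # Reorder the estimates according to this candidate assignment
        # and score it against the references.
        loss = np.mean((estimates[list(perm)] - references) ** 2)
        best = min(best, loss)
    return best
```

The exhaustive search over permutations is O(n!), which is acceptable for the small source counts (2-4) typical in speech separation; larger counts would call for a Hungarian-algorithm assignment instead.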

Cited by 13 publications (2 citation statements)
References 26 publications
“…As the model is trained to separate the MOMs into a variable number of latent sources, the separated sources can be remixed to approximate the original mixtures. Motivated by MixIT, the authors of [31] proposed teacher-student MixIT (TS-MixIT) to alleviate the over-separation problem of the original MixIT. It takes the unsupervised model trained with MixIT as a teacher model; the estimated sources are then filtered and selected as pseudo-targets to train a student model using standard permutation invariant training (PIT) [3].…”
Section: Introduction
confidence: 99%
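The remixing step this statement describes is the core of the MixIT criterion: the model separates a mixture of mixtures (MOM), and the loss is minimized over all ways of assigning each estimated source to one of the two input mixtures. A minimal sketch under illustrative assumptions (MSE loss, two input mixtures, exhaustive search over binary mixing matrices):

```python
import itertools
import numpy as np

def mixit_mse(estimates, mixtures):
    """MixIT-style loss: minimum MSE over all binary mixing matrices A
    that assign each of the m estimated sources to exactly one of the
    two input mixtures, so that A @ estimates approximates the mixtures.

    estimates: array of shape (m, n_samples); mixtures: (2, n_samples).
    """
    m = len(estimates)
    best = float("inf")
    for assign in itertools.product(range(2), repeat=m):
        # Build the binary mixing matrix for this assignment:
        # column j has a single 1 in the row of its assigned mixture.
        A = np.zeros((2, m))
        A[list(assign), range(m)] = 1.0
        remix = A @ estimates
        best = min(best, np.mean((remix - mixtures) ** 2))
    return best
```

Because the loss only constrains the remixed sums, a model can split a single speaker across several outputs without penalty, which is the over-separation problem that the filtered pseudo-targets in TS-MixIT are meant to address.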
“…Mixup [32]) but have also been successfully applied to several audio tasks [33], [34]. In [35], a student model with a smaller number of estimated sources has been trained on a subset of the outputs of a pre-trained MixIT model to address the input SNR distribution mismatch. Furthermore, a student model can also perform test-time adaptation by using the teacher's estimated waveforms as targets [36].…”
confidence: 99%
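The "subset of outputs" idea above — a teacher that over-separates feeding a student with fewer output slots — implies some rule for selecting which teacher estimates become pseudo-targets. One simple, purely illustrative rule (not the paper's exact criterion) is to keep the k highest-energy teacher outputs:

```python
import numpy as np

def select_pseudo_targets(teacher_estimates, k):
    """Keep the k highest-energy teacher outputs as pseudo-targets for
    the student. This energy-based filter is an illustrative stand-in
    for whatever selection criterion the cited work actually uses.

    teacher_estimates: array of shape (m, n_samples); returns (k, n_samples).
    """
    energies = np.sum(teacher_estimates ** 2, axis=1)
    # Indices of the k largest-energy sources, returned in original order.
    top = np.argsort(energies)[::-1][:k]
    return teacher_estimates[np.sort(top)]
```

The student can then be trained on these filtered targets with standard PIT, since the pseudo-target count now matches its number of output slots.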