LibriMix: An Open-Source Dataset for Generalizable Speech Separation

Cosentino, Joris; Pariente, Manuel; Cornell, Samuele; Deleforge, Antoine

doi:10.48550/arxiv.2005.11262

Cited by 47 publications

(78 citation statements)

References 25 publications


“…S2VC includes several self-supervised representations, and here we adopted CPC [13] version since it was reported to perform the best. For SE, we chose off-the-shelf models pre-trained on different datasets: DEMUCS, on Valentini [14] and DNS [15]; MetricGAN+ [16], on VoiceBank-DEMAND [17]; and Conv-TasNet [18], on LibriMix [19].…”

Section: Models (mentioning)

confidence: 99%

Toward Degradation-Robust Voice Conversion

Huang, Chang, Lee

2021

Preprint

Any-to-any voice conversion technologies convert the vocal timbre of an utterance to any speaker even unseen during training. Although there have been several state-of-the-art any-to-any voice conversion models, they were all based on clean utterances to convert successfully. However, in real-world scenarios, it is difficult to collect clean utterances of a speaker, and they are usually degraded by noises or reverberations. It thus becomes highly desired to understand how these degradations affect voice conversion and build a degradation-robust model. We report in this paper the first comprehensive study on the degradation robustness of any-to-any voice conversion. We show that the performance of state-of-the-art models nowadays was severely hampered given degraded utterances. To this end, we then propose speech enhancement concatenation and denoising training to improve the robustness. In addition to common degradations, we also consider adversarial noises, which alter the model output significantly yet are human-imperceptible. It was shown that both concatenations with off-the-shelf speech enhancement models and denoising training on voice conversion models could improve the robustness, while each of them had pros and cons.

“…Libri2Mix [2]. This dataset was constructed using train-100, train-360, dev, and test set in the LibriSpeech dataset [25].…”

Section: Dataset (mentioning)

confidence: 99%

Speech Separation Using an Asynchronous Fully Recurrent Convolutional Neural Network

Hu, Li, Zhang, et al.

2021

Preprint

Recent advances in the design of neural network architectures, in particular those specialized in modeling sequences, have provided significant improvements in speech separation performance. In this work, we propose to use a bio-inspired architecture called Fully Recurrent Convolutional Neural Network (FRCNN) to solve the separation task. This model contains bottom-up, top-down and lateral connections to fuse information processed at various time-scales represented by stages. In contrast to the traditional approach updating stages in parallel, we propose to first update the stages one by one in the bottom-up direction, then fuse information from adjacent stages simultaneously and finally fuse information from all stages to the bottom stage together. Experiments showed that this asynchronous updating scheme achieved significantly better results with much fewer parameters than the traditional synchronous updating scheme. In addition, the proposed model achieved good balance between speech separation accuracy and computational efficiency as compared to other state-of-the-art models on three benchmark datasets.

“…Cross-domain SS and TSE tasks: the English Libri2Mix [24] and Mandarin Aishell2Mix are used as the supervised source domain and unsupervised target domain dataset, respectively. Each mixture in Aishell2Mix is generated by mixing two speakers' utterances from Aishell-1 [25].…”

Section: Task Construction (mentioning)

confidence: 99%

“…On the noisy and reverberant in-domain LibriSpeech dataset [23], the proposed DPCCN achieves more than 1.4 dB absolute SISNR improvement over all listed state-of-the-art time-domain speech separation methods. For the cross-domain speech separation and extraction tasks, we evaluate the proposed approaches on the clean Libri2Mix [24] and Aishell2Mix that created by ourselves from Aishell-1 [25] corpus. Extensive results show that the DPCCN-based systems are much more robust and achieve much better performance than baselines.…”

Section: Introduction (mentioning)

confidence: 99%

DPCCN: Densely-Connected Pyramid Complex Convolutional Network for Robust Speech Separation And Extraction

Han, Long, Burget, et al.

2021

Preprint

In recent years, more and more time-domain speech separation methods have been proposed. However, most of them are very sensitive to the environments and wide domain coverage tasks. In this paper, from the time-frequency domain perspective, we propose a densely-connected pyramid complex convolutional network, termed DPCCN, to improve the robustness of speech separation under complicated conditions. Furthermore, we generalize the DPCCN to target speech extraction (TSE) by integrating a new specially designed speaker encoder. Moreover, we also investigate the robustness of DPCCN to unsupervised cross-domain TSE tasks. A Mixture-Remix approach is proposed to adapt the target domain acoustic characteristics for fine-tuning the source model. We evaluate the proposed methods not only under noisy and reverberant in-domain condition, but also in clean but cross-domain conditions. Results show that either for speech separation or extraction, the DPCCN-based systems achieve much better performance and stronger robustness than the current dominant time-domain methods, especially for the cross-domain tasks. And particularly, we find that the Mixture-Remix fine-tuning with DPCCN significantly outperforms the TD-SpeakerBeam for unsupervised cross-domain TSE, with around 3.5 dB performance improvement on target domain test set while without any source domain performance degradation.
