2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
DOI: 10.1109/waspaa52581.2021.9632783
Separate But Together: Unsupervised Federated Learning for Speech Enhancement from Non-IID Data

Abstract: We propose FEDENHANCE, an unsupervised federated learning (FL) approach for speech enhancement and separation with non-IID distributed data across multiple clients. We simulate a real-world scenario where each client only has access to a few noisy recordings from a limited and disjoint number of speakers (hence non-IID). Each client trains their model in isolation using mixture invariant training while periodically providing updates to a central server. Our experiments show that our approach achieves competitiv…
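The periodic client updates described in the abstract follow the usual federated aggregation pattern: the server combines locally trained weights into a new global model. Below is a minimal sketch of one such round, assuming a FedAvg-style weighted average; `federated_average`, the toy parameter vectors, and the size-proportional weighting are illustrative assumptions, not the paper's exact aggregation rule.

```python
# Minimal sketch of one federated averaging round, assuming each client
# returns an updated parameter vector after local (unsupervised) training.
import numpy as np

def federated_average(client_weights, client_sizes):
    """Weighted average of client parameter vectors (FedAvg-style)."""
    total = sum(client_sizes)
    stacked = np.stack(client_weights)        # (num_clients, num_params)
    coeffs = np.array(client_sizes) / total   # weight clients by data size
    return (coeffs[:, None] * stacked).sum(axis=0)

# Usage: three clients with disjoint (non-IID) data and unequal dataset sizes.
rng = np.random.default_rng(0)
server_w = rng.standard_normal(8)
clients = [server_w + 0.01 * rng.standard_normal(8) for _ in range(3)]
server_w = federated_average(clients, client_sizes=[120, 40, 80])
print(server_w)
```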

Cited by 14 publications (9 citation statements) | References 31 publications
“…If the noise sources are independent of each other and of the clean speech component, then the model can learn to minimize this loss by reconstructing the mixture using its first estimated slot and either one of the two available noise slots. Although MixIT has proven effective for various simulated speech enhancement setups [29], [30], the assumption of having access to a diverse set of in-domain noise recordings from D_n which aptly captures the true distribution of the present background noises D_n^* makes it impractical for many real-world settings. To this end, other works [24], [31] have tried to deal with the distribution shift between the on-hand noise dataset D_n and the actual noise distribution D_n^* in order to avoid the need for in-domain noise samples.…”
Section: B. Mixture Invariant Training (MixIT)
Confidence: 99%
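The slot-assignment idea in the quote above is the core of MixIT: the model separates a mixture of two mixtures into several slots, and the loss is taken over the best binary assignment of slots back to the two input mixtures. The sketch below enumerates those assignments with a plain MSE reconstruction error; an actual MixIT implementation would typically use a negative-SNR loss, and `mixit_loss` is a hypothetical helper name.

```python
# A minimal sketch of the MixIT objective: `est_sources` are the model's
# output slots for the mixture-of-mixtures mix1 + mix2. We enumerate all
# 2**M binary assignments of slots to mixtures and keep the cheapest one.
import itertools
import numpy as np

def mixit_loss(est_sources, mix1, mix2):
    num_slots = est_sources.shape[0]
    best = np.inf
    for assignment in itertools.product([0, 1], repeat=num_slots):
        a = np.array(assignment)[:, None]             # (M, 1) slot -> mixture
        rec1 = ((1 - a) * est_sources).sum(axis=0)    # slots assigned to mix1
        rec2 = (a * est_sources).sum(axis=0)          # slots assigned to mix2
        err = np.mean((rec1 - mix1) ** 2) + np.mean((rec2 - mix2) ** 2)
        best = min(best, err)
    return best

# Toy usage: 3 output slots, 1-second signals at 8 kHz. With idealized
# estimates, assigning (speech + noise_a) to mix1 drives the loss to ~0,
# mirroring the "first slot plus either noise slot" argument above.
rng = np.random.default_rng(0)
speech, noise_a, noise_b = rng.standard_normal((3, 8000))
mix1, mix2 = speech + noise_a, noise_b
slots = np.stack([speech, noise_a, noise_b])
print(mixit_loss(slots, mix1, mix2))
```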
“…The clean speech samples are drawn from the LibriSpeech [51] corpus and the noise recordings are taken from FSD50K [52], representing a set of almost 200 classes of background noises after excluding all human-made sounds from the AudioSet ontology [53]. A detailed recipe of the dataset generation process is presented in [30]. LFSD becomes an ideal candidate for semi-supervised/SSL teacher pre-training on OOD data given its mixture diversity.…”
Section: DNS-Challenge (DNS)
Confidence: 99%
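The dataset generation described above boils down to pairing a clean speech clip with a noise clip at a chosen SNR. The sketch below shows that mixing step; the SNR range and the random signals standing in for real LibriSpeech/FSD50K audio are illustrative assumptions, and the authoritative recipe is the one released with [30].

```python
# A rough sketch of the mixture-generation step: scale a noise clip so the
# speech-to-noise power ratio matches a target SNR, then add it to the speech.
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals `snr_db`."""
    if len(noise) < len(speech):                   # loop short noise clips
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    speech_pow = np.mean(speech ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(speech_pow / (noise_pow * 10 ** (snr_db / 10)))
    return speech + gain * noise

# In practice `speech` and `noise` would be read from LibriSpeech and FSD50K
# (e.g. with soundfile.read); random signals keep the sketch self-contained.
rng = np.random.default_rng(0)
speech = rng.standard_normal(8000)                 # 1 s at 8 kHz
noise = rng.standard_normal(3000)                  # shorter noise clip
mixture = mix_at_snr(speech, noise, snr_db=rng.uniform(-5.0, 5.0))
print(mixture.shape)
```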
“…We sample the clips in the development set of FSD50K to simulate the noises for training and validation, and those in the evaluation set to simulate the noises for testing. Since our task is single-speaker speech enhancement, following [50] we filter out clips containing any sounds produced by humans, based on the provided sound-event annotation of each clip. Such clips have annotations such as Human voice, Male speech and man speaking, Chuckle and chortle, Yell, etc.…”
Section: A. Dataset for Noisy-Reverberant Speech Enhancement
Confidence: 99%
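The filtering step quoted above amounts to dropping any clip whose annotations intersect a blocklist of human-sound labels. A minimal sketch follows, assuming FSD50K's ground-truth CSV with a comma-separated `labels` column; the blocklist shows only the labels named in the quote, whereas a complete filter would cover the full "Human sounds" subtree of the AudioSet ontology.

```python
# Keep only FSD50K clips with no human-made sounds in their annotations.
import csv

# Partial blocklist: just the example labels from the quote above.
HUMAN_LABELS = {
    "Human_voice", "Male_speech_and_man_speaking",
    "Chuckle_and_chortle", "Yell",
}

def non_human_clips(metadata_csv):
    """Yield FSD50K clip IDs whose labels contain no human-made sounds."""
    with open(metadata_csv, newline="") as f:
        for row in csv.DictReader(f):
            labels = set(row["labels"].split(","))
            if not labels & HUMAN_LABELS:
                yield row["fname"]

# Usage: clip_ids = list(non_human_clips("FSD50K.ground_truth/dev.csv"))
```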