Semi-FedSER: Semi-supervised Learning for Speech Emotion Recognition On Federated Learning using Multiview Pseudo-Labeling

Feng, Tao; Narayanan, Shrikanth

doi:10.48550/arxiv.2203.08810

Cited by 1 publication

(1 citation statement)

References 23 publications

“…There are a handful of attempts in literature for applying FL in speech-related tasks. Some of these applications are: ASR [10,11,12,13,14], Keyword Spotting [15,16], Emotion Recognition [17,18,16], and Speaker Verification [19]. Notably, for combining FL with SSL, the only available works include Federated self-supervised learning (FSSL) [20] for acoustic event detection and [21], where the challenges involved in combining FL & SSL due to hardware limitations on the client are highlighted and a wav2vec 2.0 [4] model is trained with FL on Common-Voice Italian data [22] and fine-tuned for ASR.…”

Section: Related Work (mentioning)

confidence: 99%

Federated Representation Learning for Automatic Speech Recognition

Ramesh, Chennupati, Rao et al. 2023

3rd Symposium on Security and Privacy in Speech Communication

Federated Learning (FL) is a privacy-preserving paradigm, allowing edge devices to learn collaboratively without sharing data. Edge devices like Alexa and Siri are prospective sources of unlabeled audio data that can be tapped to learn robust audio representations. In this work, we bring Self-supervised Learning (SSL) and FL together to learn representations for Automatic Speech Recognition respecting data privacy constraints. We use the speaker and chapter information in the unlabeled speech dataset, Libri-Light, to simulate non-IID speaker-siloed data distributions and pre-train an LSTM encoder with the Contrastive Predictive Coding framework with FedSGD. We show that the pre-trained ASR encoder in FL performs as well as a centrally pre-trained model and produces an improvement of 12-15% (WER) compared to no pre-training. We further adapt the federated pre-trained models to a new language, French, and show a 20% (WER) improvement over no pre-training.
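The abstract names FedSGD as the aggregation scheme over speaker-siloed clients. As a rough illustration only, the sketch below shows one common formulation of a FedSGD round: each client computes a gradient on its own data at the current global weights, and the server applies a single step using the sample-count-weighted average of those gradients. It is not the authors' implementation. The toy linear model with squared error stands in for the LSTM encoder and Contrastive Predictive Coding loss, and the names `local_gradient`, `fedsgd_round`, `spk_a`, and `spk_b` are hypothetical.

```python
from typing import Dict, List, Tuple

Sample = Tuple[List[float], float]  # (features, target)


def local_gradient(weights: List[float], samples: List[Sample]) -> List[float]:
    """Mean squared-error gradient computed on one client's local data."""
    grad = [0.0] * len(weights)
    for x, y in samples:
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        for i, xi in enumerate(x):
            grad[i] += 2.0 * err * xi / len(samples)
    return grad


def fedsgd_round(
    weights: List[float], clients: Dict[str, List[Sample]], lr: float
) -> List[float]:
    """One FedSGD round: average client gradients, then take one server step.

    Only gradients leave the clients; the raw samples stay local.
    """
    total = sum(len(samples) for samples in clients.values())
    agg = [0.0] * len(weights)
    for samples in clients.values():
        grad = local_gradient(weights, samples)
        share = len(samples) / total  # weight each client by its data size
        for i, gi in enumerate(grad):
            agg[i] += share * gi
    return [w - lr * g for w, g in zip(weights, agg)]


# Assumed toy setup: two "speakers" with unequal amounts of data (non-IID silos).
# All targets are generated by the weights [2.0, 1.0].
clients = {
    "spk_a": [([1.0, 0.0], 2.0), ([1.0, 1.0], 3.0)],
    "spk_b": [([0.0, 1.0], 1.0)],
}

weights = [0.0, 0.0]
for _ in range(200):
    weights = fedsgd_round(weights, clients, lr=0.1)
```

Because the client gradients are weighted by sample count, each round in this sketch is equivalent to one full-batch gradient step on the pooled data, even though no client shares its samples.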