Seeing Voices and Hearing Voices: Learning Discriminative Embeddings Using Cross-Modal Self-Supervision

Chung, Soo-Whan; Kang, Hyun Joo; Chung, Joon Son

doi:10.21437/interspeech.2020-1113

Cited by 36 publications

(13 citation statements)

References 35 publications


“…This section compares the self-supervised speaker verification results on the VoxCeleb2 training dataset between the conventional methods and the proposed two-stage framework. The conventional methods contain Disent [48], CDDL [49], GCL [19], I-vector [50] and AAT [20], Chnl [21], ProNCE [22] techniques with the corresponding contrastive loss (i.e., Prot, AProt, SimCLR, MoCo, and ACont). Disent.…”

Section: Results: Uncertainty-aware Probabilistic Speaker Embedding Training Strategy 1) Estimation of Data Uncertainty via the Back-end (mentioning)

confidence: 99%

Bootstrap Equilibrium and Probabilistic Speaker Representation Learning for Self-Supervised Speaker Verification

Mun, Han, Lee, et al. 2021

IEEE Access

In this paper, we propose self-supervised speaker representation learning strategies, which comprise bootstrap equilibrium speaker representation learning in the front-end and uncertainty-aware probabilistic speaker embedding training in the back-end. In the front-end stage, we learn the speaker representations via the bootstrap training scheme with the uniformity regularization term. In the back-end stage, the probabilistic speaker embeddings are estimated by maximizing the mutual likelihood score between the speech samples belonging to the same speaker, which provide not only speaker representations but also data uncertainty. Experimental results show that the proposed bootstrap equilibrium training strategy can effectively help learn the speaker representations and outperforms the conventional methods based on contrastive learning. Also, we demonstrate that the integrated two-stage framework further improves the speaker verification performance on the VoxCeleb1 test set in terms of EER and MinDCF.

Index terms: Speaker verification, self-supervised learning, bootstrap representation learning, probabilistic speaker embedding.
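The abstract above reports verification performance as EER (equal error rate), and a later citation statement mentions cosine distance scoring. As a rough illustration only, the sketch below scores trial pairs by cosine similarity and estimates the EER with a simple threshold sweep. The function names `cosine_score` and `equal_error_rate` are hypothetical, and this is not the evaluation code of any paper listed here.

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between two speaker embeddings."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def equal_error_rate(scores, labels):
    """Approximate EER from trial scores.

    labels: 1 for same-speaker trials, 0 for different-speaker trials.
    Sweeps every observed score as a threshold and returns the mean of
    false-acceptance and false-rejection rates where they are closest.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        accept = scores >= t
        far = np.mean(accept[labels == 0])   # impostor trials accepted
        frr = np.mean(~accept[labels == 1])  # target trials rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return float(eer)
```

A threshold sweep like this is coarse on small trial lists; evaluation toolkits typically interpolate the ROC curve instead.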

“…Cosine Distance Scoring (CDS) is applied to evaluate the performance. Methods [23,25,30,31] are some recent unsupervised neural methods for speaker representation learning. As shown in the table, we can obtain a 15.28% EER using the i-Vector method and 15.11% EER using the MoCo speaker embedding system.…”

Section: Unlabeled Condition (mentioning)

confidence: 99%

Self-Supervised Text-Independent Speaker Verification Using Prototypical Momentum Contrastive Learning

Xia, Zhang, Weng, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

In this study, we investigate self-supervised representation learning for speaker verification (SV). First, we examine a simple contrastive learning approach (SimCLR) with a momentum contrastive (MoCo) learning framework, where the MoCo speaker embedding system utilizes a queue to maintain a large set of negative examples. We show that better speaker embeddings can be learned by momentum contrastive learning. Next, alternative augmentation strategies are explored to normalize extrinsic speaker variabilities of two random segments from the same speech utterance. Specifically, augmentation in the waveform largely improves the speaker representations for SV tasks. The proposed MoCo speaker embedding is further improved when a prototypical memory bank is introduced, which encourages the speaker embeddings to be closer to their assigned prototypes with an intermediate clustering step. In addition, we generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled. Comprehensive experiments on the VoxCeleb dataset demonstrate that our proposed self-supervised approach achieves competitive performance compared with existing techniques, and can approach fully supervised results with partially labeled data.
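To make the queue-based contrastive setup in this abstract more concrete, here is a minimal NumPy sketch of an InfoNCE-style loss with a negative queue, a momentum update for the key encoder, and a fixed-size queue update. It assumes the generic MoCo formulation; the function names, the temperature of 0.07, and the momentum of 0.999 are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce_with_queue(q, k, queue, temperature=0.07):
    """Contrastive loss with queued negatives.

    q, k:  (B, D) embeddings of two segments from the same utterance
           (query-encoder and key-encoder outputs, respectively).
    queue: (K, D) key embeddings from earlier batches, used as negatives.
    temperature is an assumed value, not one reported in the abstract.
    """
    q, k, queue = l2_normalize(q), l2_normalize(k), l2_normalize(queue)
    pos = np.sum(q * k, axis=1, keepdims=True)      # (B, 1) positive logits
    neg = q @ queue.T                               # (B, K) negative logits
    logits = np.concatenate([pos, neg], axis=1) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob.mean())

def momentum_update(key_params, query_params, m=0.999):
    """Move key-encoder parameters slowly toward the query encoder."""
    return [m * kp + (1.0 - m) * qp for kp, qp in zip(key_params, query_params)]

def enqueue(queue, keys):
    """Append the newest keys and drop the oldest, keeping the queue size fixed."""
    return np.concatenate([queue, keys], axis=0)[keys.shape[0]:]
```

In a training loop, gradients would flow only through the query encoder; the key encoder is updated with `momentum_update`, and each batch of keys is pushed into the queue after the loss is computed.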


“…For the word-level classification, mixup with a weight of 0.4 is employed. Self-supervised 71.6 PT-CDDL [8] Self-supervised 75.9 AV-PPC [38] Self-supervised 84.8…”

Section: Data Augmentation (mentioning)

confidence: 99%

LiRA: Learning Visual Speech Representations from Audio Through Self-Supervision

Mira, Petridis, et al. 2021

Interspeech 2021

The large amount of audiovisual content being shared online today has drawn substantial attention to the prospect of audiovisual self-supervised learning. Recent works have focused on each of these modalities separately, while others have attempted to model both simultaneously in a cross-modal fashion. However, comparatively little attention has been given to leveraging one modality as a training objective to learn from the other. In this work, we propose Learning visual speech Representations from Audio via self-supervision (LiRA). Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech. We find that this pre-trained model can be leveraged towards word-level and sentence-level lip-reading through feature extraction and fine-tuning experiments. We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild (LRW) dataset and achieves state-of-the-art performance on Lip Reading Sentences 2 (LRS2) using only a fraction of the total labelled data.
