Within-Sample Variability-Invariant Loss for Robust Speaker Recognition Under Noisy Environments

Cai, Danwei; Cai, Weicheng; Li, Ming

doi:10.1109/icassp40776.2020.9053407

Cited by 42 publications

(21 citation statements)

References 28 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Data augmentation is proven to be an effective strategy for both conventional learning with supervision [24] and contrastive selfsupervision learning [7,8,15] in the context of deep learning. We perform data augmentation with MUSAN dataset [25].…”

Section: Data Augmentationmentioning

confidence: 99%

See 1 more Smart Citation

An Iterative Framework for Self-Supervised Deep Speaker Representation Learning

Cai

Wang

2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

View full text Add to dashboard Cite

In this paper, we propose an iterative framework for self-supervised speaker representation learning based on a deep neural network (DNN). The framework starts with training a self-supervision speaker embedding network by maximizing agreement between different segments within an utterance via a contrastive loss. Taking advantage of DNN's ability to learn from data with label noise, we propose to cluster the speaker embedding obtained from the previous speaker network and use the subsequent class assignments as pseudo labels to train a new DNN. Moreover, we iteratively train the speaker network with pseudo labels generated from the previous step to bootstrap the discriminative power of a DNN. Speaker verification experiments are conducted on the VoxCeleb dataset. The results show that our proposed iterative self-supervised learning framework outperformed previous works using self-supervision. The speaker network after 5 iterations obtains a 61% performance gain over the speaker embedding model trained with contrastive loss.

show abstract

Section: Data Augmentationmentioning

confidence: 99%

“…We use the same network architecture as in [24]. ReLU nonlinear activation and batch normalization are applied to each convolutional layer in ResNet.…”

Section: Contrastive Self-supervised Learning Setupmentioning

confidence: 99%

An Iterative Framework for Self-Supervised Deep Speaker Representation Learning

Cai

Wang

2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

View full text Add to dashboard Cite

show abstract

“…Channel-invariant training. Inspired by [19], we propose channel-invariant loss which is used to force the embedding of augmented segments as similar as its clean version, preventing the network from encoding the undesired channel information into the speaker representation. The model can learn to filter out channel factors.…”

Section: Remove Channel Informationmentioning

confidence: 99%

Contrastive Self-Supervised Learning for Text-Independent Speaker Verification

Zhang

Zou

Wang

2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

View full text Add to dashboard Cite

Current speaker verification models rely on supervised training with massive annotated data. But the collection of labeled utterances from multiple speakers is expensive and facing privacy issues. To open up an opportunity for utilizing massive unlabeled utterance data, our work exploits a contrastive selfsupervised learning (CSSL) approach for text-independent speaker verification task. The core principle of CSSL lies in minimizing the distance between the embeddings of augmented segments truncated from the same utterance as well as maximizing those from different utterances. We proposed channel-invariant loss to prevent the network from encoding the undesired channel information into the speaker representation. Bearing these in mind, we conduct intensive experiments on VoxCeleb1&2 datasets. The self-supervised thin-ResNet34 fine-tuned with only 5% of the labeled data can achieve comparable performance to the fully supervised model, which is meaningful to economize lots of manual annotation.

show abstract

“…For input-level, models usually can be adapted by training with enhanced [8] or domain-translated [9] input features. For adaptation at embedding-level, it often targets at minimizing certain distances between source and target domains to align them in the same embedding space, such as cosine distance [10], mean squared error (MSE) [11], and maximum mean discrepancy (MMD) [12]. However, this method usually requires parallel or artificial simulated data, which cannot generalize well to real-world scenarios.…”

Section: Introductionmentioning

confidence: 99%

DEAAN: Disentangled Embedding and Adversarial Adaptation Network for Robust Speaker Representation Learning

Sang¹,

Xia²,

Hansen³

2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

View full text Add to dashboard Cite

Despite speaker verification has achieved significant performance improvement with the development of deep neural networks, domain mismatch is still a challenging problem in this field. In this study, we propose a novel framework to disentangle speaker-related and domain-specific features and apply domain adaptation on the speaker-related feature space solely. Instead of performing domain adaptation directly on the feature space where domain information is not removed, using disentanglement can efficiently boost adaptation performance. To be specific, our model's input speech from the source and target domains is first encoded into different latent feature spaces. The adversarial domain adaptation is conducted on the shared speaker-related feature space to encourage the property of domain-invariance. Further, we minimize the mutual information between speaker-related and domain-specific features for both domains to enforce the disentanglement. Experimental results on the VOiCES dataset demonstrate that our proposed framework can effectively generate more speaker-discriminative and domain-invariant speaker representations with a relative 20.3% reduction of EER compared to the original ResNet-based system.

show abstract

Within-Sample Variability-Invariant Loss for Robust Speaker Recognition Under Noisy Environments

Cited by 42 publications

References 28 publications

An Iterative Framework for Self-Supervised Deep Speaker Representation Learning

An Iterative Framework for Self-Supervised Deep Speaker Representation Learning

Contrastive Self-Supervised Learning for Text-Independent Speaker Verification

DEAAN: Disentangled Embedding and Adversarial Adaptation Network for Robust Speaker Representation Learning

Contact Info

Product

Resources

About