ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9413403
DEAAN: Disentangled Embedding and Adversarial Adaptation Network for Robust Speaker Representation Learning

Abstract: Although speaker verification has achieved significant performance improvements with the development of deep neural networks, domain mismatch remains a challenging problem in this field. In this study, we propose a novel framework that disentangles speaker-related and domain-specific features and applies domain adaptation solely to the speaker-related feature space. Instead of performing domain adaptation directly on a feature space from which domain information has not been removed, using disentanglement can efficiently boo…


Cited by 15 publications (6 citation statements)
References 28 publications
“…To address this problem, the authors of [53], [54] minimized the maximum mean discrepancy (MMD) [55] across different languages to create a language-invariant speaker embedding space. In addition, domain adversarial training [56] has been successfully applied [57]-[61] to produce language-invariant speaker embeddings. There are also DA methods that directly adapt the PLDA covariance matrices to match the target distribution, e.g., CORAL+ [62] and Kaldi's PLDA adaptation [63].…”
Section: Challenges In Speaker Verification
Citation type: mentioning; confidence: 99%
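The MMD criterion described in this citation statement can be illustrated with the standard biased estimator over two batches of embeddings; the following is a minimal NumPy sketch, not code from the cited works, and the RBF kernel width `sigma` is an illustrative choice:

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """RBF (Gaussian) kernel matrix between the rows of a and b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    """Biased estimate of the squared MMD between embedding batches x and y.
    Minimizing this across, e.g., languages pulls the two embedding
    distributions toward each other."""
    return (rbf_kernel(x, x, sigma).mean()
            + rbf_kernel(y, y, sigma).mean()
            - 2.0 * rbf_kernel(x, y, sigma).mean())
```

By construction `mmd2(x, x)` is exactly zero, and the value grows as the two batches come from increasingly different distributions.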
“…Recently, speaker embeddings learned by deep neural network (DNN)-based architectures such as the x-vector and ResNet have shown stronger performance on speaker recognition (SR) than the previous state-of-the-art method, the i-vector-based approach [2]. The DNN-based approach can extract speaker-discriminative and robust speaker embeddings by training on a large number of utterances from a large-scale SR dataset [3]. However, DNN-based speaker-discriminative embeddings do not represent the speech features themselves and cannot be used effectively as feature vectors for analyzing speech.…”
Section: Introduction
Citation type: mentioning; confidence: 99%
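x-vector systems like those mentioned above map a variable-length sequence of frame-level features to a fixed-size utterance-level embedding via statistics pooling; a minimal NumPy sketch of that pooling step (the function name is illustrative):

```python
import numpy as np

def stats_pooling(frames):
    """Statistics pooling as used in x-vector systems: map frame-level
    features of shape (num_frames, feat_dim) to a fixed-size vector of
    shape (2 * feat_dim,) by concatenating the mean and standard
    deviation over time."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])
```

In a full system this pooled vector is passed through further affine layers, one of which yields the speaker embedding.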
“…For instance, the joint factor embedding (JFE) framework [18] simultaneously extracts speaker and nuisance (i.e., non-speaker) embeddings and maximizes the entropy (or uncertainty) of each embedding on its opposite task, while minimizing the correlation between the two embeddings using the mean absolute Pearson correlation (MAPC), computed batch-wise. Similarly, [19], [20] divided features into speaker and residual embeddings and increased their uncertainty on the contrary task, and [21] minimized mutual information via a mutual information neural estimator (MINE) with a gradient reversal layer (GRL). Additionally, these works adopted an autoencoder framework in which a merged embedding is trained to retain the complete information of the input speech [19]-[21].…”
Section: Introduction
Citation type: mentioning; confidence: 99%
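The batch-wise MAPC penalty described in this statement can be sketched as follows; this is a minimal NumPy illustration under the assumption that both embeddings are batched as (batch, dim) arrays, not code from [18]:

```python
import numpy as np

def mapc(spk, nui, eps=1e-8):
    """Batch-wise mean absolute Pearson correlation between every
    dimension of the speaker embeddings `spk` and every dimension of
    the nuisance embeddings `nui`, both of shape (batch, dim).
    Minimizing this value decorrelates the two embeddings."""
    s = (spk - spk.mean(axis=0)) / (spk.std(axis=0) + eps)  # standardize per dim
    n = (nui - nui.mean(axis=0)) / (nui.std(axis=0) + eps)
    corr = s.T @ n / spk.shape[0]  # (dim_spk, dim_nui) correlation matrix
    return np.abs(corr).mean()
```

For independent random embeddings the value is near zero, while feeding the same batch to both arguments yields a clearly larger value because of the unit diagonal of the correlation matrix.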
“…Similarly, [19], [20] divided features into speaker and residual embeddings and increased their uncertainty on the contrary task, and [21] minimized mutual information via a mutual information neural estimator (MINE) with a gradient reversal layer (GRL). Additionally, these works adopted an autoencoder framework in which a merged embedding is trained to retain the complete information of the input speech [19]-[21]. However, naively increasing uncertainty on the other task does not guarantee disentanglement.…”
Section: Introduction
Citation type: mentioning; confidence: 99%
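The GRL referred to in these statements is the identity in the forward pass but flips (and optionally scales) the gradient in the backward pass, so the encoder learns to remove the information the adversarial head needs. A minimal sketch of that forward/backward contract, written as a plain class rather than inside an autograd framework (the class and parameter names are illustrative):

```python
import numpy as np

class GradReverse:
    """Gradient reversal layer: identity in the forward pass, but the
    backward pass multiplies incoming gradients by -lambd, so the layers
    below are updated to *hurt* the downstream (e.g. domain) classifier."""
    def __init__(self, lambd=1.0):
        self.lambd = lambd

    def forward(self, x):
        return x  # identity forward

    def backward(self, grad_output):
        return -self.lambd * grad_output  # reversed, scaled gradient
```

In PyTorch this contract is typically implemented as a custom `torch.autograd.Function`; the class above only illustrates the mechanism.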