Interspeech 2020
DOI: 10.21437/interspeech.2020-1113
Seeing Voices and Hearing Voices: Learning Discriminative Embeddings Using Cross-Modal Self-Supervision

Cited by 36 publications (13 citation statements). References 35 publications.
“…This section compares the self-supervised speaker verification results on the VoxCeleb2 training dataset between the conventional methods and the proposed two-stage framework. The conventional methods include Disent [48], CDDL [49], GCL [19], I-vector [50], AAT [20], Chnl [21], and ProNCE [22], with the corresponding contrastive losses (i.e., Prot, AProt, SimCLR, MoCo, and ACont). Disent.…”
Section: Results: Uncertainty-aware Probabilistic Speaker Embedding Training Strategy 1) Estimation Of Data Uncertainty Via the Back-end mentioning
confidence: 99%
“…Cosine Distance Scoring (CDS) is applied to evaluate the performance. The methods of [23, 25, 30, 31] are recent unsupervised neural approaches to speaker representation learning. As shown in the table, we obtain a 15.28% EER with the i-Vector method and a 15.11% EER with the MoCo speaker embedding system.…”
Section: Unlabeled Conditionmentioning
confidence: 99%
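The Cosine Distance Scoring mentioned above is the standard back-end for comparing speaker embeddings: two utterance embeddings are scored by cosine similarity and a trial is accepted if the score exceeds a threshold tuned on a development set. A minimal sketch (the function names and the threshold value are illustrative, not from the cited papers):

```python
import numpy as np

def cosine_distance_score(emb1, emb2):
    """Cosine similarity between two speaker embeddings.

    Higher score -> more likely the same speaker. Embeddings are
    L2-normalized implicitly by dividing by their norms.
    """
    emb1 = np.asarray(emb1, dtype=float)
    emb2 = np.asarray(emb2, dtype=float)
    return float(np.dot(emb1, emb2)
                 / (np.linalg.norm(emb1) * np.linalg.norm(emb2)))

def verify(emb1, emb2, threshold=0.5):
    """Accept a verification trial if the cosine score clears the
    threshold. In practice the threshold is chosen to hit a target
    operating point (e.g., the EER) on a held-out development set;
    0.5 here is only a placeholder."""
    return cosine_distance_score(emb1, emb2) >= threshold
```

The EER figures quoted above (e.g., 15.11% for the MoCo embeddings) correspond to the operating point where false-accept and false-reject rates of this scoring rule are equal.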
“…For the word-level classification, mixup with a weight of 0.4 is employed. [Table excerpt: (method name truncated) Self-supervised 71.6; PT-CDDL [8] Self-supervised 75.9; AV-PPC [38] Self-supervised 84.8]…”
Section: Data Augmentationmentioning
confidence: 99%
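The mixup augmentation referenced in that excerpt blends pairs of training examples and their labels with a coefficient drawn from a Beta distribution; the quoted "weight of 0.4" is most plausibly the Beta parameter alpha, though the cited paper does not spell this out here. A minimal sketch under that assumption (all names are illustrative):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.4, rng=None):
    """Mixup augmentation: convex combination of two examples.

    Draws lam ~ Beta(alpha, alpha) and returns the blended input and
    blended (one-hot) label. alpha=0.4 matches the weight quoted in
    the excerpt above, assuming it denotes the Beta parameter.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)  # mixing coefficient in [0, 1]
    x = lam * np.asarray(x1, dtype=float) + (1 - lam) * np.asarray(x2, dtype=float)
    y = lam * np.asarray(y1, dtype=float) + (1 - lam) * np.asarray(y2, dtype=float)
    return x, y
```

Because the labels are mixed with the same coefficient as the inputs, training with the usual cross-entropy on the blended label is equivalent to a lam-weighted sum of the two per-example losses.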