2020
DOI: 10.48550/arxiv.2003.11982
Preprint

In defence of metric learning for speaker recognition

Joon Son Chung,
Jaesung Huh,
Seongkyu Mun
et al.

Abstract: The objective of this paper is 'open-set' speaker recognition of unseen speakers, where ideal embeddings should be able to condense information into a compact utterance-level representation that has small intra-class (same speaker) and large inter-class (different speakers) distance. A popular belief in speaker recognition is that networks trained with classification objectives outperform metric learning methods. In this paper, we present an extensive evaluation of most recent loss functions for speaker recogni…


Cited by 33 publications (71 citation statements)
References 39 publications
“…We denote the proposed model as STB-ASV. For the STB-ASV training, the network structure of its initial single-channel ASV is the same as in [23], which contains three main components: a front-end residual convolutional neural network (ResNet) [18], a self-attentive pooling (SAP) [24] layer and a fully-connected layer. It was trained for 200 epochs on the Librispeech corpus.…”
Section: Methods
confidence: 99%
“…Front-end embedding extractor. The CNN model used as the front-end embedding extractor to train the householdadapted scoring model with VoxCeleb1 data is Half-ResNet34 [23,24], which has half of the channel numbers of the original ResNet34. The model was trained on VoxCeleb2 [8] with 5994 speakers for 100 epochs on a single Tesla V100 Nvidia GPU.…”
Section: Model Training
confidence: 99%
“…For contrastive learning objectives, following the implementation in [22], we randomly sample M segments from each of N utterances, whose embeddings are x_{j,i} where 1 ≤ j ≤ N and 1 ≤ i ≤ M. The segments sampled from the same utterance are considered to be from the same speaker, and segments from different utterances are considered to be from different speakers.…”
Section: Contrastive Learning Objectives For Self-Supervised Training
confidence: 99%
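The batch construction described in this statement (M segments sampled from each of N utterances, with same-utterance segments treated as positives) can be sketched as follows; the embedding dimension, batch sizes, and random vectors standing in for a real embedding network are all placeholder assumptions, not the cited implementation:

```python
import numpy as np

def sample_batch(N=4, M=3, dim=8, seed=0):
    """Build a batch of embeddings x[j, i] for N utterances x M segments.

    Segments sharing the first index j are treated as the same speaker
    (positives); segments with different j are treated as different
    speakers (negatives). Random vectors stand in for a real network.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((N, M, dim))
    # L2-normalise each embedding so cosine similarity is a dot product
    x /= np.linalg.norm(x, axis=-1, keepdims=True)
    return x

x = sample_batch()
print(x.shape)  # (4, 3, 8)
```

In a real pipeline the (N, M, dim) tensor would come from forwarding the sampled audio segments through the speaker network; only the indexing convention x_{j,i} is taken from the statement above.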
“…For prototypical loss (ProtoLoss), each mini-batch contains a support set S and a query set Q. As in the implementation in [22], the M-th segment from each utterance is considered as the query. Then the prototype (centroid) is defined as:…”
Section: Angular Prototypical Loss
confidence: 99%
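A minimal sketch of the prototypical formulation described above, using an angular (cosine) similarity with a learnable scale w and bias b as in the angular prototypical variant; the parameter values and random embeddings are illustrative assumptions, not the cited model:

```python
import numpy as np

def angular_prototypical_loss(x, w=10.0, b=-5.0):
    """x: (N, M, dim) L2-normalised embeddings, N speakers x M segments.

    The M-th segment of each utterance is the query; the prototype
    (centroid) is the mean of the remaining M-1 support segments.
    Similarity S[j, k] = w * cos(query_j, centroid_k) + b, and the loss
    is softmax cross-entropy with the matching centroid as the target.
    """
    query = x[:, -1]                    # (N, dim) queries
    proto = x[:, :-1].mean(axis=1)      # (N, dim) centroids of support set
    proto /= np.linalg.norm(proto, axis=-1, keepdims=True)
    S = w * (query @ proto.T) + b       # (N, N) scaled cosine similarities
    # cross-entropy: targets lie on the diagonal (matching speaker)
    log_Z = np.log(np.exp(S).sum(axis=1))
    return float(np.mean(log_Z - np.diag(S)))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3, 8))
x /= np.linalg.norm(x, axis=-1, keepdims=True)
loss = angular_prototypical_loss(x)
```

The plain (non-angular) prototypical loss would instead use negative squared Euclidean distance between query and centroid in place of the scaled cosine; the angular form fixes the similarity metric and lets w and b be learned.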