Interspeech 2019
DOI: 10.21437/Interspeech.2019-1986
End-to-End Losses Based on Speaker Basis Vectors and All-Speaker Hard Negative Mining for Speaker Verification

Abstract: In recent years, speaker verification has primarily been performed using deep neural networks that are trained to output embeddings from input features such as spectrograms or Mel-filterbank energies. Studies that design various loss functions, including metric learning, have been widely explored. In this study, we propose two end-to-end loss functions for speaker verification using the concept of speaker bases, which are trainable parameters. One loss function is designed to further increase the interspeaker variation…

Cited by 24 publications (22 citation statements) · References 16 publications (23 reference statements)

“…To consider both inter-class and intra-class covariance, we utilize center loss [19] and speaker basis loss [20] in addition to categorical cross-entropy loss for DNN training. We adopt center loss [19] to minimize intra-class covariance while the embedding in the last hidden layer remains discriminative.…”
Section: Additional Objective Functions For Speaker Embedding Enhancement
confidence: 99%
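As a concrete illustration of the center-loss term described in the excerpt above, here is a minimal PyTorch-style sketch: each speaker (class) owns a trainable center, and the loss is half the squared distance between each embedding x_i and its class center c_{y_i}, averaged over the mini-batch, which shrinks intra-class covariance while the cross-entropy term keeps the embeddings discriminative. The class name and the mean reduction are illustrative choices, not taken from [19] or the citing system.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Center loss [19]: pull each embedding toward a trainable
    per-class center, reducing intra-class covariance."""

    def __init__(self, num_classes: int, embed_dim: int):
        super().__init__()
        # One trainable center per speaker (class).
        self.centers = nn.Parameter(torch.randn(num_classes, embed_dim))

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # L_C = (1/2) * ||x_i - c_{y_i}||^2, averaged over the mini-batch.
        diffs = embeddings - self.centers[labels]
        return 0.5 * diffs.pow(2).sum(dim=1).mean()
```

In the setup quoted above, this term would be weighted and added to the categorical cross-entropy (and speaker basis) losses, e.g. loss = ce + lambda_c * center_loss, with lambda_c a tuning hyperparameter.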
“…where x_i refers to the embedding of the i-th utterance, c_{y_i} refers to the center of class y_i, and N refers to the size of a mini-batch. Speaker basis loss [20] aims to further maximize inter-class covariance. This loss function considers a weight vector between the last hidden layer and a node of the softmax output layer as a basis vector for the corresponding speaker and is formulated as: …”
[Results-table fragment displaced into the excerpt above, EER (%): x-vector [21] 11.3; x-vector (w/ augment) [21] 9.9; RawNet 4.8]
Section: Additional Objective Functions For Speaker Embedding Enhancement
confidence: 99%
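The excerpt's formulation of the speaker basis loss is lost to extraction damage, so the following is only a hedged sketch of the idea it describes: treat each row of the softmax output layer's weight matrix as the basis vector of one speaker, and penalize the pairwise cosine similarity between bases so that they spread apart, increasing inter-class covariance. The exact loss in [20] may differ; the function name and the mean reduction here are assumptions.

```python
import torch
import torch.nn.functional as F

def speaker_basis_loss(softmax_weight: torch.Tensor) -> torch.Tensor:
    """Sketch of a speaker basis loss in the spirit of [20].

    softmax_weight: (num_speakers, embed_dim) weight matrix of the
    output layer; row k is treated as speaker k's basis vector.
    """
    basis = F.normalize(softmax_weight, dim=1)   # unit-norm basis vectors
    sim = basis @ basis.t()                      # pairwise cosine similarities
    n = sim.size(0)
    off_diag = ~torch.eye(n, dtype=torch.bool, device=sim.device)
    # Mean similarity between distinct speakers; minimizing it pushes
    # the basis vectors apart.
    return sim[off_diag].mean()
```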
“…In this experiment, we compare the embedding vectors obtained from the proposed joint factor embedding scheme and the conventional x-vector framework, along with techniques reported in recent studies including VGG-M, ResNet-34, and end-to-end verification systems [39], [40]. The compared methods are as follows:…”
Section: Comparison Between The Joint Factor Embedding Scheme And …
confidence: 99%
“…
• VGG [39]: the performance of the embedding extracted from VGG-M, which is a CNN architecture known to perform well on image and speaker classification, as reported in [39];
• Generalized end-to-end [40]: the performance of the ResNet-34-based end-to-end speaker verification system trained with the generalized end-to-end loss (Eq. (6)), as reported in [39];
• All-speaker hard negative mining end-to-end [40]: …”
Section: Comparison Between The Joint Factor Embedding Scheme And …
confidence: 99%
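For reference, the generalized end-to-end (GE2E) loss named in the comparison above (introduced by Wan et al., 2018) can be sketched as follows: a mini-batch holds several utterances per speaker, each utterance embedding is scored by cosine similarity against every speaker centroid (with its own speaker's centroid computed leave-one-out), and a cross-entropy over the scaled similarities is minimized. This is a simplified softmax-variant sketch, not the cited system's exact implementation; in GE2E the scale w and offset b are learnable, while fixed values are used here for brevity.

```python
import torch
import torch.nn.functional as F

def ge2e_loss(emb: torch.Tensor, w: float = 10.0, b: float = -5.0) -> torch.Tensor:
    """Softmax-variant GE2E sketch. emb: (num_speakers, num_utts, dim)."""
    s, u, _ = emb.shape
    emb = F.normalize(emb, dim=2)
    centroids = F.normalize(emb.mean(dim=1), dim=1)                  # (s, d)
    # Leave-one-out centroid of each utterance's own speaker.
    loo = F.normalize((emb.sum(dim=1, keepdim=True) - emb) / (u - 1), dim=2)
    sim = torch.einsum('sud,kd->suk', emb, centroids)                # (s, u, s)
    own = (emb * loo).sum(dim=2)                                     # (s, u)
    idx = torch.arange(s)
    sim[idx, :, idx] = own               # use leave-one-out sims for own speaker
    logits = w * sim + b
    labels = idx.repeat_interleave(u)    # each utterance's true speaker index
    return F.cross_entropy(logits.reshape(s * u, s), labels)
```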
“…A slightly modified ResNet was used for modeling the spectrograms, accounting for different stride sizes in the time and frequency domains due to the high resolution in the frequency domain, and the number of residual blocks was adjusted to fit the provided ASV2019 physical access dataset. The raw-waveform CNN-GRU model proposed in [17] was used with a few modifications: one fewer residual block, a different input utterance length specified in the training phase to fit the dataset, and additional loss functions for training (center loss [25] and speaker basis loss [26]). This model first extracts 128-dimensional frame-level representations using 1-dimensional convolutional layers.…”
Section: DNN Architecture
confidence: 99%
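To make the quoted description concrete, below is a hedged, minimal sketch of a raw-waveform CNN-GRU model in the spirit of [17]: strided 1-D convolutions extract 128-dimensional frame-level representations directly from the waveform, and a GRU aggregates them into an utterance-level embedding. The residual blocks, kernel sizes, and strides of the actual model are simplified or invented here; only the overall structure follows the text.

```python
import torch
import torch.nn as nn

class RawWaveformCNNGRU(nn.Module):
    """Sketch of a raw-waveform CNN-GRU speaker embedding extractor."""

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        # Strided 1-D convolutions: raw samples -> 128-dim frame features.
        self.frame_layers = nn.Sequential(
            nn.Conv1d(1, 128, kernel_size=3, stride=3),
            nn.BatchNorm1d(128),
            nn.LeakyReLU(),
            nn.Conv1d(128, 128, kernel_size=3, stride=3),
            nn.BatchNorm1d(128),
            nn.LeakyReLU(),
        )
        # GRU aggregates frame-level features into one utterance embedding.
        self.gru = nn.GRU(input_size=128, hidden_size=embed_dim, batch_first=True)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, num_samples) raw waveform.
        frames = self.frame_layers(wav.unsqueeze(1))   # (batch, 128, T)
        _, h = self.gru(frames.transpose(1, 2))        # h: (1, batch, embed_dim)
        return h.squeeze(0)                            # (batch, embed_dim)

# Usage: emb = RawWaveformCNNGRU()(torch.randn(4, 16000))  # 1 s of 16 kHz audio
```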