Interspeech 2019
DOI: 10.21437/interspeech.2019-2205

x-Vector DNN Refinement with Full-Length Recordings for Speaker Recognition

Abstract: State-of-the-art text-independent speaker recognition systems for long recordings (a few minutes) are based on deep neural network (DNN) speaker embeddings. Current implementations of this paradigm use short speech segments (a few seconds) to train the DNN. This introduces a mismatch between training and inference when extracting embeddings for long duration recordings. To address this, we present a DNN refinement approach that updates a subset of the DNN parameters with full recordings to reduce this mismatch…
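As a rough illustration of the refinement described in the abstract, the sketch below fine-tunes only the post-pooling layers of a toy x-vector-style network on a full-length recording, leaving the frame-level (pre-pooling) layers frozen. The TinyXVector module, layer sizes, and training details are hypothetical stand-ins, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Toy stand-in for an x-vector network: frame-level encoder, statistics
# pooling, then segment-level layers. Purely illustrative.
class TinyXVector(nn.Module):
    def __init__(self, feat_dim=30, emb_dim=128, num_spk=100):
        super().__init__()
        self.frame_encoder = nn.Sequential(
            nn.Conv1d(feat_dim, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.segment_layers = nn.Sequential(
            nn.Linear(512, emb_dim), nn.ReLU(), nn.Linear(emb_dim, num_spk),
        )

    def forward(self, feats):  # feats: (batch, feat_dim, num_frames)
        h = self.frame_encoder(feats)
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)  # statistics pooling
        return self.segment_layers(stats)

model = TinyXVector()  # in practice this would be a pretrained model

# Refine only a subset of the parameters: freeze the pre-pooling encoder,
# update the post-pooling layers with full-length recordings.
for p in model.frame_encoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=1e-4)

# One full-length recording per step (variable length, batch size 1).
recording = torch.randn(1, 30, 18000)   # e.g. ~3 minutes of 10 ms frames
speaker_id = torch.tensor([3])
loss = nn.functional.cross_entropy(model(recording), speaker_id)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```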

Cited by 38 publications (18 citation statements); citing publications span 2020 to 2024.
References 21 publications.

Citation statements (ordered by relevance):
“…The code for this paper can be found at: https://github.com/clovaai/voxceleb_trainer popular due to their ease of implementation and good performance [17,18,19,20,21,22,23,24]. However, training with AM-Softmax and AAM-Softmax has proven to be challenging since they are sensitive to the value of scale and margin in the loss function.…”
Section: Introduction (mentioning)
Confidence: 99%
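To make the quoted sensitivity to scale and margin concrete, below is a minimal, generic sketch of an AAM-Softmax (additive angular margin) loss; the scale s and margin m are exactly the hyperparameters the citing authors describe as hard to tune. This is an illustration, not code from the linked voxceleb_trainer repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    """Additive angular margin softmax (illustrative sketch)."""
    def __init__(self, emb_dim, num_spk, s=30.0, m=0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_spk, emb_dim))
        self.s, self.m = s, m  # scale and margin: the sensitive hyperparameters

    def forward(self, emb, label):
        # Cosine similarity between L2-normalised embeddings and class weights.
        cosine = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin m only to the target-class angle.
        target = F.one_hot(label, cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cosine)
        return F.cross_entropy(self.s * logits, label)

# Example: 192-dim embeddings, 100 speakers, batch of 8.
loss_fn = AAMSoftmax(emb_dim=192, num_spk=100)
loss = loss_fn(torch.randn(8, 192), torch.randint(0, 100, (8,)))
```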
“…Moreover, this can potentially correct the duration mismatch between training and test conditions [15]. An effective method to decrease GPU memory requirements and to prevent overfitting when training with longer length utterances is to freeze the pre-pooling layers of the model [16]. However, we argue this can prevent these layers from sufficiently adapting to the increased duration condition, especially when such layers share global statistics through the SE-blocks in the ECAPA-TDNN architecture.…”
Section: Fine-tuning Configuration (mentioning)
Confidence: 99%
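One way to picture the compromise hinted at in the quoted argument: rather than hard-freezing the pre-pooling layers, assign them a much smaller learning rate than the post-pooling layers so they can still adapt to longer inputs. This is purely an illustrative sketch (reusing the hypothetical frame_encoder / segment_layers split from the earlier example), not a configuration reported by either cited paper.

```python
import torch

# Illustrative compromise between hard-freezing and fully fine-tuning:
# per-group learning rates, so pre-pooling layers still adapt, just slowly.
# `model` is assumed to be the TinyXVector stand-in from the earlier sketch.
optimizer = torch.optim.SGD(
    [
        {"params": model.frame_encoder.parameters(), "lr": 1e-5},   # pre-pooling: small steps
        {"params": model.segment_layers.parameters(), "lr": 1e-3},  # post-pooling: larger steps
    ],
    momentum=0.9,
)
```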
“…It was found that the lower dimension of segment 6 and 7 helped in Speaker Verification in the case of 5-second-long utterances, but achieved higher EER on the original long utterances on the NIST SRE 2010 dataset. On the other hand, Garcia-Romero et al. [32] tried to optimize the x-vector system for long utterances (with 2-4 seconds duration) by a DNN refinement approach that updates a subset of the DNN parameters with full recordings and modifies the DNN architecture to produce embeddings optimized for cosine distance scoring. The results show that the method produces lower minDCF (minimum Decision Cost Function), but slightly higher EER than the baseline x-vector approach.…”
Section: The X-vector (mentioning)
Confidence: 99%
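For reference, the cosine distance scoring mentioned in the quoted statement reduces to a normalised dot product between two embeddings. A minimal sketch, with the 192-dimensional embedding size chosen arbitrarily:

```python
import torch
import torch.nn.functional as F

def cosine_score(emb_a: torch.Tensor, emb_b: torch.Tensor) -> float:
    """Cosine similarity between two speaker embeddings; higher means the trial is more likely a target (same-speaker) trial."""
    return F.cosine_similarity(emb_a.unsqueeze(0), emb_b.unsqueeze(0)).item()

score = cosine_score(torch.randn(192), torch.randn(192))  # embedding size is illustrative
```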