Interspeech 2019
DOI: 10.21437/interspeech.2019-2205

x-Vector DNN Refinement with Full-Length Recordings for Speaker Recognition

Abstract: State-of-the-art text-independent speaker recognition systems for long recordings (a few minutes) are based on deep neural network (DNN) speaker embeddings. Current implementations of this paradigm use short speech segments (a few seconds) to train the DNN. This introduces a mismatch between training and inference when extracting embeddings for long duration recordings. To address this, we present a DNN refinement approach that updates a subset of the DNN parameters with full recordings to reduce this mismatch…
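As a rough illustration of the refinement described in the abstract, the sketch below fine-tunes only the post-pooling layers of a toy x-vector-style network on a full-length recording, leaving the frame-level (pre-pooling) layers frozen. The TinyXVector module, layer sizes, and training details are hypothetical stand-ins, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Toy stand-in for an x-vector network: frame-level encoder, statistics
# pooling, then segment-level layers. Purely illustrative.
class TinyXVector(nn.Module):
    def __init__(self, feat_dim=30, emb_dim=128, num_spk=100):
        super().__init__()
        self.frame_encoder = nn.Sequential(
            nn.Conv1d(feat_dim, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.segment_layers = nn.Sequential(
            nn.Linear(512, emb_dim), nn.ReLU(), nn.Linear(emb_dim, num_spk),
        )

    def forward(self, feats):  # feats: (batch, feat_dim, num_frames)
        h = self.frame_encoder(feats)
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)  # statistics pooling
        return self.segment_layers(stats)

model = TinyXVector()  # in practice this would be a pretrained model

# Refine only a subset of the parameters: freeze the pre-pooling encoder,
# update the post-pooling layers with full-length recordings.
for p in model.frame_encoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=1e-4)

# One full-length recording per step (variable length, batch size 1).
recording = torch.randn(1, 30, 18000)   # e.g. ~3 minutes of 10 ms frames
speaker_id = torch.tensor([3])
loss = nn.functional.cross_entropy(model(recording), speaker_id)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```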

Cited by 38 publications (18 citation statements); citing publications span 2020 to 2024.
References 21 publications.

Citation statements (ordered by relevance):
“…The code for this paper can be found at: https://github.com/clovaai/voxceleb_trainer popular due to their ease of implementation and good performance [17,18,19,20,21,22,23,24]. However, training with AM-Softmax and AAM-Softmax has proven to be challenging since they are sensitive to the value of scale and margin in the loss function.…”
Section: Introduction (mentioning)
Confidence: 99%
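To make the quoted sensitivity to scale and margin concrete, below is a minimal, generic sketch of an AAM-Softmax (additive angular margin) loss; the scale s and margin m are exactly the hyperparameters the citing authors describe as hard to tune. This is an illustration, not code from the linked voxceleb_trainer repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    """Additive angular margin softmax (illustrative sketch)."""
    def __init__(self, emb_dim, num_spk, s=30.0, m=0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_spk, emb_dim))
        self.s, self.m = s, m  # scale and margin: the sensitive hyperparameters

    def forward(self, emb, label):
        # Cosine similarity between L2-normalised embeddings and class weights.
        cosine = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin m only to the target-class angle.
        target = F.one_hot(label, cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cosine)
        return F.cross_entropy(self.s * logits, label)

# Example: 192-dim embeddings, 100 speakers, batch of 8.
loss_fn = AAMSoftmax(emb_dim=192, num_spk=100)
loss = loss_fn(torch.randn(8, 192), torch.randint(0, 100, (8,)))
```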
“…Moreover, this can potentially correct the duration mismatch between training and test conditions [15]. An effective method to decrease GPU memory requirements and to prevent overfitting when training with longer length utterances is to freeze the pre-pooling layers of the model [16]. However, we argue this can prevent these layers from sufficiently adapting to the increased duration condition, especially when such layers share global statistics through the SE-blocks in the ECAPA-TDNN architecture.…”
Section: Fine-tuning Configuration (mentioning)
Confidence: 99%
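One way to picture the compromise hinted at in the quoted argument: rather than hard-freezing the pre-pooling layers, assign them a much smaller learning rate than the post-pooling layers so they can still adapt to longer inputs. This is purely an illustrative sketch (reusing the hypothetical frame_encoder / segment_layers split from the earlier example), not a configuration reported by either cited paper.

```python
import torch

# Illustrative compromise between hard-freezing and fully fine-tuning:
# per-group learning rates, so pre-pooling layers still adapt, just slowly.
# `model` is assumed to be the TinyXVector stand-in from the earlier sketch.
optimizer = torch.optim.SGD(
    [
        {"params": model.frame_encoder.parameters(), "lr": 1e-5},   # pre-pooling: small steps
        {"params": model.segment_layers.parameters(), "lr": 1e-3},  # post-pooling: larger steps
    ],
    momentum=0.9,
)
```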
“…It was found that the lower dimension of segment 6 and 7 helped in Speaker Verification in the case of 5-second-long utterances, but achieved higher EER on the original long utterances on the NIST SRE 2010 dataset. On the other hand, Garcia-Romero et al. [32] tried to optimize the x-vector system for long utterances (with 2-4 seconds duration) by a DNN refinement approach that updates a subset of the DNN parameters with full recordings and modifies the DNN architecture to produce embeddings optimized for cosine distance scoring. The results show that the method produces lower minDCF (minimum Decision Cost Function), but slightly higher EER than the baseline x-vector approach.…”
Section: The X-vector (mentioning)
Confidence: 99%
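For reference, the cosine distance scoring mentioned in the quoted statement reduces to a normalised dot product between two embeddings. A minimal sketch, with the 192-dimensional embedding size chosen arbitrarily:

```python
import torch
import torch.nn.functional as F

def cosine_score(emb_a: torch.Tensor, emb_b: torch.Tensor) -> float:
    """Cosine similarity between two speaker embeddings; higher means the trial is more likely a target (same-speaker) trial."""
    return F.cosine_similarity(emb_a.unsqueeze(0), emb_b.unsqueeze(0)).item()

score = cosine_score(torch.randn(192), torch.randn(192))  # embedding size is illustrative
```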