2018
DOI: 10.48550/arxiv.1811.02066
Preprint

How to Improve Your Speaker Embeddings Extractor in Generic Toolkits

Abstract: Recently, speaker embeddings extracted with deep neural networks became the state-of-the-art method for speaker verification. In this paper we aim to facilitate its implementation on a more generic toolkit than Kaldi, which we anticipate to enable further improvements on the method. We examine several tricks in training, such as the effects of normalizing input features and pooled statistics, different methods for preventing overfitting, as well as alternative nonlinearities that can be used instead of Rectified Linear Units. […]
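The "pooled statistics" mentioned in the abstract are the per-segment mean and standard deviation of the frame-level network's outputs. Below is a minimal sketch of statistics pooling, assuming PyTorch; the tensor shapes and the final rescaling of the pooled vector are illustrative assumptions rather than the paper's exact recipe.

import torch

def statistics_pooling(frames: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Pool frame-level features (batch, time, dim) into (batch, 2 * dim)."""
    mean = frames.mean(dim=1)
    std = frames.var(dim=1, unbiased=False).clamp(min=eps).sqrt()
    return torch.cat([mean, std], dim=1)

# Example: pool 300 frames of 512-dimensional frame-level activations.
x = torch.randn(8, 300, 512)
pooled = statistics_pooling(x)               # shape (8, 1024)
pooled = pooled / pooled.shape[1] ** 0.5     # illustrative rescaling of the pooled statistics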

Cited by 5 publications (8 citation statements)
References 22 publications
“…Different from Kaldi, there is no dilation used. This performs better in our experiments and is also suggested in other works [24]. Statistics pooling and a 2-layer segment-level network are appended after the frame-level network.…”
Section: Training Details (supporting, confidence: 56%)
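A minimal sketch, assuming PyTorch, of the structure this quote describes: a frame-level 1-D convolutional (TDNN-style) stack with no dilation, statistics pooling over time, and a 2-layer segment-level network. The layer widths, kernel sizes, and feature dimension are illustrative assumptions, not values taken from the citing paper.

import torch
import torch.nn as nn

class XVectorLike(nn.Module):
    """Frame-level TDNN-style stack (dilation=1), statistics pooling, 2-layer segment network."""

    def __init__(self, feat_dim: int = 30, embed_dim: int = 512):
        super().__init__()
        # Frame-level network: temporal context comes from the kernel sizes, no dilation.
        self.frame_net = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        # Segment-level network: two fully connected layers on the pooled statistics.
        self.segment_net = nn.Sequential(
            nn.Linear(2 * 1500, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim) -> embeddings: (batch, embed_dim)
        h = self.frame_net(x.transpose(1, 2))                    # (batch, 1500, time')
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)  # statistics pooling
        return self.segment_net(stats)

# Example forward pass on a 3-second segment of 30-dimensional features.
emb = XVectorLike()(torch.randn(4, 300, 30))   # shape (4, 512)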
“…The loss converges after the learning rate goes down below 10⁻⁵, resulting in around 2.5M training steps. No dropout is applied in our networks, as described in [24].…”
Section: Training Details (mentioning, confidence: 99%)
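An illustrative sketch, assuming PyTorch, of the training recipe outlined in the quote: the learning rate decays until it falls below 10⁻⁵ and no dropout is used anywhere. The optimizer, decay factor, and per-epoch stepping are hypothetical choices, so the step count will not match the roughly 2.5M steps reported by the citing paper.

import torch

model = torch.nn.Linear(1024, 512)   # stand-in for the embedding network (no dropout layers)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.5)

epoch = 0
while optimizer.param_groups[0]["lr"] >= 1e-5:
    # ... minibatch training steps for one pass over the data would go here ...
    optimizer.step()    # placeholder step so the scheduler is used as intended
    scheduler.step()    # decay the learning rate once per epoch
    epoch += 1
print(f"stopped after {epoch} epochs, lr={optimizer.param_groups[0]['lr']:.1e}")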
“…The kernel sizes of the first five layers are 5, 3, 3, 1 and 1, while the dilation rates are set to 1, 2, 3, 1 and 1 respectively. The same type of L2 weight decay and batch normalization as described in [29] are used in the baseline system to prevent overfitting. ACNN: In this system, the ACNN is only applied in the fourth frame-level layer.…”
Section: Model Configuration (mentioning, confidence: 99%)
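A hedged sketch, again assuming PyTorch, of the baseline frame-level stack described in the quote: kernel sizes 5, 3, 3, 1 and 1 with dilation rates 1, 2, 3, 1 and 1, each convolution followed by batch normalization, and L2 regularization applied through the optimizer's weight decay. The channel widths, input feature dimension, and decay value are assumptions.

import torch.nn as nn
import torch.optim as optim

def conv_block(c_in: int, c_out: int, k: int, d: int) -> nn.Sequential:
    """1-D convolution followed by batch normalization and ReLU."""
    return nn.Sequential(
        nn.Conv1d(c_in, c_out, kernel_size=k, dilation=d),
        nn.BatchNorm1d(c_out),
        nn.ReLU(),
    )

# Kernel sizes 5, 3, 3, 1, 1 with dilation rates 1, 2, 3, 1, 1 as in the quote.
frame_layers = nn.Sequential(
    conv_block(30, 512, k=5, d=1),
    conv_block(512, 512, k=3, d=2),
    conv_block(512, 512, k=3, d=3),
    conv_block(512, 512, k=1, d=1),
    conv_block(512, 1500, k=1, d=1),
)

# L2 weight decay applied through the optimizer, alongside the batch normalization above.
optimizer = optim.SGD(frame_layers.parameters(), lr=0.01, weight_decay=1e-4)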
“…However, a recent work [6] has shown that data augmentation, consisting of added noise and reverberation, can significantly improve the performance of these embeddings (x-vectors, as they are referred to), while it is not so effective for i-vectors [6]. There have also been some efforts to improve the quality and generalization power of x-vectors by modifications applied to the network architecture [10] and the training procedure [11,12,13].…”
Section: Introduction (mentioning, confidence: 99%)
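An illustrative sketch (not the recipe from [6]) of the two augmentations this quote mentions: mixing additive noise into the waveform at a target signal-to-noise ratio and simulating reverberation by convolving with a room impulse response. The function names, SNR value, and synthetic signals are hypothetical.

import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into the speech waveform at the requested signal-to-noise ratio."""
    noise = np.resize(noise, speech.shape)
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

def add_reverb(speech: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Simulate reverberation by convolving with a room impulse response."""
    return np.convolve(speech, rir)[: len(speech)]

# Example on synthetic one-second signals at 16 kHz.
clean = np.random.randn(16000)
noisy = add_noise(clean, np.random.randn(16000), snr_db=10.0)
rir = np.random.randn(4000) * np.exp(-np.arange(4000) / 800.0)   # toy impulse response
reverberant = add_reverb(clean, rir)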