Interspeech 2019
DOI: 10.21437/interspeech.2019-1489
Improving Aggregation and Loss Function for Better Embedding Learning in End-to-End Speaker Verification System

Abstract: Deep embedding learning based speaker verification (SV) methods have recently achieved significant performance improvement over traditional i-vector systems, especially for short-duration utterances. Embedding learning commonly consists of three components: frame-level feature processing, utterance-level embedding learning, and a loss function to discriminate between speakers. For the learned embeddings, a back-end model (i.e., Linear Discriminant Analysis followed by Probabilistic Linear Discriminant Analysis (L…

Cited by 68 publications (59 citation statements) · References 20 publications
“…For all the pooling methods mentioned above, we use only a single-scale feature map from the last layer of the feature extractor. Recently, multi-scale aggregation (MSA) methods have been proposed to exploit speaker information at multiple time scales [22], [23], [48], [49], showing their effectiveness in dealing with variable-duration test utterances.…”
Section: Deep Speaker Embedding Learning
confidence: 99%
“…Even with this robustness, using multi-scale features from multiple layers (Fig. 2(b)), called multi-scale aggregation (MSA), has shown better performance than using single-scale feature maps [22], [23], [48], [49]. Note that, between the frame- and segment-level operations, we should choose the segment-level operation for the MSA because all the feature maps from different layers have the same time scale in the frame-level operation.…”
Section: Multi-scale Aggregation
confidence: 99%
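The multi-scale aggregation described in this excerpt — pooling each layer's feature map at the segment level and combining the results — can be sketched minimally in NumPy. The layer shapes, mean/standard-deviation pooling choice, and concatenation-based fusion below are illustrative assumptions, not the exact design of any cited system:

```python
import numpy as np

def mean_std_pool(frames):
    """Segment-level pooling: collapse the time axis of a (T, D) feature
    map into a fixed-length (2*D,) vector of per-dimension means and
    standard deviations."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

def multi_scale_aggregate(feature_maps):
    """Multi-scale aggregation sketch: pool each layer's feature map
    separately, then concatenate the pooled vectors into one
    utterance-level representation. Layers may differ in both time
    resolution (T) and width (D)."""
    return np.concatenate([mean_std_pool(f) for f in feature_maps])

# Hypothetical feature maps from three layers of a speaker network,
# each of shape (T, D) with a different time scale.
rng = np.random.default_rng(0)
maps = [rng.standard_normal((100, 64)),   # early layer, fine time scale
        rng.standard_normal((50, 128)),   # middle layer
        rng.standard_normal((25, 256))]   # last layer, coarse time scale
emb = multi_scale_aggregate(maps)
print(emb.shape)  # (2*64 + 2*128 + 2*256,) = (896,)
```

Because each layer is pooled before fusion, the aggregated embedding has the same size regardless of utterance duration, which is consistent with the robustness to variable-duration test utterances noted above.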
“…The state-of-the-art text-independent speaker verification systems [1][2][3][4] use deep neural networks (DNNs) to project speech recordings of different lengths into a common low-dimensional embedding space where the speakers' identities are represented. Such a method is called deep embedding, where the embedding networks have three key components: network structure [1,3,[5][6][7], pooling layer [1,[8][9][10][11][12], and loss function [13][14][15][16][17]. This paper focuses on the last part, i.e., the loss functions.…”
Section: Introduction
confidence: 99%
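To make the loss-function component concrete, here is a rough NumPy sketch of one margin-based loss family commonly used for discriminative speaker embeddings (additive-margin softmax over cosine logits). The margin and scale values, dimensions, and class counts are illustrative assumptions, not parameters from the paper:

```python
import numpy as np

def am_softmax_logits(embeddings, weights, margin=0.35, scale=30.0, labels=None):
    """Additive-margin softmax logits: cosine similarities between
    L2-normalised embeddings and class weights, with a margin subtracted
    from the target-class logit before scaling."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos = e @ w                                        # (N, C), values in [-1, 1]
    if labels is not None:
        cos[np.arange(len(labels)), labels] -= margin  # penalise the true class
    return scale * cos

def cross_entropy(logits, labels):
    """Standard cross-entropy over the (margin-adjusted) logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(1)
emb = rng.standard_normal((4, 16))   # 4 utterance embeddings
W = rng.standard_normal((16, 10))    # class weights for 10 speakers
y = np.array([0, 3, 7, 2])
loss_plain = cross_entropy(am_softmax_logits(emb, W, margin=0.0, labels=y), y)
loss_margin = cross_entropy(am_softmax_logits(emb, W, margin=0.35, labels=y), y)
print(loss_margin > loss_plain)  # the margin makes the task harder, so loss grows
```

Subtracting the margin from the target-class cosine forces the network to separate speakers by more than the margin in angular space, which is the mechanism these discriminative losses share.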
“…To address this problem, several studies have applied a pooling layer or temporal average layer to an end-to-end system [2,3]. The second is a speaker embedding-based system [4][5][6][7][8][9][10][11][12][13][14], which uses a DNN to convert a variable-length input into a fixed-length vector. The generated vector is used as an embedding to represent the speaker.…”
Section: Introduction
confidence: 99%
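The temporal average layer mentioned in this excerpt is the simplest way to turn a variable-length sequence of frame-level features into a fixed-length speaker vector. A minimal NumPy sketch, with made-up frame counts and feature dimension:

```python
import numpy as np

def temporal_average(frames):
    """Temporal average pooling: map a variable-length (T, D) sequence of
    frame-level features to a single fixed-length (D,) vector, so
    utterances of any duration yield embeddings of the same size."""
    return frames.mean(axis=0)

rng = np.random.default_rng(2)
short = rng.standard_normal((30, 40))    # 30-frame utterance, 40-dim features
long = rng.standard_normal((300, 40))    # 300-frame utterance, same feature dim
print(temporal_average(short).shape, temporal_average(long).shape)  # (40,) (40,)
```

Both utterances map to a 40-dimensional vector, which is what lets the downstream scoring back-end compare recordings of different durations directly.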