Interspeech 2020
DOI: 10.21437/interspeech.2020-1025
Improving Multi-Scale Aggregation Using Feature Pyramid Module for Robust Speaker Verification of Variable-Duration Utterances

Abstract: Currently, the most widely used approach for speaker verification is deep speaker embedding learning. In this approach, we obtain a speaker embedding vector by pooling single-scale features extracted from the last layer of a speaker feature extractor. Multi-scale aggregation (MSA), which utilizes multi-scale features from different layers of the feature extractor, has recently been introduced and shows superior performance for variable-duration utterances. To increase the robustness dealing with ut…
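The abstract describes aggregating multi-scale features from different extractor layers via a feature pyramid module. The following is a minimal, hypothetical numpy sketch of that idea, not the paper's actual implementation: lateral projections stand in for 1x1 convolutions, a top-down pass upsamples deeper (coarser-in-time) maps and fuses them with shallower ones, and each refined scale is average-pooled and concatenated into one embedding. All names and shapes here are illustrative assumptions.

```python
import numpy as np

def feature_pyramid_msa(feature_maps, common_dim=8, rng=None):
    """Sketch of feature-pyramid-style multi-scale aggregation (MSA).

    feature_maps: list of (C_l, T_l) arrays ordered shallow -> deep,
    where deeper maps have coarser (smaller) time resolution.
    Returns a single utterance-level vector.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    # Lateral projections (standing in for 1x1 convolutions) map every
    # layer's channel count to a common dimension.
    laterals = [rng.standard_normal((common_dim, f.shape[0])) / f.shape[0]
                for f in feature_maps]
    projected = [W @ f for W, f in zip(laterals, feature_maps)]

    # Top-down pathway: upsample the deeper map in time (nearest
    # neighbour) and add it to the next shallower map, FPN-style.
    for i in range(len(projected) - 2, -1, -1):
        deep, shallow = projected[i + 1], projected[i]
        factor = shallow.shape[1] // deep.shape[1]
        projected[i] = shallow + np.repeat(deep, factor, axis=1)

    # Aggregate each refined scale by temporal average pooling and
    # concatenate the per-scale vectors into one embedding.
    return np.concatenate([p.mean(axis=1) for p in projected])
```

For three layers with, say, 32/64/128 channels and time lengths 16/8/4, the output is a single vector of length `3 * common_dim`.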


Cited by 32 publications (7 citation statements) · References 25 publications
“…In Equation 17, C_FR is the cost of missed detection; C_FA is the cost of a spurious detection; P_Target is a prior probability of the specified target speaker. In this paper, we set C_FR = C_FA = 1 and P_Target = 0.01, which is commonly used in relevant experiments [27,23].…”
Section: Evaluation Metric
confidence: 99%
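The statement above describes the standard detection cost function (DCF) used to evaluate speaker verification. A minimal sketch of the minimum DCF under the stated costs (C_FR = C_FA = 1, P_Target = 0.01) might look as follows; the function name and threshold sweep are illustrative, not taken from the cited paper.

```python
import numpy as np

def min_dcf(target_scores, nontarget_scores,
            p_target=0.01, c_fr=1.0, c_fa=1.0):
    """Minimum detection cost over all decision thresholds.

    DCF(t) = c_fr * P_miss(t) * p_target
           + c_fa * P_fa(t) * (1 - p_target)
    """
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    best = np.inf
    for t in thresholds:
        p_miss = np.mean(target_scores < t)   # missed detections
        p_fa = np.mean(nontarget_scores >= t) # spurious detections
        dcf = c_fr * p_miss * p_target + c_fa * p_fa * (1 - p_target)
        best = min(best, dcf)
    return best
```

With perfectly separable scores the minimum DCF is 0; overlapping score distributions yield a positive cost.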
“…strided convolution), some temporal information is lost due to the reduced time dimensions. To address this problem, multi-scale aggregation methods [21,22,23] have gradually been introduced into CNN models. However, these methods usually perform global average pooling (GAP) to aggregate features, leading to a loss of speaker-characteristic information in both the time and frequency dimensions.…”
Section: Introduction
confidence: 99%
“…Several modifications of the backbone architecture have been made to improve performance. These modifications include adding channel attention [8], replacing the custom convolution with a multi-scale convolution [9,10], and aggregating multi-layer or multi-stage features [4,11]. However, all of these methods focus only on improving single-branch structures and neglect multi-branch ways of designing neural networks.…”
Section: Introduction
confidence: 99%
“…In SV, speaker representations are derived by first extracting frame-level features and then aggregating them. The network extracting the frame-level features is referred to as the trunk network (e.g., a convolutional neural network (CNN) or x-vector [1][2][3][4][5][6]). After extracting the frame-level features, various techniques, including gated recurrent units (GRU) and learnable dictionary encoding (LDE), are used on top of the trunk network to aggregate frame-level features into a single utterance-level feature [7][8][9][10][11][12][13][14][15][16][17][18].…”
Section: Introduction
confidence: 99%
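The frame-level-to-utterance-level aggregation this citing paper describes is commonly realized with statistics pooling, as in x-vector-style systems: the mean and standard deviation over time are concatenated into one fixed-size vector regardless of utterance length. A minimal sketch, with illustrative shapes:

```python
import numpy as np

def statistics_pooling(frames):
    """Aggregate (T, D) frame-level features into a single (2D,)
    utterance-level vector by concatenating the mean and the
    standard deviation over the time axis."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

# A 200-frame utterance with 64-dimensional frame features maps to a
# fixed 128-dimensional vector; a 50-frame utterance would map to the
# same size, which is what makes variable-duration inputs comparable.
frames = np.random.default_rng(0).standard_normal((200, 64))
embedding = statistics_pooling(frames)
```

GRU- or LDE-based aggregators replace this simple pooling with learned, order- or cluster-aware summaries, but the input/output contract is the same.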