2022
DOI: 10.48550/arxiv.2204.01005
Preprint

Frequency and Multi-Scale Selective Kernel Attention for Speaker Verification

Abstract: Recent state-of-the-art speaker verification architectures adopt multi-scale processing and frequency-channel attention techniques. However, their full potential may not have been exploited because these techniques' receptive fields are fixed, with most convolutional layers operating at specified kernel sizes such as 1, 3 or 5. We aim to further improve this line of research by introducing a selective kernel attention (SKA) mechanism. The SKA mechanism allows each convolutional layer to adaptively select the kernel size in a data-driven fashion…
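The selective-kernel idea in the abstract can be sketched in a few lines: run parallel convolutions with different kernel sizes and let a small attention network choose among them per channel. The PyTorch module below is a hedged illustration loosely following SKNet-style selection; the class name, branch count, and bottleneck sizes are illustrative assumptions, not the paper's exact SKA block.

```python
# Minimal selective-kernel attention sketch (PyTorch). Parallel convolutions
# with different kernel sizes are fused by learned, data-driven attention
# weights. Names and hyper-parameters are illustrative, not the paper's SKA.
import torch
import torch.nn as nn

class SelectiveKernelConv1d(nn.Module):
    def __init__(self, channels: int, kernel_sizes=(1, 3, 5), reduction: int = 4):
        super().__init__()
        # One branch per candidate kernel size (receptive field).
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, k, padding=k // 2) for k in kernel_sizes
        )
        hidden = max(channels // reduction, 8)
        # Bottleneck producing one attention logit per (branch, channel) pair.
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels * len(kernel_sizes)),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        feats = torch.stack([b(x) for b in self.branches], dim=1)  # (B, K, C, T)
        summary = feats.sum(dim=1).mean(dim=-1)                    # (B, C) global context
        logits = self.fc(summary).view(x.size(0), len(self.branches), -1)
        attn = logits.softmax(dim=1).unsqueeze(-1)                 # softmax over kernels
        return (attn * feats).sum(dim=1)                           # (B, C, T)

x = torch.randn(2, 64, 200)           # e.g. 64-channel features over 200 frames
y = SelectiveKernelConv1d(64)(x)
print(y.shape)                        # torch.Size([2, 64, 200])
```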

Cited by 2 publications (2 citation statements)
References 20 publications
“…In addition, based on the SE-Res2Block structure, the output of the previous layer can be used as a skip connection to exploit multi-layer information. As mentioned in [54], considering that shallower feature maps can yield more robust speaker embeddings, we extracted and used a 192-dimensional x-vector as our speaker embedding, a departure from the commonly employed 256- or 512-dimensional embeddings. The VoxCeleb2 dataset was used for training, and AdamW was used as the optimizer.…”
Section: Speaker Encoder (mentioning)
Confidence: 99%
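The skip-connection pattern this statement describes, with each block consuming the previous block's output and multi-layer information aggregated into a 192-dimensional embedding, can be sketched as follows. This is a hedged illustration: the SE-Res2Block internals are replaced by plain Conv1d stand-ins, and all names and sizes (SpeakerEncoderSketch, channel widths, the stats-pooling step) are assumptions rather than the citing paper's architecture.

```python
# Hedged sketch of multi-layer (skip-connection) aggregation: each block takes
# the previous block's output, all block outputs are concatenated, and a final
# projection yields a 192-dimensional speaker embedding. SE-Res2Block internals
# are replaced by plain Conv1d stand-ins; names and sizes are illustrative.
import torch
import torch.nn as nn

class SpeakerEncoderSketch(nn.Module):
    def __init__(self, feat_dim=80, channels=512, emb_dim=192, num_blocks=3):
        super().__init__()
        self.stem = nn.Conv1d(feat_dim, channels, kernel_size=5, padding=2)
        # Stand-ins for SE-Res2Blocks; the residual add is the skip connection.
        self.blocks = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1)
            for _ in range(num_blocks)
        )
        # Concatenate every block's output so shallow features reach the embedding.
        self.proj = nn.Linear(channels * num_blocks * 2, emb_dim)  # *2: mean+std pooling

    def forward(self, x):
        h = torch.relu(self.stem(x))
        outs = []
        for block in self.blocks:
            h = torch.relu(block(h)) + h   # previous layer's output as skip
            outs.append(h)
        cat = torch.cat(outs, dim=1)                       # (B, C*num_blocks, T)
        stats = torch.cat([cat.mean(-1), cat.std(-1)], 1)  # simple stats pooling
        return self.proj(stats)                            # (B, 192)

enc = SpeakerEncoderSketch()
emb = enc(torch.randn(4, 80, 300))                   # 80-dim filterbanks, 300 frames
opt = torch.optim.AdamW(enc.parameters(), lr=1e-3)   # AdamW, as in the statement
print(emb.shape)                                     # torch.Size([4, 192])
```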
“…To force the speaker embeddings to discriminate their speaker labels, we adopt the combination of the additive angular margin (AAM) softmax [32] and the angular prototypical (AP) loss [33], which has shown great performance in this field [4], [34]. Given the pairs of speaker embeddings and labels $\{(x_i^s, y_i^s)\}_{i=1}^{N}$, the speaker classification loss function is formulated as follows: …”
Section: Speaker Classifier C_S (mentioning)
Confidence: 99%
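Since the snippet's equation is cut off, the sketch below only illustrates the two named terms, AAM softmax and angular prototypical loss, combined as a plain sum; that sum, the margin m, the scale s, and the learnable AP scale/bias are common-practice assumptions, not the citing paper's exact formulation.

```python
# Hedged sketch of the loss combination named in the snippet: additive angular
# margin (AAM) softmax plus angular prototypical (AP) loss. The plain sum is
# an assumption; m, s, and the AP scale/bias follow common practice.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    def __init__(self, emb_dim=192, num_classes=5994, m=0.2, s=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, emb_dim))
        self.m, self.s = m, s

    def forward(self, x, labels):
        cos = F.linear(F.normalize(x), F.normalize(self.weight))  # cosine logits
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin only on the target class.
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cos)
        return F.cross_entropy(self.s * logits, labels)

def angular_prototypical(q, a, w, b):
    # q, a: (num_speakers, emb_dim) query/anchor embeddings, one pair per speaker.
    sim = w * F.cosine_similarity(q.unsqueeze(1), a.unsqueeze(0), dim=-1) + b
    return F.cross_entropy(sim, torch.arange(q.size(0), device=q.device))

emb_q, emb_a = torch.randn(8, 192), torch.randn(8, 192)   # 8 speakers, 2 utts each
labels = torch.arange(8)
aam = AAMSoftmax(num_classes=5994)   # 5994 = VoxCeleb2 dev speakers, illustrative
w = torch.tensor(10.0, requires_grad=True)   # learnable AP scale (assumed init)
b = torch.tensor(-5.0, requires_grad=True)   # learnable AP bias (assumed init)
loss = aam(emb_q, labels) + angular_prototypical(emb_q, emb_a, w, b)
print(loss.item())
```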