Interspeech 2020
DOI: 10.21437/interspeech.2020-1446

Self-Attention Encoding and Pooling for Speaker Recognition

Cited by 55 publications
(35 citation statements)
“…Safar et al [20] proposed a self-attention pooling layer and showed it is good at extracting the time-invariant features information. We utilize the self-attention pooling layer to extract the representation from the target encoder.…”
Section: Self-attention Pooling | Citation type: mentioning
confidence: 99%
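
The pooling idea mentioned in this statement can be illustrated with a minimal sketch. It assumes the common formulation in which each frame is scored against a single learned vector, the scores are normalized with a softmax over time, and the frames are averaged with those weights. The function name `self_attention_pool` and the vector `u` are hypothetical, and the exact layer used in the cited papers may differ.

```python
import math


def self_attention_pool(frames, u):
    """Collapse a T x d sequence of frames into one d-dimensional vector.

    frames: list of T frame vectors, each a list of d floats.
    u: scoring vector of length d (learned in a real model; fixed here).
    Returns (pooled_vector, attention_weights).
    """
    # Score each frame by its dot product with the scoring vector u.
    scores = [sum(h_i * u_i for h_i, u_i in zip(h, u)) for h in frames]

    # Softmax over the time axis, shifted by the max score for stability.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the frames over time gives a fixed-size representation.
    dim = len(frames[0])
    pooled = [sum(w * h[j] for w, h in zip(weights, frames)) for j in range(dim)]
    return pooled, weights
```

Because the output size depends only on the feature dimension and not on the number of frames, utterances of different lengths map to vectors of the same size. With a zero scoring vector the weights are uniform and the result reduces to plain average pooling.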
“…To design the lightweight models, Nunes et al [41] proposed a portable model called additive margin MobileNet1D (AM-MobileNet1D) for speaker identification on mobile devices, which uses raw waveform of speeches as input. Safari et al [42] presented a deep speaker embedding architecture based on a self-attention encoding and pooling (SAEP) mechanism, which outperforms x-vector [5] with less parameters. In this paper, we construct the SV model via two specific lightweight techniques: depthwise separable convolution for reducing the parameters of convolutional layers and low-rank matrix factorization to decreasing the parameters of fully connected layers.…”
Section: Lightweight Architectures for TI-SV | Citation type: mentioning
confidence: 99%
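
The two parameter-reduction techniques named in this statement can be compared with simple arithmetic. The sketch below uses the standard parameter-count formulas for a 2-D convolution, a depthwise separable convolution, a fully connected layer, and a rank-r factorization of that layer. Biases are ignored, the function names are hypothetical, and the layer sizes in the usage note are illustrative rather than taken from the citing paper.

```python
def conv2d_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 conv that mixes channels
    return depthwise + pointwise


def dense_params(n_in, n_out):
    """Weights in a fully connected layer (bias ignored)."""
    return n_in * n_out


def low_rank_params(n_in, n_out, rank):
    """Weights when the n_in x n_out matrix is replaced by two factors
    of shapes n_in x rank and rank x n_out."""
    return rank * (n_in + n_out)
```

For example, with a 3 x 3 kernel, 64 input channels, and 128 output channels, the standard convolution has 73,728 weights and the depthwise separable version has 8,768. A 512 x 512 fully connected layer has 262,144 weights, while a rank-64 factorization of it has 65,536.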
“…Lately, there have been several architectures proposed to encode audio utterances into speaker embeddings for different choices of network inputs, such as [5,6,7,8,9]. Using Mel-Frequency Cepstral Coefficient (MFCC) features, Time Delay Neural Network (TDNN) [5,6] is the most currently used architecture.…”
[Footnote in the citing paper: This work was supported by the Spanish project PID2019-107579RB-I00 / AEI / 10.13039/501100011033.]
Section: Introduction | Citation type: mentioning
confidence: 99%
“…2-D CNNs have also shown competitive results for speaker verification. There are Computer Vision architectures such as VGG [10,7,11,9] and ResNet [8,12,13] that have been adapted to capture speaker discriminative information from the Mel-Spectrograms. In fact, Resnet34 has shown a better performance than TDNN in the most recent speaker verification challenges [14,15].…”
Section: Introduction | Citation type: mentioning
confidence: 99%