Interspeech 2021
DOI: 10.21437/interspeech.2021-2210
Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding

Cited by 14 publications (3 citation statements). References: 0 publications.
“…A typical speaker verification process involves two stages: first, a few utterances of the speaker are enrolled; then, the identity information extracted from the test utterance is compared with that of the enrolled utterances for verification [1]. ASV researchers have developed speaker embedding extraction methods [2,3,4] to encode speaker identity information for verification. However, the test utterance may not be natural human speech but a spoofing attack designed to deceive the ASV system.…”
Section: Introduction (mentioning)
confidence: 99%
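As a rough illustration of the enroll-then-verify flow described in the statement above, the sketch below averages the embeddings of a speaker's enrollment utterances and scores a test utterance against them with cosine similarity. The `extract_embedding` callable, the averaging strategy, and the threshold are assumptions for illustration, not details taken from the cited papers.

```python
# Minimal sketch of the two-stage verification process described above.
# `extract_embedding` stands in for any speaker-embedding extractor [2,3,4];
# the decision threshold is illustrative only.
import numpy as np

def enroll(utterances, extract_embedding):
    """Average the embeddings of a speaker's enrollment utterances."""
    embs = np.stack([extract_embedding(u) for u in utterances])
    return embs.mean(axis=0)

def verify(test_utterance, enrolled_embedding, extract_embedding, threshold=0.5):
    """Accept if cosine similarity to the enrolled model exceeds the threshold."""
    test_emb = extract_embedding(test_utterance)
    score = float(np.dot(test_emb, enrolled_embedding) / (
        np.linalg.norm(test_emb) * np.linalg.norm(enrolled_embedding) + 1e-8
    ))
    return score, score > threshold
```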
“…Moreover, introducing attention-based modules has recently become popular in speaker verification. Works such as attentive statistical pooling [5], cas pooling [4], self-attentive speaker embedding [6], and serialized multi-layer multi-head attention [7] show that the attention mechanism is effective for aggregating frame-level features.…”
Section: Introduction (mentioning)
confidence: 99%
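To make the idea of attention-based aggregation of frame-level features concrete, here is a minimal sketch in the spirit of attentive statistics pooling [5]: a small network scores each frame, and the scores weight the mean and standard deviation that form the utterance-level representation. Layer sizes and names are assumptions, not taken from the cited works.

```python
# Minimal sketch of attention-based pooling over frame-level features,
# in the spirit of attentive statistics pooling [5]; dimensions are assumptions.
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    def __init__(self, feat_dim=256, bottleneck=128):
        super().__init__()
        # Produces one scalar attention score per frame.
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, bottleneck),
            nn.Tanh(),
            nn.Linear(bottleneck, 1),
        )

    def forward(self, x):                                   # x: (batch, frames, feat_dim)
        w = torch.softmax(self.attention(x), dim=1)         # (batch, frames, 1)
        mean = torch.sum(w * x, dim=1)                      # attention-weighted mean
        var = torch.sum(w * x ** 2, dim=1) - mean ** 2      # attention-weighted variance
        std = torch.sqrt(var.clamp(min=1e-8))
        return torch.cat([mean, std], dim=1)                # (batch, 2 * feat_dim)
```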
“…The Transformer, first proposed in [8], is also an attention-based structure and has achieved great success in various areas, including natural language processing [9] and computer vision [10,11,12]. Inspired by this, we propose a pooling transformer, PoFormer, which introduces a transformer into the pooling layer of our speaker verification network to strengthen its ability to capture information across the whole time domain. Unlike [7], which re-designs the inner structure of the attention module and implements the multi-head mechanism serially, we strictly follow [8], where the different heads operate in parallel, yielding a simple but effective pooling transformer.…”
Section: Introduction (mentioning)
confidence: 99%
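The contrast drawn above is between serialized heads [7] and the standard parallel heads of [8]. The sketch below pools frame-level features with an ordinary parallel multi-head transformer encoder layer followed by temporal averaging; it is not the authors' PoFormer implementation, and the hyperparameters and the averaging readout are assumptions for illustration.

```python
# Rough sketch of transformer-style pooling with parallel heads as in [8];
# illustrative only, not the authors' PoFormer implementation.
import torch
import torch.nn as nn

class TransformerPooling(nn.Module):
    def __init__(self, feat_dim=256, num_heads=4, ff_dim=1024):
        super().__init__()
        # All attention heads run in parallel inside the standard encoder layer.
        self.encoder = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads,
            dim_feedforward=ff_dim, batch_first=True,
        )

    def forward(self, x):            # x: (batch, frames, feat_dim)
        h = self.encoder(x)          # self-attention across the whole time axis
        return h.mean(dim=1)         # utterance-level embedding
```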