ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054448
H-Vectors: Utterance-Level Speaker Embedding Using a Hierarchical Attention Model

Abstract: In this paper, a hierarchical attention network to generate utterance-level embeddings (H-vectors) for speaker identification is proposed. Since different parts of an utterance may contribute differently to speaker identity, the hierarchical structure aims to learn speaker-related information both locally and globally. In the proposed approach, a frame-level encoder and attention are applied to segments of an input utterance to generate individual segment vectors. Then, segment-level attention is appl…
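The two-level pooling described in the abstract can be illustrated as follows. This is a minimal NumPy sketch under stated assumptions: the names `attention_pool` and `h_vector`, the single-vector dot-product attention scorer, and the fixed segment length are illustrative simplifications, not the paper's exact architecture (which uses learned encoders and MLP-based attention at each level).

```python
import numpy as np

def attention_pool(frames, w):
    """Weighted average of vectors using a soft attention score.

    frames: (T, D) array of frame- or segment-level features.
    w:      (D,) attention parameter (a stand-in for a learned scorer).
    """
    scores = frames @ w                      # (T,) unnormalised attention scores
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ frames                  # (D,) pooled vector

def h_vector(utterance, seg_len, w_frame, w_seg):
    """Two-level (hierarchical) attention pooling over one utterance.

    The utterance is split into segments of seg_len frames; each segment is
    pooled with frame-level attention, and the resulting segment vectors are
    pooled with segment-level attention into one utterance-level embedding.
    """
    segs = [utterance[i:i + seg_len] for i in range(0, len(utterance), seg_len)]
    seg_vecs = np.stack([attention_pool(s, w_frame) for s in segs])  # (S, D)
    return attention_pool(seg_vecs, w_seg)                           # (D,)
```

In the actual model, frame- and segment-level encoders transform the features before each attention stage; the sketch keeps only the hierarchical pooling structure.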

Cited by 14 publications (11 citation statements)
References 18 publications
“…XI. For completeness we also mention h-vectors [152] which use a hierarchical attention mechanism to produce utterance-level embeddings, but has only been applied to speaker recognition tasks.…”
Section: Neural Network Embeddings
confidence: 99%
“…In addition to feed-forward deep neural networks (DNNs) (Hinton et al, 2012), recurrent and convolutional models have also been applied to extract d-vectors at frame-level (Variani et al, 2014;Yella and Stolcke, 2015;Heigold et al, 2016;Cyrta et al, 2017;Wang et al, 2018b). To convert a variable length segment into a fixed-length vector using frame-level d-vectors, a temporal pooling function, such as the mean and standard deviation (Garcia-Romero et al, 2017;Diez et al, 2019;Wang et al, 2018c), attention mechanisms (Chowdhury et al, 2018;Zhu et al, 2018;Sun et al, 2019;Shi et al, 2020), and their combination (Okabe et al, 2018), have been used, which also enables joint training over entire segments.…”
Section: Related Work
confidence: 99%
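The mean-and-standard-deviation temporal pooling mentioned in the statement above can be sketched in a few lines. This is a minimal NumPy illustration; the name `stats_pool` is hypothetical, and real systems apply it to frame-level network activations rather than raw features.

```python
import numpy as np

def stats_pool(frames):
    """Temporal statistics pooling: concatenate per-dimension mean and std.

    frames: (T, D) frame-level vectors -> (2*D,) fixed-length segment vector,
    so utterances of any length map to the same embedding size.
    """
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])
```

Attention-based pooling replaces the uniform average here with learned frame weights; attentive statistics pooling (Okabe et al., 2018) combines both, computing a weighted mean and weighted standard deviation.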
“…Four strong baselines are chosen to compare with the proposed model: X-vectors [17], Attentive X-vector (Att-Xvector) [18,19,2,20], H-vector [6,21] and S-vector [22]. X-vectors is TDNN based model, which contains a TDNN based frame-level feature extractor, a statistics pooling operation and a segment-level feature extractor.…”
Section: Experiments Setup
confidence: 99%