2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2017.7953194

TristouNet: Triplet loss for speaker turn embedding

Abstract: TristouNet is a neural network architecture based on Long Short-Term Memory recurrent networks, meant to project speech sequences into a fixed-dimensional Euclidean space. Thanks to the triplet loss paradigm used for training, the resulting sequence embeddings can be compared directly with the Euclidean distance, for speaker comparison purposes. Experiments on short (between 500 ms and 5 s) speech turn comparison and speaker change detection show that TristouNet brings significant improvements over the current state-of-the-art techniques.
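
As a rough illustration of the architecture described in the abstract, the following PyTorch sketch shows a TristouNet-style embedding network. It is a minimal sketch, not the authors' implementation: the feature dimensionality, hidden size, average pooling, and single fully connected layer are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechEmbedding(nn.Module):
    # Maps a variable-length feature sequence (batch, time, n_features),
    # e.g. MFCC frames, to a single unit-norm vector, so that two
    # sequences can be compared with plain Euclidean distance.
    # All sizes below are illustrative assumptions, not the paper's values.
    def __init__(self, n_features=35, hidden=16, emb_dim=16):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden,
                            bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, emb_dim)

    def forward(self, x):
        out, _ = self.lstm(x)                # (batch, time, 2 * hidden)
        pooled = out.mean(dim=1)             # temporal average pooling
        emb = torch.tanh(self.fc(pooled))    # fixed-dimensional embedding
        return F.normalize(emb, p=2, dim=1)  # unit L2 norm

Because the output is L2-normalized, the Euclidean distance between two embeddings can be used directly as a same/different-speaker score.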

Cited by 162 publications (165 citation statements). References 13 publications.
“…In this paper, we only consider the development set to compare with state-of-the-art methods. Specifically, we use settings similar to those of the "same/different" audio experiments in [5]. The models learned on REPERE will be applied to ETAPE to benchmark their generalization ability.…”
Section: Methods (mentioning; confidence: 99%)
“…We compare our speaker turn embedding with 3 approaches: Bayesian Information Criterion (BIC) [8], Gaussian divergence (Div.) [2], and the original TristouNet [5].…”
Section: Implementation Details (mentioning; confidence: 99%)
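
For context, the BIC baseline [8] cited in this excerpt decides whether two adjacent feature segments are better modeled by a single full-covariance Gaussian or by one Gaussian per segment; a positive ΔBIC suggests a speaker change at the boundary. A textbook sketch follows (the penalty weight lam is the usual tunable hyperparameter, with no value taken from this paper):

import numpy as np

def delta_bic(X1, X2, lam=1.0):
    # X1, X2: (frames, dims) feature matrices of two adjacent segments.
    # Compares one Gaussian over the union against one per segment;
    # positive values favor the two-speaker (change) hypothesis.
    X = np.vstack([X1, X2])
    n1, n2, n = len(X1), len(X2), len(X)
    d = X.shape[1]
    logdet = lambda Z: np.linalg.slogdet(np.cov(Z, rowvar=False))[1]
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * logdet(X) - n1 * logdet(X1) - n2 * logdet(X2)) - penalty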
“…First, it can be obtained as a by-product of the speaker recognition task, using the activations of the last layer before classification [4,6,7]. Second, it can be learned directly by optimizing loss functions that constrain the distances between same-speaker and different-speaker utterance pairs [3,8,9]. Among the distance-based losses, the triplet loss has become increasingly widely used in deep embedding networks [2,3,8].…”
Section: Introduction (mentioning; confidence: 99%)
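
The triplet loss referred to in this excerpt enforces, for an anchor a, a same-speaker positive p, and a different-speaker negative n, that d(a, p) + margin <= d(a, n), hinged at zero. A minimal sketch, assuming an illustrative margin value:

import torch

def triplet_loss(anchor, positive, negative, margin=0.2):
    # anchor/positive come from the same speaker;
    # anchor/negative come from different speakers.
    d_ap = torch.norm(anchor - positive, p=2, dim=1)  # pull together
    d_an = torch.norm(anchor - negative, p=2, dim=1)  # push apart
    return torch.clamp(d_ap - d_an + margin, min=0).mean()

PyTorch's built-in nn.TripletMarginLoss implements the same hinge, so in practice the hand-rolled version above is rarely needed.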