2021
DOI: 10.1016/j.neunet.2021.04.020

Combination of deep speaker embeddings for diarisation


Cited by 11 publications (8 citation statements)
References 91 publications
“…A pooling layer is often used to convert the frame-level speaker representations into a fixed-length vector for several seconds of audio. Commonly used pooling layers include the means and standard deviations [24] and those with attention mechanisms [1,25]. Rather than using the standard cross-entropy loss with a softmax activation, additional loss functions have been proposed to better discriminate between different speaker classes, such as the angular softmax loss [26]. The additive angular margin loss [27] is used for all SC training in this paper.…”
Section: SC for Speaker Embedding Extraction and Clustering
confidence: 99%
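A minimal numpy sketch of the pooling and loss choices quoted above may help: statistics (mean + standard deviation) pooling [24], a simple attentive pooling variant [1,25], and additive-angular-margin logits [27]. All array shapes, the attention parameterisation, and the margin/scale values are illustrative assumptions, not the cited systems' code.

```python
import numpy as np

def stats_pooling(frames):
    """Mean + standard-deviation pooling over time [24].
    frames: (T, D) frame-level representations -> (2*D,) segment vector."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

def attentive_pooling(frames, w):
    """Attention-weighted mean pooling [1,25].
    w: (D,) attention parameter (a simplified, assumed parameterisation)."""
    scores = frames @ w                     # (T,) one score per frame
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                    # softmax over frames
    return alpha @ frames                   # weighted mean, shape (D,)

def aam_logits(emb, class_weights, label, margin=0.2, scale=30.0):
    """Additive angular margin logits [27]: add the margin to the angle
    between the embedding and its target class weight. The margin and
    scale values are common choices, not values from the paper."""
    emb = emb / np.linalg.norm(emb)
    class_weights = class_weights / np.linalg.norm(class_weights, axis=1,
                                                   keepdims=True)
    cos = class_weights @ emb               # (C,) cosine per speaker class
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    logits = cos.copy()
    logits[label] = np.cos(theta[label] + margin)  # penalise target class
    return scale * logits
```

Both pooling functions map a variable-length (T, D) input to a fixed-length vector, which is what allows a segment-level speaker classifier to be trained on top of frame-level representations.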
“…The data preparation of the AMI corpus for ASR follows the pipeline provided in ESPnet [31]. The same references for speaker diarisation from [1] are used, where the silences in the manual segmentation from the official AMI release were stripped using a forced alignment with the reference transcriptions and a pre-existing speech recognition system [32]. These alignments were also used to generate the speech/non-speech labels for training the VAD.…”
Section: Experimental Setup 4.1 Data
confidence: 99%
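The step of turning forced alignments into VAD training targets can be sketched as below. This is a hypothetical helper illustrating the idea quoted above; the segment format and the 10 ms frame shift are assumptions, not details from the cited setup.

```python
import numpy as np

def vad_labels_from_alignment(word_segments, num_frames, frame_shift=0.01):
    """Convert forced-alignment word intervals into frame-level
    speech/non-speech labels for VAD training.

    word_segments: list of (start_sec, end_sec) aligned word intervals.
    Returns a (num_frames,) array: 1 for speech frames, 0 for silence."""
    labels = np.zeros(num_frames, dtype=np.int64)
    for start, end in word_segments:
        lo = int(np.floor(start / frame_shift))
        hi = min(int(np.ceil(end / frame_shift)), num_frames)
        labels[lo:hi] = 1
    return labels

# Example: two aligned words; the gap between them stays labelled 0.
print(vad_labels_from_alignment([(0.00, 0.30), (0.50, 0.80)], num_frames=100))
```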
“…Multi-party interactions such as meetings and conversations are among the most important scenarios for many speech and language applications [1]. Speaker change detection (SCD), the task of finding the time points at which a new speaker starts to speak, is critical for such applications and has received increasing attention in recent years [2]-[5].…”
Section: Introduction
confidence: 99%
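To make the SCD task definition concrete, the following toy sketch extracts change points from a frame-level speaker label sequence; it is an illustration of the task, not a method from the paper, and the 10 ms frame shift is an assumption.

```python
def change_points(frame_speakers, frame_shift=0.01):
    """Return the times at which the active speaker differs from the
    previous frame, i.e. the speaker change points."""
    return [i * frame_shift
            for i in range(1, len(frame_speakers))
            if frame_speakers[i] != frame_speakers[i - 1]]

# Example: speaker A for 3 frames, then B for 2 frames, then A again;
# changes occur at frames 3 and 5 (0.03 s and 0.05 s).
print(change_points(["A", "A", "A", "B", "B", "A"]))
```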
“…Given that there is empirically no single acoustic feature or model that stands out across various test scenarios, another approach to improving speaker recognition performance is to combine complementary acoustic features or models. Sun et al. [17] proposed a c-vector method that combines multiple sets of complementary d-vectors derived from systems with different neural network components, including 2-dimensional self-attentive, gated additive, and bilinear pooling structures. Language recognition shares a quite similar research trend with speaker recognition.…”
confidence: 99%
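One of the combination components named above, the gated additive structure, can be sketched in simplified form: each system's d-vector receives a gate weight and the embeddings are summed. This is a rough illustration in the spirit of the c-vector combination of [17], not its actual architecture; the gating parameterisation and array shapes are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gated_additive_combination(d_vectors, gate_w):
    """Combine complementary d-vectors into one embedding with gated
    additive weighting (simplified; gate_w is an assumed parameter).

    d_vectors: (K, D) embeddings from K complementary systems.
    gate_w:    (D,) gating parameter scoring each system's d-vector."""
    gates = softmax(d_vectors @ gate_w)     # (K,) one weight per system
    return gates @ d_vectors                # weighted sum, shape (D,)

# Example: combine three 4-dimensional d-vectors.
rng = np.random.default_rng(0)
d_vecs = rng.standard_normal((3, 4))
print(gated_additive_combination(d_vecs, rng.standard_normal(4)))
```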