Interspeech 2021
DOI: 10.21437/interspeech.2021-941

ECAPA-TDNN Embeddings for Speaker Diarization

Abstract: Learning robust speaker embeddings is a crucial step in speaker diarization. Deep neural networks can accurately capture speaker discriminative characteristics and popular deep embeddings such as x-vectors are nowadays a fundamental component of modern diarization systems. Recently, some improvements over the standard TDNN architecture used for x-vectors have been proposed. The ECAPA-TDNN model, for instance, has shown impressive performance in the speaker verification domain, thanks to a carefully designed ne…

Cited by 45 publications
(20 citation statements)
References 30 publications
“…(2) Convolutional layers to extract emotion embeddings and SVM to classify embeddings into emotion classes (termed as DNN-SVM). Our selection of these is inspired from the success of embeddings based networks [12,82,84] and fully DNN-based frameworks [40] in speech processing. Performance evaluation over these also enables us to compare the SER efficiency of the two DNN frameworks.…”
Section: Classifier Description (mentioning)
confidence: 99%
“…Our choice of these architectures was inspired by the success of techniques such as 1D and 2D convolutions, LSTM, attention mechanism, squeeze and excitation module, Res2Net module, etc. in different speech processing domains [15,29,65-68].…”
Section: Neural Network Architectures (mentioning)
confidence: 99%
“…These include using multiple 1D Res2Net modules, squeeze and excitation blocks, and channel-dependent time-frame attention. The ECAPA-TDNN architecture is found useful for speaker recognition and speaker diarization tasks [67]. In this work, we used the implementation of ECAPA-TDNN provided in SpeechBrain Python toolkit without any change in parameter configuration.…”
Section: ECAPA-TDNN (mentioning)
confidence: 99%
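
For readers unfamiliar with the squeeze and excitation blocks named in the statement above, the following is a minimal, generic sketch of the idea on a channels-by-time feature map. It is not the SpeechBrain implementation or the exact block used in ECAPA-TDNN; the function name `squeeze_excite` and the weight arguments `w1`, `b1`, `w2`, `b2` are hypothetical, chosen only for illustration.

```python
import math


def squeeze_excite(features, w1, b1, w2, b2):
    """Rescale each channel of `features` by a learned gate in (0, 1).

    features: list of C channels, each a list of T frame values.
    w1, b1:   bottleneck layer (H rows of C weights, H biases), hypothetical.
    w2, b2:   expansion layer (C rows of H weights, C biases), hypothetical.
    """
    # Squeeze: summarise every channel by its mean over time.
    pooled = [sum(channel) / len(channel) for channel in features]

    # Excitation, step 1: bottleneck projection followed by ReLU.
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, pooled)) + bias)
        for row, bias in zip(w1, b1)
    ]

    # Excitation, step 2: project back to C channels and apply a sigmoid gate.
    gates = [
        1.0 / (1.0 + math.exp(-(sum(w * h for w, h in zip(row, hidden)) + bias)))
        for row, bias in zip(w2, b2)
    ]

    # Rescale: multiply every frame of a channel by that channel's gate.
    return [[gate * value for value in channel]
            for gate, channel in zip(gates, features)]
```

With all weights and biases set to zero, every gate is `sigmoid(0) = 0.5`, so each channel is simply halved; trained weights would instead emphasise some channels over others based on the utterance-level statistics.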
“…Speaker diarization 1) Offline diarization: The conventional cascaded approach for speaker diarization consists of the following operations: 1) speech activity detection (SAD), 2) speaker embedding extraction from each detected speech segment, 3) clustering of the embeddings, and 4) optional overlap handling. The oracle SAD is sometimes used in the experiments, but the remaining parts are actively being studied in the literature: better speaker embedding extraction methods [34]-[37], clustering methods [11], [14], [38], and overlap assignment methods [22], [39], [40]. The cascaded approach is based on unsupervised clustering; thus, the number of output speakers can take an arbitrary value and can be set flexibly during inference.…”
Section: Related Work (mentioning)
confidence: 99%
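
The clustering step of the cascaded approach described above can be sketched as follows. This is an illustrative assumption, not the method of any cited paper: it uses average-linkage agglomerative clustering on cosine similarity, with a hypothetical stopping `threshold`, to show why the number of output speakers is not fixed in advance. The embeddings are assumed to be non-zero vectors already extracted from speech segments.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cluster_embeddings(embeddings, threshold=0.5):
    """Return one speaker label per segment embedding.

    Clusters are merged greedily (average linkage) until the most similar
    pair falls below `threshold`, so the speaker count emerges from the data.
    """
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > 1:
        best = None  # (average similarity, index a, index b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sims = [cosine(embeddings[i], embeddings[j])
                        for i in clusters[a] for j in clusters[b]]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, a, b)
        if best[0] < threshold:
            break  # remaining clusters are treated as distinct speakers
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy two-dimensional "embeddings" for four speech segments.
segments = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = cluster_embeddings(segments, threshold=0.5)
```

Raising the threshold yields more, purer clusters and lowering it yields fewer; this is one simple way the speaker count can be "set flexibly during inference" without retraining the embedding extractor.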