ECAPA-TDNN Embeddings for Speaker Diarization

Dawalatabad, Nauman; Ravanelli, Mirco; Grondin, François; Thienpondt, Jenthe; Desplanques, Brecht; Na, Hwidong

doi:10.21437/interspeech.2021-941

Cited by 45 publications

(20 citation statements)

References 30 publications

“…(2) Convolutional layers to extract emotion embeddings and SVM to classify embeddings into emotion classes (termed as DNN-SVM). Our selection of these is inspired from the success of embeddings based networks [12,82,84] and fully DNN-based frameworks [40] in speech processing. Performance evaluation over these also enables us to compare the SER efficiency of the two DNN frameworks.…”

Section: Classifier Description (mentioning)

confidence: 99%

Modulation spectral features for speech emotion recognition using deep neural networks

Singh

Sahidullah

Saha

2023

Speech Communication

“…Our choice of these architectures was inspired by the success of techniques such as 1D and 2D convolutions, LSTM, attention mechanism, squeeze and excitation module, Res2Net module, etc. in different speech processing domains [15,29,65-68].…”

Section: Neural Network Architectures (mentioning)

confidence: 99%

“…These include using multiple 1D Res2Net modules, squeeze and excitation blocks, and channel-dependent time-frame attention. The ECAPA-TDNN architecture is found useful for speaker recognition and speaker diarization tasks [67]. In this work, we used the implementation of ECAPA-TDNN provided in SpeechBrain Python toolkit without any change in parameter configuration.…”

Section: ECAPA-TDNN (mentioning)

confidence: 99%

Analysis of constant-Q filterbank based representations for speech emotion recognition

Singh

Waldekar

Sahidullah

et al. 2022

Digital Signal Processing

“…Speaker diarization 1) Offline diarization: The conventional cascaded approach for speaker diarization consists of the following operations: 1) speech activity detection (SAD), 2) speaker embedding extraction from each detected speech segment, 3) clustering of the embeddings, and 4) optional overlap handling. The oracle SAD is sometimes used in the experiments, but the remaining parts are actively being studied in the literature: better speaker embedding extraction methods [34]-[37], clustering methods [11], [14], [38], and overlap assignment methods [22], [39], [40]. The cascaded approach is based on unsupervised clustering; thus, the number of output speakers can take an arbitrary value and can be set flexibly during inference.…”

Section: Related Work (mentioning)

confidence: 99%
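
Illustration (not taken from any of the cited papers): the clustering step of the cascaded approach quoted above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the 2-D vectors are toy stand-ins for real per-segment speaker embeddings, average-linkage agglomerative clustering with cosine similarity is just one possible choice of clustering method, and the function name `cluster_segments` and the `threshold` value are hypothetical. It shows why the number of output speakers is not fixed in advance: merging simply stops once no pair of clusters is similar enough.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cluster_segments(embeddings, threshold=0.5):
    """Average-linkage agglomerative clustering of segment embeddings.

    Repeatedly merges the two most similar clusters and stops when the
    best average cosine similarity falls below `threshold` (an assumed,
    untuned value). Returns one integer speaker label per segment.
    """
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > 1:
        best_sim, best_pair = -2.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [
                    cosine(embeddings[a], embeddings[b])
                    for a in clusters[i]
                    for b in clusters[j]
                ]
                avg = sum(sims) / len(sims)
                if avg > best_sim:
                    best_sim, best_pair = avg, (i, j)
        if best_sim < threshold:
            break  # no remaining pair is similar enough to merge
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for segment_index in members:
            labels[segment_index] = label
    return labels


# Toy embeddings: segments 0, 1 and 4 point one way, segments 2 and 3 another.
toy_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.0, 0.9], [1.0, 0.0]]
toy_labels = cluster_segments(toy_embeddings, threshold=0.5)
print(toy_labels)
```

In a real system the embeddings would come from a speaker embedding extractor applied to each detected speech segment, and the stopping threshold would be tuned on development data.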