2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2021
DOI: 10.1109/asru51503.2021.9688271

Self-Supervised Metric Learning With Graph Clustering For Speaker Diarization

Abstract: In this paper, we propose a novel algorithm for speaker diarization using metric learning for graph-based clustering. The graph clustering algorithms use an adjacency matrix consisting of similarity scores. These scores are computed between speaker embeddings extracted from pairs of audio segments within the given recording. In this paper, we propose an approach that jointly learns the speaker embeddings and the similarity metric using principles of self-supervised learning. The metric learning network impleme…
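
As an illustration of the adjacency matrix described in the abstract, the sketch below builds a matrix of pairwise similarity scores from a few segment embeddings. It uses plain cosine similarity as a stand-in; the paper learns its similarity metric, and that learned metric is not reproduced here. The function names and the toy two-dimensional embeddings are assumptions made for this example only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def adjacency_matrix(embeddings):
    """Pairwise similarity scores between all segment embeddings.

    Cosine scoring is an assumption here, used only as a simple baseline.
    """
    n = len(embeddings)
    return [
        [cosine_similarity(embeddings[i], embeddings[j]) for j in range(n)]
        for i in range(n)
    ]


# Toy embeddings (hypothetical): segments 0 and 1 point in similar directions,
# segment 2 points in an orthogonal direction.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
A = adjacency_matrix(embeddings)
```

In this toy case the entry for segments 0 and 1 is larger than the entry for segments 0 and 2, which is the property a graph clustering algorithm relies on when grouping segments by speaker.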

Cited by 5 publications (5 citation statements)
References 34 publications
“…The conventional cascaded approach for speaker diarization consists of the following operations: 1) speech activity detection (SAD), 2) speaker embedding extraction from each detected speech segment, 3) clustering of the embeddings, and 4) optional overlap handling. The oracle SAD is sometimes used in the experiments, but the remaining parts are actively being studied in the literature: better speaker embedding extraction methods [35]-[38], clustering methods [11], [13], [39], and overlap assignment methods [22], [40], [41]. The cascaded approach is based on unsupervised clustering; thus, the number of output speakers can take an arbitrary value and can be set flexibly during inference.…”
Section: A Offline Diarization
Citation type: mentioning
confidence: 99%
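
The clustering step in the statement above, and its remark that the number of output speakers is not fixed in advance, can be sketched with a small agglomerative hierarchical clustering (AHC) routine. This is a minimal illustration assuming average linkage and a similarity threshold as the stopping rule; the function name `ahc`, the threshold values, and the toy similarity matrix are hypothetical and do not come from any of the cited systems.

```python
def ahc(similarity, threshold):
    """Average-linkage agglomerative clustering on a similarity matrix.

    Repeatedly merges the most similar pair of clusters and stops when no
    pair has an average similarity above `threshold`, so the number of
    clusters (speakers) is decided by the threshold, not fixed beforehand.
    """
    clusters = [[i] for i in range(len(similarity))]

    def average_similarity(a, b):
        total = sum(similarity[i][j] for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        best_score, best_pair = threshold, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                score = average_similarity(clusters[x], clusters[y])
                if score > best_score:
                    best_score, best_pair = score, (x, y)
        if best_pair is None:
            break  # no remaining pair is similar enough to merge
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Toy similarity matrix (hypothetical): segments 0-1 and 2-3 form two groups.
S = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
two_speakers = ahc(S, threshold=0.5)
one_speaker = ahc(S, threshold=0.0)
four_speakers = ahc(S, threshold=0.95)
```

Changing only the threshold yields two, one, or four clusters from the same scores, which is the flexibility at inference time that the statement refers to.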
“…Another line of works [Singh and Ganapathy, 2020, Singh and Ganapathy, 2021b, Singh and Ganapathy, 2021a] also explores refinement of the affinity matrix and graph-based clustering. In [Singh and Ganapathy, 2020], a triplet loss scheme is used to train a DNN that refines the cosine similarity-based affinity matrix for AHC.…”
Section: Neural Network For Speaker Embeddings and Affinity Matrices
Citation type: mentioning
confidence: 99%
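
The triplet loss mentioned in the statement above can be illustrated in its generic margin-based form, written here on similarity scores. This is a sketch under stated assumptions, not the formulation used in the cited work: the function name, the margin value, and the example scores are all hypothetical.

```python
def triplet_loss(sim_pos, sim_neg, margin=0.2):
    """Generic margin-based triplet loss on similarity scores.

    sim_pos: similarity between an anchor segment and a same-speaker segment.
    sim_neg: similarity between the anchor and a different-speaker segment.
    The loss is zero once sim_pos exceeds sim_neg by at least `margin`.
    """
    return max(0.0, margin - sim_pos + sim_neg)


# Well-separated triplet: no penalty.
easy = triplet_loss(sim_pos=0.9, sim_neg=0.1)
# Confused triplet: the different-speaker pair scores higher, so it is penalised.
hard = triplet_loss(sim_pos=0.5, sim_neg=0.6)
```

Minimising such a loss pushes same-speaker scores above different-speaker scores, which is what makes the resulting affinity matrix easier to cluster.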
“…In [Singh and Ganapathy, 2021b], path integral clustering (a graph-structural agglomerative clustering algorithm) is used to define the clusters resulting in better performance than AHC. In [Singh and Ganapathy, 2021a] the similarities are given by a PLDA model whose parameters are updated as part of the training process to improve the performance with respect to cosine similarity.…”
Section: Neural Network For Speaker Embeddings and Affinity Matrices
Citation type: mentioning
confidence: 99%
“…Speaker diarization 1) Offline diarization: The conventional cascaded approach for speaker diarization consists of the following operations: 1) speech activity detection (SAD), 2) speaker embedding extraction from each detected speech segment, 3) clustering of the embeddings, and 4) optional overlap handling. The oracle SAD is sometimes used in the experiments, but the remaining parts are actively being studied in the literature: better speaker embedding extraction methods [34]-[37], clustering methods [11], [14], [38], and overlap assignment methods [22], [39], [40]. The cascaded approach is based on unsupervised clustering; thus, the number of output speakers can take an arbitrary value and can be set flexibly during inference.…”
Section: Related Work
Citation type: mentioning
confidence: 99%