Novel Architectures for Unsupervised Information Bottleneck Based Speaker Diarization of Meetings
2021
DOI: 10.1109/taslp.2020.3036231

Cited by 14 publications (6 citation statements)
References 39 publications
“…A review of the literature indicates that diarization algorithms are traditionally based on an unsupervised approach [8, 9, 10]. Stagewise speaker diarization architectures, consisting of a sequence of modules such as speech detection, speech segmentation, embedding extraction, clustering, and cluster labeling, have been studied for a long time [2, 10, 11].…”
Section: Materials and Methods
confidence: 99%
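To make the stagewise architecture concrete, here is a minimal illustrative sketch in Python of such a pipeline (speech detection, segmentation, embedding extraction, clustering, labeling). It is not the system from the cited paper: the energy-based detector, mean-pooled "embeddings", and agglomerative clustering are stand-in assumptions chosen only to keep the example self-contained.

```python
# Minimal sketch of a stagewise diarization pipeline:
# speech detection -> segmentation -> embedding extraction -> clustering -> labeling.
# Illustrative only: energy-based VAD, mean-pooled features as "embeddings",
# and agglomerative clustering stand in for the components of a real system.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def detect_speech(frames, threshold=0.01):
    """Return a boolean mask of frames whose mean energy exceeds a threshold."""
    energy = (frames ** 2).mean(axis=1)
    return energy > threshold

def segment(speech_mask, seg_len=100):
    """Cut the detected speech frames into fixed-length segments (frame indices)."""
    idx = np.flatnonzero(speech_mask)
    return [idx[i:i + seg_len] for i in range(0, len(idx), seg_len)]

def extract_embedding(frames, segment_idx):
    """Mean-pool frame-level features as a stand-in segment embedding."""
    return frames[segment_idx].mean(axis=0)

def diarize(frames, n_speakers=2):
    mask = detect_speech(frames)
    segments = segment(mask)
    embeddings = np.stack([extract_embedding(frames, s) for s in segments])
    labels = AgglomerativeClustering(n_clusters=n_speakers).fit_predict(embeddings)
    return list(zip(segments, labels))   # segment frame indices with speaker labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_frames = rng.normal(size=(2000, 39)) * 0.2   # e.g. 39-dim features per frame
    for seg, spk in diarize(fake_frames)[:5]:
        print(f"frames {seg[0]}-{seg[-1]} -> speaker {spk}")
```

A real system would replace these stand-ins with a trained speech activity detector, stronger segment representations (e.g., MFCC statistics or x-vectors), and a clustering stage such as the information bottleneck approach named in the paper title.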
“…However, recently developed deep learning frameworks such as the recurrent neural network (RNN), the time-delay neural network (TDNN), and the transformer have shown success in modeling long-term dynamics [8]-[11]. Further, the perception study in [2] shows that humans are able to recognize a language without knowing its grammatical details.…”
Section: Introduction
confidence: 95%
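As a rough illustration of how such architectures widen temporal context, the sketch below approximates a TDNN block with stacked dilated 1-D convolutions over frame-level features; PyTorch is assumed, and the feature dimension, layer sizes, and number of classes are arbitrary choices, not values from the cited works.

```python
# Sketch of a TDNN-style block: stacked dilated 1-D convolutions over
# frame-level features widen the temporal receptive field, one way such
# models capture longer-term dynamics. All sizes are arbitrary.
import torch
import torch.nn as nn

class TinyTDNN(nn.Module):
    def __init__(self, feat_dim=39, hidden=64, n_classes=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=5, dilation=1, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=3, padding=3),
            nn.ReLU(),
        )
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, x):                      # x: (batch, time, feat_dim)
        h = self.net(x.transpose(1, 2))        # (batch, hidden, time)
        pooled = h.mean(dim=2)                 # temporal mean pooling
        return self.classifier(pooled)         # utterance-level logits

logits = TinyTDNN()(torch.randn(8, 200, 39))   # 8 utterances, 200 frames each
print(logits.shape)                            # torch.Size([8, 4])
```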
“…3) Diarization with fixed segmentation: The fixed-segmentation-based diarization framework is motivated by the SD frameworks proposed in [8], [20]. To perform LD/SD, 39-dimensional MFCC features are extracted from the given CS/multi-speaker test utterance using a frame size of 0.02 s and a frame shift of 0.01 s.…”
Section: A. Diarization With Implicit X-vector Representation
confidence: 99%
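For reference, 39-dimensional MFCCs (13 static coefficients plus deltas and delta-deltas) with a 0.02 s frame size and 0.01 s frame shift can be computed as sketched below; librosa is assumed here purely for illustration and may differ from the toolkit used in the quoted work.

```python
# Sketch: 39-dim MFCCs (13 static + delta + delta-delta) with a 20 ms
# window and 10 ms shift, matching the fixed-segmentation setup quoted above.
# librosa is assumed for illustration; the quoted work may use another toolkit.
import numpy as np
import librosa

def mfcc_39(wav_path, sr=16000):
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft = int(0.02 * sr)      # 0.02 s frame size  -> 320 samples at 16 kHz
    hop = int(0.01 * sr)        # 0.01 s frame shift -> 160 samples at 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop)
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    feats = np.vstack([mfcc, delta, delta2])   # shape: (39, num_frames)
    return feats.T                             # (num_frames, 39)

# Example (hypothetical file path):
# feats = mfcc_39("meeting_test_utterance.wav")
# print(feats.shape)
```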
“…With this simple trick, we can synthesize a speech signal that sounds a bit faster or slower than the original one. This is useful as the speaking rate may vary within and across speakers [21]. To avoid changing the speaker characteristics significantly, we restrict the speed perturbation to a maximum of ±5%.…”
Section: Data Augmentation
confidence: 99%
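One common way to realize such speed perturbation is to resample the waveform by a random factor drawn from a ±5% range and then treat the result as having the original sampling rate; the sketch below assumes librosa and is only one possible recipe, not necessarily the exact augmentation used in the quoted work.

```python
# Sketch: speed perturbation by resampling with a random factor in [0.95, 1.05],
# matching the +/-5% restriction quoted above. Resampling also shifts the pitch
# slightly; this is one simple recipe, not necessarily the exact one cited.
import numpy as np
import librosa

def speed_perturb(y, sr, max_pct=0.05, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    factor = rng.uniform(1.0 - max_pct, 1.0 + max_pct)   # e.g. 0.97 or 1.03
    # Resample to sr/factor; treating the result as sampled at sr plays it
    # back 'factor' times faster (with a matching small pitch shift).
    y_perturbed = librosa.resample(y, orig_sr=sr, target_sr=int(sr / factor))
    return y_perturbed, factor

# Example with a synthetic tone:
sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)
tone = 0.1 * np.sin(2 * np.pi * 220 * t)
perturbed, factor = speed_perturb(tone, sr)
print(f"factor={factor:.3f}: {len(tone)} samples -> {len(perturbed)} samples")
```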