Novel Architectures for Unsupervised Information Bottleneck Based Speaker Diarization of Meetings
2021
DOI: 10.1109/taslp.2020.3036231

Cited by 14 publications (6 citation statements)
References 39 publications
“…A review of the literature indicates that diarization algorithms are traditionally based on an unsupervised approach [8, 9, 10]. Stagewise speaker diarization architectures, consisting of a sequence of modules such as speech detection, speech segmentation, embedding extraction, clustering, and cluster labeling, have been studied for a long time [2, 10, 11].…”
Section: Materials and Methods
confidence: 99%
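To make the stagewise architecture concrete, here is a minimal illustrative sketch in Python of such a pipeline (speech detection, segmentation, embedding extraction, clustering, labeling). It is not the system from the cited paper: the energy-based detector, mean-pooled "embeddings", and agglomerative clustering are stand-in assumptions chosen only to keep the example self-contained.

```python
# Minimal sketch of a stagewise diarization pipeline:
# speech detection -> segmentation -> embedding extraction -> clustering -> labeling.
# Illustrative only: energy-based VAD, mean-pooled features as "embeddings",
# and agglomerative clustering stand in for the components of a real system.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def detect_speech(frames, threshold=0.01):
    """Return a boolean mask of frames whose mean energy exceeds a threshold."""
    energy = (frames ** 2).mean(axis=1)
    return energy > threshold

def segment(speech_mask, seg_len=100):
    """Cut the detected speech frames into fixed-length segments (frame indices)."""
    idx = np.flatnonzero(speech_mask)
    return [idx[i:i + seg_len] for i in range(0, len(idx), seg_len)]

def extract_embedding(frames, segment_idx):
    """Mean-pool frame-level features as a stand-in segment embedding."""
    return frames[segment_idx].mean(axis=0)

def diarize(frames, n_speakers=2):
    mask = detect_speech(frames)
    segments = segment(mask)
    embeddings = np.stack([extract_embedding(frames, s) for s in segments])
    labels = AgglomerativeClustering(n_clusters=n_speakers).fit_predict(embeddings)
    return list(zip(segments, labels))   # segment frame indices with speaker labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_frames = rng.normal(size=(2000, 39)) * 0.2   # e.g. 39-dim features per frame
    for seg, spk in diarize(fake_frames)[:5]:
        print(f"frames {seg[0]}-{seg[-1]} -> speaker {spk}")
```

A real system would replace these stand-ins with a trained speech activity detector, stronger segment representations (e.g., MFCC statistics or x-vectors), and a clustering stage such as the information bottleneck approach named in the paper title.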
“…However, recently developed deep learning frameworks such as the recurrent neural network (RNN), the time-delay neural network (TDNN), and the transformer have shown success in modeling long-term dynamics [8]-[11]. Further, the perception study in [2] shows that humans are able to recognize a language without knowing its grammatical details.…”
Section: Introduction
confidence: 95%
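As a rough illustration of how such architectures widen temporal context, the sketch below approximates a TDNN block with stacked dilated 1-D convolutions over frame-level features; PyTorch is assumed, and the feature dimension, layer sizes, and number of classes are arbitrary choices, not values from the cited works.

```python
# Sketch of a TDNN-style block: stacked dilated 1-D convolutions over
# frame-level features widen the temporal receptive field, one way such
# models capture longer-term dynamics. All sizes are arbitrary.
import torch
import torch.nn as nn

class TinyTDNN(nn.Module):
    def __init__(self, feat_dim=39, hidden=64, n_classes=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=5, dilation=1, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=3, padding=3),
            nn.ReLU(),
        )
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, x):                      # x: (batch, time, feat_dim)
        h = self.net(x.transpose(1, 2))        # (batch, hidden, time)
        pooled = h.mean(dim=2)                 # temporal mean pooling
        return self.classifier(pooled)         # utterance-level logits

logits = TinyTDNN()(torch.randn(8, 200, 39))   # 8 utterances, 200 frames each
print(logits.shape)                            # torch.Size([8, 4])
```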
“…3) Diarization with fixed segmentation: The fixed-segmentation-based diarization framework is motivated by the SD frameworks proposed in [8], [20]. To perform LD/SD, 39-dimensional MFCC features are extracted from the given CS/multi-speaker test utterance using a frame size of 0.02 s and a frame shift of 0.01 s.…”
Section: A. Diarization With Implicit X-vector Representation
confidence: 99%
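For reference, 39-dimensional MFCCs (13 static coefficients plus deltas and delta-deltas) with a 0.02 s frame size and 0.01 s frame shift can be computed as sketched below; librosa is assumed here purely for illustration and may differ from the toolkit used in the quoted work.

```python
# Sketch: 39-dim MFCCs (13 static + delta + delta-delta) with a 20 ms
# window and 10 ms shift, matching the fixed-segmentation setup quoted above.
# librosa is assumed for illustration; the quoted work may use another toolkit.
import numpy as np
import librosa

def mfcc_39(wav_path, sr=16000):
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft = int(0.02 * sr)      # 0.02 s frame size  -> 320 samples at 16 kHz
    hop = int(0.01 * sr)        # 0.01 s frame shift -> 160 samples at 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop)
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    feats = np.vstack([mfcc, delta, delta2])   # shape: (39, num_frames)
    return feats.T                             # (num_frames, 39)

# Example (hypothetical file path):
# feats = mfcc_39("meeting_test_utterance.wav")
# print(feats.shape)
```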
“…With this simple trick, we can synthesize a speech signal that sounds a bit faster or slower than the original one. This is useful as the speaking rate may vary within and across speakers [21]. To avoid changing the speaker characteristics significantly, we restrict the speed perturbation to a maximum of ±5%.…”
Section: Data Augmentation
confidence: 99%
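One common way to realize such speed perturbation is to resample the waveform by a random factor drawn from a ±5% range and then treat the result as having the original sampling rate; the sketch below assumes librosa and is only one possible recipe, not necessarily the exact augmentation used in the quoted work.

```python
# Sketch: speed perturbation by resampling with a random factor in [0.95, 1.05],
# matching the +/-5% restriction quoted above. Resampling also shifts the pitch
# slightly; this is one simple recipe, not necessarily the exact one cited.
import numpy as np
import librosa

def speed_perturb(y, sr, max_pct=0.05, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    factor = rng.uniform(1.0 - max_pct, 1.0 + max_pct)   # e.g. 0.97 or 1.03
    # Resample to sr/factor; treating the result as sampled at sr plays it
    # back 'factor' times faster (with a matching small pitch shift).
    y_perturbed = librosa.resample(y, orig_sr=sr, target_sr=int(sr / factor))
    return y_perturbed, factor

# Example with a synthetic tone:
sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)
tone = 0.1 * np.sin(2 * np.pi * 220 * t)
perturbed, factor = speed_perturb(tone, sr)
print(f"factor={factor:.3f}: {len(tone)} samples -> {len(perturbed)} samples")
```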