Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: Theory, implementation and analysis on standard tasks

Landini, Federico; Profant, Ján; Díez, Mireia; Burget, Lukáš

doi:10.1016/j.csl.2021.101254

Cited by 102 publications

(84 citation statements)

References 15 publications

“…Before decoding with TS-VAD, we need an initial diarization result to get each speaker's segments for extracting corresponding i-vectors. M2MeT baseline [9] provides AHC with Variational Bayesian HMM clustering (VBx) [13]. First, for the speaker embedding network, we replace the baseline ResNet with ECAPA-tdnn(C=512) [14].…”

Section: Clustering-based Speaker Diarization (mentioning)

confidence: 99%

The USTC-Ximalaya system for the ICASSP 2022 multi-channel multi-party meeting transcription (M2MeT) challenge

He, Xiang, Zhou et al. 2022

Preprint

We propose two improvements to target-speaker voice activity detection (TS-VAD), the core component of the speaker diarization system we submitted to the 2022 Multi-Channel Multi-Party Meeting Transcription (M2MeT) challenge. These techniques are designed to handle multi-speaker conversations in real-world meeting scenarios with high speaker-overlap ratios and heavily reverberant, noisy conditions. First, for data preparation and augmentation in training TS-VAD models, we use speech data containing both real meetings and simulated indoor conversations. Second, to refine the results of TS-VAD-based decoding, we perform a series of post-processing steps that improve the VAD results and thereby reduce diarization error rates (DERs). Tested on the ALIMEETING corpus, the newly released Mandarin meeting dataset used in M2MeT, our proposed system reduces the DER by up to 66.55% and 60.59% relative on the Eval and Test sets, respectively, compared with classical clustering-based diarization.

“…Within-show speaker diarization is a very active field of research in which deep learning approaches have recently reached the performance of more classic methods based on Hierarchical Agglomerative Clustering (HAC) [1], K-Means or Spectral Clustering [11] or variational Bayesian modeling [12]. Recent neural approaches have shown tremendous improvement for audio recordings involving a limited number of speakers [13][14][15][16]; however, the inherent difficulty of speaker permutation, often addressed using a PIT (permutation invariant training) loss, does not allow current neural end-to-end systems to perform as well as HAC-based approaches when dealing with a large number of speakers per audio file (>7), as explained in [17].…”

Section: Related Work (mentioning)

confidence: 99%

Active Correction for Incremental Speaker Diarization of a Collection with Human in the Loop

et al. 2022

State-of-the-art diarization systems now achieve decent performance, but it is often not good enough to deploy them without human supervision. Additionally, most approaches focus on single audio files, while many use cases involving multiple recordings with recurrent speakers require the incremental processing of a collection. In this paper, we propose a framework that solicits a human in the loop to correct the clustering by answering simple questions. After defining the nature of the questions for both a single file and a collection of files, we propose two algorithms to list those questions, along with the stopping criteria needed to limit the workload on the human in the loop. Experiments performed on the ALLIES dataset show that limited interaction with a human expert can yield relative improvements in diarization error rate (DER) of up to 36.5% for single files and 33.29% for a collection.

“…E3FS3 will include a diarization tool based on the VBx algorithm, which had the best performance in the DIHARD’19 diarization challenge [6], [7], [8], [9]; however, all data that were used for training and validation in the context of the present paper were supplied already diarized, so this part of the system was not validated as part of the E3FS3 α validation.…”

Section: E3FS3 Core Software... (mentioning)

confidence: 99%

Validations of an alpha version of the E3 Forensic Speech Science System (E3FS3) core software tools

Weber, Enzinger, Labrador et al. 2022

Forensic Science International: Synergy
