The Third DIHARD Diarization Challenge

Ryant, Neville; Singh, Prachi; Krishnamohan, Venkat; Varma, Rajat; Church, Kenneth; Cieri, Christopher; Du, Jun; Ganapathy, Sriram; Liberman, Mark

doi:10.21437/interspeech.2021-1208

Cited by 60 publications

(41 citation statements)

References 37 publications

“…DIVE establishes a new state-of-the-art on the standard CALLHOME benchmark, with 6.7% DER compared to 7.8% for the best alternative. In the future, we aim to address experimental settings with variable number of speakers and noisier acoustic conditions [38], [39].…”

Section: Discussion (mentioning)

confidence: 99%

DIVE: End-to-end Speech Diarization via Iterative Speaker Embedding

Zeghidour, Teboul, Grangier

2021

Preprint

We introduce DIVE, an end-to-end speaker diarization algorithm. Our neural algorithm presents the diarization task as an iterative process: it repeatedly builds a representation for each speaker before predicting the voice activity of each speaker conditioned on the extracted representations. This strategy intrinsically resolves the speaker ordering ambiguity without requiring the classical permutation invariant training loss. In contrast with prior work, our model does not rely on pretrained speaker representations and optimizes all parameters of the system with a multi-speaker voice activity loss. Importantly, our loss explicitly excludes unreliable speaker turn boundaries from training, which is adapted to the standard collar-based Diarization Error Rate (DER) evaluation. Overall, these contributions yield a system redefining the state-of-the-art on the standard CALLHOME benchmark, with 6.7% DER compared to 7.8% for the best alternative.

“…For diarization, our experimental setup is based on the baseline system created by the organizers (Ryant et al 2021). We have used the toolkit with the same frame-level acoustic features, embedding extractor, scoring method, etc.…”

Section: Methods (mentioning)

confidence: 99%

“…In this work, we are concerned with only identifying the various domains of spoken documents and hence have only considered the Task 1 of DIHARD III (Ryant et al 2020) where the reference SAD was given. The diarization baseline provided with this challenge (Ryant et al 2021), which was based on one of the submissions of the predecessor challenge (Singh et al 2019), was used to benchmark our proposed SD system.…”

Section: Baseline Diarization System (mentioning)

confidence: 99%

Robust Acoustic Domain Identification with its Application to Speaker Diarization

Kumar, Waldekar, Sahidullah et al.

2022

Preprint

With the rise in multimedia content over the years, more variety is observed in the recording environments of audio. An audio processing system might benefit when it has a module to identify the acoustic domain at its front-end. In this paper, we demonstrate the idea of acoustic domain identification (ADI) for speaker diarization. For this, we first present a detailed study of the various domains of the third DIHARD challenge highlighting the factors that differentiated them from each other. Our main contribution is to develop a simple and efficient solution for ADI. In the present work, we explore speaker embeddings for this task. Next, we integrate the ADI module with the speaker diarization framework of the DIHARD III challenge. The performance substantially improved over that of the baseline when the thresholds for agglomerative hierarchical clustering were optimized according to the respective domains. We achieved a relative improvement of more than 5% and 8% in DER for core and full conditions, respectively, on Track 1 of the DIHARD III evaluation set.

“…Speaker diarization in the multi-party scenario is still a challenging task [1][2][3]. Diarization systems are subject to severe performance degradation when several speakers are overlapping, which may naturally occur in spontaneous speech.…”

Section: Introduction (mentioning)

confidence: 99%

Microphone Array Channel Combination Algorithms for Overlapped Speech Detection

Mariotte, Larcher, Montrésor et al.

2022

Interspeech 2022

Overlapped speech occurs when multiple speakers are simultaneously active. This may lead to severe performance degradation in automatic speech processing systems such as speaker diarization. Overlapped speech detection (OSD) aims at detecting time segments in which several speakers are simultaneously active. Recent deep neural network architectures have shown impressive results in the close-talk scenario. However, performance tends to deteriorate in the context of distant speech. Microphone arrays are often considered under these conditions to record signals including spatial information. This paper investigates the use of the self-attention channel combinator (SACC) system as a feature extractor for OSD. This model is also extended in the complex space (cSACC) to improve the interpretability of the approach. Results show that distant OSD performance with self-attentive models gets closer to the nearfield condition. A detailed analysis of the cSACC combination weights is also conducted showing that the self-attention module focuses attention on the speakers' direction.
