ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9052974
Pyannote.Audio: Neural Building Blocks for Speaker Diarization

Abstract: We introduce pyannote.audio, an open-source toolkit written in Python for speaker diarization. Based on PyTorch machine learning framework, it provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines. pyannote.audio also comes with pre-trained models covering a wide range of domains for voice activity detection, speaker change detection, overlapped speech detection, and speaker embedding - reaching state-of-the-art performance…
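The abstract describes pyannote.audio as a set of trainable building blocks that ships with pre-trained models and pipelines. As a rough illustration of how such a pre-trained pipeline can be applied, here is a minimal Python sketch assuming pyannote.audio 2.x and its Pipeline.from_pretrained API; the pipeline name, audio path, and any required authentication token are illustrative assumptions, not details taken from this page or the paper.

```python
# Minimal sketch (not from the paper): applying a pre-trained speaker
# diarization pipeline with pyannote.audio.
# Assumes pyannote.audio >= 2.x; the pipeline name and audio path are
# illustrative and loading may require a Hugging Face access token.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")

# Run the full pipeline on an audio file.
diarization = pipeline("audio.wav")

# The result is a pyannote.core.Annotation: iterate over speech turns.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
```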

Cited by 206 publications (151 citation statements)
References 18 publications
“…The French dataset was created by selecting the French data from the Librivox website, and Mandarin from the MagicData dataset [21]. The recordings were cut into "utterance"-like segments using pyannote's Voice Activity Detector [22]. Both datasets had a similar number of speakers and total duration as LibriSpeech (250 speakers, 76h and 80h respectively).…”
Section: Datasets and Evaluation Measures (mentioning)
confidence: 99%
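The statement above describes cutting recordings into "utterance"-like segments with pyannote's voice activity detector. A minimal sketch of that kind of segmentation step, assuming pyannote.audio 2.x and an illustrative pre-trained VAD pipeline name (not the cited paper's exact code):

```python
# Minimal sketch (assumption, not the cited paper's code): extracting
# speech-only, "utterance"-like segments from a recording with a
# pre-trained pyannote.audio voice activity detection pipeline.
# Assumes pyannote.audio >= 2.x; pipeline and file names are illustrative.
from pyannote.audio import Pipeline

vad = Pipeline.from_pretrained("pyannote/voice-activity-detection")

# The pipeline returns a pyannote.core.Annotation of detected speech regions.
speech = vad("recording.wav")

# Collapse the regions into a clean timeline of speech segments.
for segment in speech.get_timeline().support():
    print(f"speech from {segment.start:.2f}s to {segment.end:.2f}s")
```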
“…We extract 4 subsets of 80 hours from Libri-light-600, the 'small' cut of the dataset containing approximately 600h of speech. In the first two subsets, the non-speech parts were filtered out using Voice Activity Detection (VAD) computed with pyannote.audio [37]. The LL80-p subset samples the files uniformly, and ends up with a power-law distribution of speakers.…”
Section: Exp 1: How Much Does Noisy Data Hurt? (mentioning)
confidence: 99%
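The statement above describes assembling 80-hour subsets after removing non-speech with pyannote.audio's VAD. A hypothetical sketch of that bookkeeping step, reusing the VAD pipeline from the previous example; the helper names, file list, and greedy selection strategy are illustrative assumptions, not the cited paper's procedure:

```python
# Minimal sketch (assumption): build a fixed-budget subset (e.g. 80 hours
# of detected speech) once VAD has identified the speech regions.
# `vad` is assumed to be a pre-trained pyannote.audio VAD pipeline and
# `files` an illustrative list of audio paths.
def speech_duration(vad, path):
    """Total duration (seconds) of detected speech in one file."""
    return vad(path).get_timeline().support().duration()

def select_subset(vad, files, budget_hours=80):
    """Greedily add files until the speech budget is reached."""
    subset, total = [], 0.0
    for path in files:
        total += speech_duration(vad, path)
        subset.append(path)
        if total >= budget_hours * 3600:
            break
    return subset, total
```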
“…Exp-1 evaluates the performance of the proposed method as a complete segmentation system. The proposed method is compared against two baselines: baseline-1, previously presented by the authors in [49], and baseline-2, a state-of-the-art deep learning approach presented in [51].…”
Section: A. Exp-1: Full Segmentation Using Proposed System (mentioning)
confidence: 99%
“…The pyannote.audio [51] framework was used to train a neural network f: X → y that maps a feature sequence X to the corresponding label sequence y, where y_t = 1 if there is a speaker change at frame t and y_t = 0 otherwise.…”
Section: A. Exp-1: Full Segmentation Using Proposed System (mentioning)
confidence: 99%
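The statement above frames speaker change detection as frame-wise sequence labeling: a network f: X → y that assigns each frame a binary change label. A rough, self-contained PyTorch sketch of that formulation follows; it is not the cited paper's actual architecture, and the layer sizes and feature dimension are illustrative assumptions.

```python
# Minimal sketch (assumption, not the cited paper's model): frame-wise
# sequence labeling for speaker change detection, where y_t = 1 marks a
# speaker change at frame t. A small bidirectional LSTM with a per-frame
# sigmoid output; all hyper-parameters are illustrative.
import torch
import torch.nn as nn

class SpeakerChangeDetector(nn.Module):
    def __init__(self, n_features=59, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, 1)

    def forward(self, x):
        # x: (batch, frames, n_features) -> (batch, frames) change scores
        out, _ = self.lstm(x)
        return torch.sigmoid(self.classifier(out)).squeeze(-1)

# Training uses a per-frame binary target: 1 at change frames, else 0.
model = SpeakerChangeDetector()
features = torch.randn(4, 500, 59)   # batch of feature sequences X
labels = torch.zeros(4, 500)         # label sequences y (each y_t in {0, 1})
loss = nn.BCELoss()(model(features), labels)
```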