ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9052974
Pyannote.Audio: Neural Building Blocks for Speaker Diarization

Abstract: We introduce pyannote.audio, an open-source toolkit written in Python for speaker diarization. Based on PyTorch machine learning framework, it provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines. pyannote.audio also comes with pre-trained models covering a wide range of domains for voice activity detection, speaker change detection, overlapped speech detection, and speaker embedding - reaching state-of-the-art performance…
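The abstract describes pyannote.audio as a set of trainable building blocks that ships with pre-trained models and pipelines. As a rough illustration of how such a pre-trained pipeline can be applied, here is a minimal Python sketch assuming pyannote.audio 2.x and its Pipeline.from_pretrained API; the pipeline name, audio path, and any required authentication token are illustrative assumptions, not details taken from this page or the paper.

```python
# Minimal sketch (not from the paper): applying a pre-trained speaker
# diarization pipeline with pyannote.audio.
# Assumes pyannote.audio >= 2.x; the pipeline name and audio path are
# illustrative and loading may require a Hugging Face access token.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")

# Run the full pipeline on an audio file.
diarization = pipeline("audio.wav")

# The result is a pyannote.core.Annotation: iterate over speech turns.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
```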

Cited by 206 publications (151 citation statements)
References 18 publications
“…The French dataset was created by selecting the French data from the Librivox website, and Mandarin from the MagicData dataset [21]. The recordings were cut into "utterance"-like segments using pyannote's Voice Activity Detector [22]. Both datasets had a similar number of speakers and total duration as LibriSpeech (250 speakers, 76h and 80h respectively).…”
Section: Datasets and Evaluation Measures (mentioning)
confidence: 99%
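The statement above describes cutting recordings into "utterance"-like segments with pyannote's voice activity detector. A minimal sketch of that kind of segmentation step, assuming pyannote.audio 2.x and an illustrative pre-trained VAD pipeline name (not the cited paper's exact code):

```python
# Minimal sketch (assumption, not the cited paper's code): extracting
# speech-only, "utterance"-like segments from a recording with a
# pre-trained pyannote.audio voice activity detection pipeline.
# Assumes pyannote.audio >= 2.x; pipeline and file names are illustrative.
from pyannote.audio import Pipeline

vad = Pipeline.from_pretrained("pyannote/voice-activity-detection")

# The pipeline returns a pyannote.core.Annotation of detected speech regions.
speech = vad("recording.wav")

# Collapse the regions into a clean timeline of speech segments.
for segment in speech.get_timeline().support():
    print(f"speech from {segment.start:.2f}s to {segment.end:.2f}s")
```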
“…We extract 4 subsets of 80 hours from Libri-light-600, the 'small' cut of the dataset containing approximately 600h of speech. In the first two subsets, the non-speech parts were filtered out using Voice Activity Detection (VAD) computed with pyannote.audio [37]. The LL80-p subset samples the files uniformly, and ends up with a power-law distribution of speakers.…”
Section: Exp 1: How Much Does Noisy Data Hurt? (mentioning)
confidence: 99%
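The statement above describes assembling 80-hour subsets after removing non-speech with pyannote.audio's VAD. A hypothetical sketch of that bookkeeping step, reusing the VAD pipeline from the previous example; the helper names, file list, and greedy selection strategy are illustrative assumptions, not the cited paper's procedure:

```python
# Minimal sketch (assumption): build a fixed-budget subset (e.g. 80 hours
# of detected speech) once VAD has identified the speech regions.
# `vad` is assumed to be a pre-trained pyannote.audio VAD pipeline and
# `files` an illustrative list of audio paths.
def speech_duration(vad, path):
    """Total duration (seconds) of detected speech in one file."""
    return vad(path).get_timeline().support().duration()

def select_subset(vad, files, budget_hours=80):
    """Greedily add files until the speech budget is reached."""
    subset, total = [], 0.0
    for path in files:
        total += speech_duration(vad, path)
        subset.append(path)
        if total >= budget_hours * 3600:
            break
    return subset, total
```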
“…Exp-1 evaluates the performance of the proposed method as a complete segmentation system. The proposed method is compared against two baselines: baseline-1, previously presented by the authors in [49], and baseline-2, a state-of-the-art deep learning approach presented in [51].…”
Section: A. Exp-1: Full Segmentation Using Proposed System (mentioning)
confidence: 99%
“…The pyannote.audio [51] framework was used to train a neural network f: X → y that maps a feature sequence X to the corresponding label sequence y, where y_t = 1 if there is a speaker change at frame t and y_t = 0 otherwise.…”
Section: A. Exp-1: Full Segmentation Using Proposed System (mentioning)
confidence: 99%
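The statement above frames speaker change detection as frame-wise sequence labeling: a network f: X → y that assigns each frame a binary change label. A rough, self-contained PyTorch sketch of that formulation follows; it is not the cited paper's actual architecture, and the layer sizes and feature dimension are illustrative assumptions.

```python
# Minimal sketch (assumption, not the cited paper's model): frame-wise
# sequence labeling for speaker change detection, where y_t = 1 marks a
# speaker change at frame t. A small bidirectional LSTM with a per-frame
# sigmoid output; all hyper-parameters are illustrative.
import torch
import torch.nn as nn

class SpeakerChangeDetector(nn.Module):
    def __init__(self, n_features=59, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, 1)

    def forward(self, x):
        # x: (batch, frames, n_features) -> (batch, frames) change scores
        out, _ = self.lstm(x)
        return torch.sigmoid(self.classifier(out)).squeeze(-1)

# Training uses a per-frame binary target: 1 at change frames, else 0.
model = SpeakerChangeDetector()
features = torch.randn(4, 500, 59)   # batch of feature sequences X
labels = torch.zeros(4, 500)         # label sequences y (each y_t in {0, 1})
loss = nn.BCELoss()(model(features), labels)
```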