Interspeech 2018
DOI: 10.21437/interspeech.2018-1750

Neural Speech Turn Segmentation and Affinity Propagation for Speaker Diarization

Abstract: Speaker diarization is the task of determining "who speaks when" in an audio stream. Most diarization systems rely on statistical models to address four sub-tasks: speech activity detection (SAD), speaker change detection (SCD), speech turn clustering, and re-segmentation. First, following the recent success of recurrent neural networks (RNNs) for SAD and SCD, we propose to address re-segmentation with Long Short-Term Memory (LSTM) networks. Then, we propose to use affinity propagation on top of neural speaker …

Citations: cited by 34 publications (32 citation statements)
References: 13 publications
“…The proposed pipeline for speaker diarization. The baseline incorporates end-to-end neural voice activity detection, speaker change detection, and speaker embeddings, with clustering performed via affinity propagation [11]. It is available in pyannote.audio toolkit [12].…”
Section: Principle (mentioning)
confidence: 99%
“…Each time step is assigned to the class (non-speech or one of the κ speakers) with highest prediction scores. This essentially implements a version of the re-segmentation approach originally described in [15] where it was found that = 20 is a reasonable number of epochs. This re-segmentation step may be extended to also assign the class with the second highest prediction score to overlapped speech regions [14].…”
Section: Re-segmentation (mentioning)
confidence: 99%
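
The frame-level decision rule in the statement above (each time step goes to the class with the highest prediction score) can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: the function names `resegment` and `to_segments`, the convention that class 0 is non-speech, and the fixed `frame_duration` are all assumptions made for the example.

```python
def resegment(scores):
    """Assign each frame to its highest-scoring class.

    scores: list of per-frame score lists. By assumption, index 0 is
    non-speech and indices 1..K are the K speakers.
    """
    labels = []
    for frame in scores:
        # argmax over the class scores of this frame
        best = max(range(len(frame)), key=lambda k: frame[k])
        labels.append(best)
    return labels


def to_segments(labels, frame_duration):
    """Merge runs of identical labels into (start, end, speaker) tuples.

    Non-speech runs (label 0, by assumption) are dropped.
    """
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        # close the current run at the end of input or on a label change
        if i == len(labels) or labels[i] != labels[start]:
            if labels[start] != 0:
                segments.append(
                    (start * frame_duration, i * frame_duration, labels[start])
                )
            start = i
    return segments
```

Given scores for four frames and three classes, `resegment` returns one label per frame and `to_segments` turns them into speaker turns. Handling overlapped speech with the second-highest score, as the statement mentions, is not covered by this sketch.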
“…The threshold varies from −100 to 100 with a step of 10. The results are then evaluated (see 3.4) to select the best threshold ([27] is an example of the use of S4D with this state-of-the-art approach). Table 1 illustrates the best results for this system.…”
Section: How To Develop a Broadcast News Diarization System (mentioning)
confidence: 99%
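
The threshold search described in the statement above amounts to a simple grid sweep. The sketch below assumes a caller-supplied `evaluate` function, which is a hypothetical stand-in for the evaluation the citing paper refers to as section 3.4, and assumes that lower values are better (as with an error rate). It does not use the S4D API.

```python
def sweep_thresholds(evaluate, low=-100, high=100, step=10):
    """Try every threshold from low to high (inclusive) and keep the best.

    evaluate: hypothetical callable mapping a threshold to an error value,
    where lower is better (an assumption of this sketch).
    """
    best_threshold, best_error = None, float("inf")
    for threshold in range(low, high + 1, step):
        error = evaluate(threshold)
        # strict comparison keeps the first threshold in case of ties
        if error < best_error:
            best_threshold, best_error = threshold, error
    return best_threshold, best_error
```

With the default bounds this visits 21 thresholds (−100, −90, …, 100) and returns the best threshold together with its error.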