2014 IEEE Spoken Language Technology Workshop (SLT)
DOI: 10.1109/slt.2014.7078608

Artificial neural network features for speaker diarization

Abstract: Speaker diarization finds contiguous speaker segments in an audio recording and clusters them by speaker identity, without any a priori knowledge. Diarization is typically based on short-term spectral features such as Mel-frequency cepstral coefficients (MFCCs). Though these features carry average information about the vocal tract characteristics of a speaker, they are also susceptible to factors unrelated to the speaker identity. In this study, we propose an artificial neural network (ANN) architecture to learn…
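To make the abstract's baseline concrete, here is a minimal sketch of extracting frame-level MFCCs and passing context windows through a small feed-forward network whose bottleneck could act as a learned feature. The libraries (librosa, PyTorch), the file name, and all layer sizes are illustrative assumptions, not the authors' configuration.

```python
# Sketch: MFCC extraction plus a small feed-forward feature extractor,
# loosely in the spirit of the abstract. Library choices (librosa,
# PyTorch), the file name, and the layer sizes are assumptions.
import librosa
import torch
import torch.nn as nn

# 19 static MFCCs per 10 ms frame is a common diarization setup.
wav, sr = librosa.load("meeting.wav", sr=16000)   # placeholder file name
mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=19,
                            hop_length=160, n_fft=400)    # (19, n_frames)
frames = torch.tensor(mfcc.T, dtype=torch.float32)        # (n_frames, 19)

class FeatureNet(nn.Module):
    """Maps a window of MFCC frames to a compact learned feature vector."""
    def __init__(self, context=15, n_mfcc=19, bottleneck=40):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(context * n_mfcc, 512), nn.ReLU(),
            nn.Linear(512, bottleneck),   # bottleneck used as the feature
        )

    def forward(self, x):                 # x: (batch, context * n_mfcc)
        return self.net(x)

# Stack 15-frame context windows and embed them; in practice the network
# would first be trained with a speaker-discriminative objective.
context = 15
windows = frames.unfold(0, context, 1)            # (n_windows, 19, context)
windows = windows.reshape(windows.size(0), -1)
features = FeatureNet()(windows)                  # (n_windows, 40)
```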

Cited by 51 publications (50 citation statements), published 2016–2023
References 10 publications

Citation statements, ordered by relevance:
“…The full training set which contains 135 meetings with 149 speakers was used which is further split into 90% for model training and 10% cross validation set for hyper-parameter tuning. For evaluation, instead of using the full dev and eval sets, we use the meetings recorded at IDIAP, Edinburgh and Brno which are the sets frequently used for evaluation of speaker diarisation [7,20], and which are more consistent with our observations on other datasets. The partition of the dataset is shown in Table 1.…”
Section: Data Preparation (mentioning)
confidence: 99%
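A minimal sketch of the meeting-level 90%/10% split described in the excerpt above; the function name, meeting IDs, and random seed are assumptions, not taken from the cited work.

```python
# Illustrative meeting-level 90/10 split, as described in the quoted
# data-preparation passage. Meeting IDs and the seed are placeholders.
import random

def split_meetings(meeting_ids, train_frac=0.9, seed=0):
    """Shuffle meeting IDs and split into train / cross-validation lists."""
    ids = list(meeting_ids)
    random.Random(seed).shuffle(ids)
    cut = int(round(train_frac * len(ids)))
    return ids[:cut], ids[cut:]

# e.g. an approximately 90%/10% split of 135 training meetings
train_ids, cv_ids = split_meetings([f"meeting_{i:03d}" for i in range(135)])
```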
“…[1,2] Over the past years with the introduction of deep learning, diarisation systems have greatly improved in performance [3][4][5]. Going from the use of i-vectors to d-vectors, for representing a segment of speech for a single speaker, has been a contributing factor to progress [6,7]. The second factor has been the use of more sophisticated clustering processes, which are more appropriate for diarisation [8,9].…”
Section: Introduction (mentioning)
confidence: 99%
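To illustrate the two factors the excerpt names, the sketch below averages frame-level embeddings into a d-vector-style segment representation and then clusters segments agglomeratively. The embedding dimensions, cosine-distance threshold, and scikit-learn usage are illustrative assumptions, not any cited system.

```python
# Sketch: segment-level embeddings ("d-vector" style, i.e. averaged
# frame-level network outputs) followed by agglomerative clustering.
# Dimensions, the threshold, and the placeholder data are assumptions.
import numpy as np
from sklearn.cluster import AgglomerativeClustering   # scikit-learn >= 1.2

def segment_embedding(frame_embeddings):
    """Average L2-normalised frame embeddings into one segment vector."""
    e = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    return e.mean(axis=0)

# Suppose `segments` holds (n_frames_i, dim) arrays from an embedding net.
rng = np.random.default_rng(0)
segments = [rng.standard_normal((50, 128)) for _ in range(10)]  # placeholder
X = np.stack([segment_embedding(s) for s in segments])

# Cluster segments by cosine distance; the threshold controls how many
# speakers are found when the speaker count is unknown.
labels = AgglomerativeClustering(
    n_clusters=None, metric="cosine", linkage="average",
    distance_threshold=0.7).fit_predict(X)
print(labels)   # cluster index per segment = hypothesised speaker label
```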
“…Recently, deep neural networks (DNN) have been successfully applied to various speaker recognition tasks [22,7,45,37,28,24], reaching and exceeding state of the art results of classic GMM- [35] or i-vector-based [8] systems. With few exceptions, systems based on convolutional neural network (CNN) architectures have been used on spectrograms for their unprecedented performance on visual recognition tasks [21].…”
Section: Introduction (mentioning)
confidence: 99%
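A hedged sketch of the "CNN on spectrograms" approach mentioned in the excerpt: a small convolutional network over log-mel patches trained for speaker classification, whose penultimate layer would serve as an embedding. The architecture, layer sizes, and speaker count are assumptions, not any of the cited systems.

```python
# Sketch: small CNN over log-mel spectrogram patches for speaker
# classification; the penultimate activations can serve as speaker
# embeddings. All sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    def __init__(self, n_speakers=100):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.embed = nn.Linear(64 * 4 * 4, 256)    # embedding layer
        self.classify = nn.Linear(256, n_speakers)

    def forward(self, spec):                       # spec: (batch, 1, mel, time)
        h = self.conv(spec).flatten(1)
        emb = self.embed(h)                        # usable as an embedding
        return self.classify(emb), emb

# e.g. a batch of 8 one-channel 64x200 log-mel patches (random placeholder)
logits, embeddings = SpectrogramCNN()(torch.randn(8, 1, 64, 200))
```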