2009
DOI: 10.1109/tasl.2009.2015089

Prosodic and other Long-Term Features for Speaker Diarization

Abstract: Speaker diarization is defined as the task of determining "who spoke when" given an audio track and no other prior knowledge of any kind. The following article shows how a state-of-the-art speaker diarization system can be improved by combining traditional short-term features (MFCCs) with prosodic and other long-term features. First, we present a framework to study the speaker discriminability of 70 different long-term features. Then, we show how the top-ranked long-term features can be combined with …

Citations: cited by 65 publications (66 citation statements)
References: 22 publications
“…In other words, speaker clusters obtained with the bottom-up approach tend to be poorly normalized. This is particularly true when short-term cepstral-based features are used, though recent work with prosodic features have potential to discourage such behavior [10].…”
Section: Normalization and Discrimination
Citation type: mentioning
confidence: 99%
“…A dynamic programming procedure is used to find the optimal one-to-one mapping between the hypothesis and the ground truth segments so that the total overlap between the reference

Category   Feature ID       Short description
pitch      f0 median        median of the pitch
pitch      f0 min           min of the pitch
pitch      f0 mean curve    mean of the pitch tier
formants   f4 stddev        std dev of the 4th formant
formants   f4 min           min of the 4th formant
formants   f4 mean          mean of the 4th formant
formants   f5 stddev        std dev of the 5th formant
formants   f5 min           min of the 5th formant
formants   f5 mean          mean of the 5th formant
harmonic   harm mean        mean of the harmonics-to-noise ratio
formant    form disp mean   mean of the formant dispersion
pitch      pp period mean   mean of the point process of the periodicity contour

Table 1. The 12 prosodic features used in the proposed initialization method (see also [2]). …”
Section: Baseline System
Citation type: mentioning
confidence: 99%
“…In this section, we present another method to estimate k (see Figure 1) and propose to use the aforementioned linear regression to adapt g accordingly. The presented method estimates the number of initial clusters and also provides a non-uniform initialization for the agglomerative clustering procedure based on the long-term feature study and ranking presented in [2]. Derived from the ranking in [2], the 12 top-ranked prosodic features (listed in Table 1) are extracted on all the speech regions (speech/non-speech detector, see [9]) in the recording.…”
Section: Automatic Parameter Estimation and Non-uniform Initialization
Citation type: mentioning
confidence: 99%