2001
DOI: 10.1111/1467-9868.00293

Estimating the Number of Clusters in a Data Set Via the Gap Statistic

Abstract: We propose a method (the 'gap statistic') for estimating the number of clusters (groups) in a set of data. The technique uses the output of any clustering algorithm (e.g. K-means or hierarchical), comparing the change in within-cluster dispersion with that expected under an appropriate reference null distribution. Some theory is developed for the proposal and a simulation study shows that the gap statistic usually outperforms other methods that have been proposed in the literature.
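
The comparison described in the abstract can be illustrated with a minimal sketch. Everything specific below is an assumption for illustration rather than the authors' implementation: the reference null distribution is taken to be uniform over the bounding box of the data, the clustering algorithm is a plain K-means with farthest-point initialisation (a convenience chosen here), and the function names `within_dispersion` and `gap_statistic` are hypothetical. The gap is computed as the mean of log(W_k) over reference data sets minus log(W_k) for the observed data, and the estimate is the smallest k with Gap(k) >= Gap(k+1) - s_(k+1), the selection rule usually associated with the method.

```python
import math
import random


def within_dispersion(points, k, rng, n_init=3, n_iter=100):
    """Smallest within-cluster sum of squares W_k over a few K-means runs."""
    best = float("inf")
    for _ in range(n_init):
        # Farthest-point initialisation (an assumption of this sketch):
        # random first centre, then add the point farthest from those chosen.
        centers = [rng.choice(points)]
        while len(centers) < k:
            centers.append(
                max(points, key=lambda p: min(math.dist(p, c) for c in centers))
            )
        for _ in range(n_iter):
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
                clusters[nearest].append(p)
            # Move each centre to its cluster mean; keep it if the cluster is empty.
            updated = [
                tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[j]
                for j, cl in enumerate(clusters)
            ]
            if updated == centers:
                break
            centers = updated
        w_k = sum(
            math.dist(p, centers[j]) ** 2
            for j, cl in enumerate(clusters)
            for p in cl
        )
        best = min(best, w_k)
    return best


def gap_statistic(points, k_max, n_refs=10, seed=0):
    """Return (estimated k, list of Gap(1..k_max)). Assumes k_max < len(points)."""
    rng = random.Random(seed)
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    gaps, errs = [], []
    for k in range(1, k_max + 1):
        log_wk = math.log(within_dispersion(points, k, rng))
        ref_logs = []
        for _ in range(n_refs):
            # Reference data: uniform over the bounding box of the observed data.
            ref = [
                tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
                for _ in points
            ]
            ref_logs.append(math.log(within_dispersion(ref, k, rng)))
        mean_ref = sum(ref_logs) / n_refs
        sd = math.sqrt(sum((x - mean_ref) ** 2 for x in ref_logs) / n_refs)
        gaps.append(mean_ref - log_wk)
        errs.append(sd * math.sqrt(1 + 1 / n_refs))
    # Smallest k such that Gap(k) >= Gap(k+1) - s_(k+1).
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - errs[k]:
            return k, gaps
    return k_max, gaps


if __name__ == "__main__":
    demo_rng = random.Random(1)
    # Two synthetic, well-separated groups of 30 points each.
    data = [
        (demo_rng.gauss(cx, 0.5), demo_rng.gauss(0.0, 0.5))
        for cx in (0.0, 10.0)
        for _ in range(30)
    ]
    k_hat, gap_values = gap_statistic(data, k_max=4)
    print(k_hat, [round(g, 2) for g in gap_values])
```

The synthetic data in the demo are made up purely to exercise the function. Because the abstract notes that the output of any clustering algorithm can be used, `within_dispersion` could be swapped for a hierarchical method without changing `gap_statistic`.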

citations
Cited by 4,842 publications
(3,617 citation statements)
references
References 18 publications
“…In our application, the same result obtains if we use the gap statistic developed by Tibshirani, Walther and Hastie (2001). 11 "If they want to take us out from every one of the (congressional) Committees, let them do it; we have the streets of the people" -our translation.…”
Section: The Conditional Effects Of Constitutions On Policy
Citation type: supporting
confidence: 53%
“…The maximum number of clusters was defined to be equal to the number of variables. Different criteria exist to identify the optimal number of clusters in a given data set (Tibshirani et al, 2001; Yan, 2005) but in this study, the aim was to identify cluster patterns at different stages of clustering instead of identifying one optimal number of clusters. Relevant cluster stages were defined as stages resulting in a high increase in variance explained compared to the neighbouring splits (local maxima).…”
Section: Methods
Citation type: mentioning
confidence: 99%
“…Prior literature (e.g., Smyth 2000; Still and Bielek 2004; Tibshirani et al 2001) recommends an iterative process to determine the optimal number of clusters. Consistent with this idea, we experimentally varied the number of clusters and repeated the three steps in Figure 2 until the error rate of the decision tree model in Step 3 was minimized.…”
Section: Classification Model
Citation type: mentioning
confidence: 99%