In cluster analysis, finding the number of clusters, K, for a given dataset is an important yet difficult task, because most real-world problems have no universally accepted right or wrong answer: the appropriate K depends on the context and purpose of the cluster study. Numerous methods have been developed for estimating K, but most are not widely used in practice because of their poor performance. It is therefore still common for many clustering methods to require a human user to select a specific value or a range for K before they are run, and an inappropriate choice of K can lead to poor clustering results. This paper presents a new method for automatically estimating the most probable number of clusters. It first calculates the lengths of the constant similarity intervals, L, and then takes the longest ones as representing the most probable numbers of clusters under the given context and the chosen similarity measure. An error function is defined to measure and evaluate the goodness of the estimates. The proposed method has been tested on three synthetic datasets and eight real-world benchmark datasets, and compared with several other popular methods, in particular the TwoStep method implemented in the IBM/SPSS Modeler software package. The experimental results show that the proposed method finds the "desired" predominant number of clusters for all the simulated datasets and most of the benchmark datasets, and statistical tests indicate that it is significantly better than the compared methods.

Estimating the predominant number of clusters in a dataset

Fig. 1. Illustration of different clustering results, with different numbers of clusters, for a given dataset when different purposes and clustering methods are used: (a) original dataset; (b) two possible clusters; (c) four possible clusters; (d) five possible clusters. The sign (+) represents the centre of each cluster.
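The idea of taking the longest constant interval as the estimate of K can be illustrated with a minimal sketch. This is not the paper's algorithm, whose details are not given here; it rests on several assumptions made only for illustration: Euclidean distance stands in for the similarity measure, single-linkage merging defines how many clusters exist at each threshold, and the function names `interval_lengths` and `estimate_k` are hypothetical. For each K, the sketch measures the length of the threshold interval over which the number of clusters stays constant at K, then returns the K with the longest interval.

```python
import math
from itertools import combinations


def interval_lengths(points):
    """Map each K (2 <= K <= n-1) to the length of the distance-threshold
    interval over which single-linkage merging yields exactly K clusters.

    Assumption: Euclidean distance and single linkage are illustrative
    stand-ins for the similarity measure and grouping rule.
    """
    n = len(points)
    if n < 3:
        raise ValueError("need at least three points")
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise distances, shortest first.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    # merge_heights[m] is the threshold at which the (m+1)-th merge occurs,
    # i.e. where the cluster count drops from n - m to n - m - 1.
    merge_heights = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            merge_heights.append(d)

    # K clusters persist from merge_heights[n-K-1] up to merge_heights[n-K].
    return {
        k: merge_heights[n - k] - merge_heights[n - k - 1]
        for k in range(2, n)
    }


def estimate_k(points):
    """Return the K whose constant interval is longest (hypothetical helper)."""
    lengths = interval_lengths(points)
    return max(lengths, key=lengths.get)


if __name__ == "__main__":
    # Three small, well-separated groups of three points each.
    demo = [
        (0, 0), (0, 1), (1, 0),
        (10, 10), (10, 11), (11, 10),
        (20, 0), (20, 1), (21, 0),
    ]
    print(estimate_k(demo))
```

Sorting the values returned by `interval_lengths` in descending order gives a ranked list of candidate values of K rather than a single answer, which is closer in spirit to considering "the longest ones" as the most probable numbers of clusters.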