Evaluating the quality of clusters is critical in most cluster analyses. A number of cluster validity indexes have been proposed, such as the Silhouette and Davies-Bouldin indexes. However, these validity indexes cannot handle clusters with arbitrary shapes. Some researchers employ graph-based distances to cluster nonspherical data sets, but computing graph-based distances between all pairs of points in a data set is time-consuming. A potential solution is to select a subset of representative points. Inspired by this idea, we propose a novel Local Cores-based Cluster Validity (LCCV) index to improve the performance of the Silhouette index. Local cores, which are points of locally maximal density, are selected as representative points. Since graph-based distance is used to evaluate the dissimilarity between local cores, the LCCV index is effective for obtaining the optimal cluster number for data sets containing clusters with arbitrary shapes. Moreover, a hierarchical clustering algorithm based on the LCCV index is proposed. The experimental results on synthetic and real data sets indicate that the new index outperforms existing ones.
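To make the idea in this abstract concrete, the following is a minimal sketch of a Silhouette-style score computed from graph-based distances rather than Euclidean ones. It is not the LCCV index itself: the abstract does not give the exact graph-distance definition or the local-core selection procedure, so this sketch assumes shortest-path distance over a symmetric k-nearest-neighbor graph and takes the representative points as given. The function names (`knn_graph`, `graph_distance_matrix`, `silhouette`) are hypothetical.

```python
import heapq
import math


def knn_graph(points, k):
    """Symmetric k-nearest-neighbor graph as {i: {j: edge_weight}} (assumed construction)."""
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        others = [j for j in range(n) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]
        for j in nearest:
            w = math.dist(points[i], points[j])
            graph[i][j] = w
            graph[j][i] = w
    return graph


def shortest_paths(graph, source):
    """Dijkstra's algorithm; returns path lengths from source, ordered by node index."""
    dist = [math.inf] * len(graph)
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def graph_distance_matrix(points, k):
    """All-pairs graph distances; only cheap when `points` is a small representative set."""
    graph = knn_graph(points, k)
    return [shortest_paths(graph, i) for i in range(len(points))]


def silhouette(dist, labels):
    """Mean Silhouette value from a precomputed distance matrix.

    Assumes at least two clusters and a connected graph (no infinite distances).
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical representative points forming two well-separated groups.
    reps = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
    D = graph_distance_matrix(reps, k=3)
    print(silhouette(D, [0, 0, 0, 1, 1, 1]))
```

Restricting the all-pairs shortest-path computation to a small set of representatives is what keeps the graph-based distance affordable; on the full data set the same computation would be the time-consuming step the abstract describes.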
Cluster analysis aims at classifying objects into categories on the basis of their similarity and has been widely used in many areas such as pattern recognition and image processing. In this paper, we propose a novel clustering algorithm called QCC, based mainly on the following ideas: the density of a cluster center is the highest in its K nearest neighborhood or reverse K nearest neighborhood, and clusters are divided by sparse regions. In addition, we define a novel concept of similarity between clusters to solve the complex-manifold problem. In experiments, we compare the proposed algorithm QCC with the DBSCAN, DP and DAAP algorithms on synthetic and real-world datasets. Results show that QCC performs the best, and that its advantage is especially large on non-spherical data and complex-manifold data.
Keywords: Clustering • Center • Similarity • Neighbor • Manifold
1 Introduction
Clustering is one of the primary methods in data mining and data analysis. It aims at classifying objects into categories or clusters, on the basis of their similarity. The clusters are collections
Editor: João Gama.
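The central idea stated in the QCC abstract, that a cluster center has the highest density within its K nearest neighborhood or its reverse K nearest neighborhood, can be illustrated with a short sketch. This is not the QCC algorithm: the abstract does not define the density estimate or the cluster-similarity measure, so the sketch assumes density is the inverse of the mean distance to the K nearest neighbors, assumes all points are distinct, and only finds candidate centers. The function names are hypothetical.

```python
import math


def knn_indices(points, k):
    """For each point, the indices of its k nearest neighbors (excluding itself)."""
    n = len(points)
    return [
        sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )[:k]
        for i in range(n)
    ]


def knn_density(points, neighbors):
    """Assumed density estimate: inverse mean distance to the k nearest neighbors.

    Requires distinct points, otherwise the mean distance can be zero.
    """
    return [
        len(nbrs) / sum(math.dist(points[i], points[j]) for j in nbrs)
        for i, nbrs in enumerate(neighbors)
    ]


def candidate_centers(points, k):
    """Indices whose density is maximal in their KNN or in their reverse KNN."""
    neighbors = knn_indices(points, k)
    density = knn_density(points, neighbors)

    # Reverse KNN of j: every point i that lists j among its k nearest neighbors.
    reverse = [[] for _ in points]
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            reverse[j].append(i)

    centers = []
    for i in range(len(points)):
        top_in_knn = all(density[i] >= density[j] for j in neighbors[i])
        top_in_rnn = bool(reverse[i]) and all(
            density[i] >= density[j] for j in reverse[i]
        )
        if top_in_knn or top_in_rnn:
            centers.append(i)
    return centers


if __name__ == "__main__":
    # Two hypothetical groups, each denser in the middle, separated by a sparse gap.
    group_a = [(0, 0), (2, 0), (3, 0), (4, 0), (6, 0)]
    group_b = [(x + 100, y) for x, y in group_a]
    print(candidate_centers(group_a + group_b, k=2))
```

A full method would still need to assign the remaining points to centers and merge clusters that are not separated by a sparse region, which is where the abstract's notion of similarity between clusters would come in.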