Finding clusters in data is a challenging problem especially when the clusters are being of widely varied shapes, sizes, and densities. Herein a new scalable clustering technique which addresses all these issues is proposed. In data mining, the purpose of data clustering is to identify useful patterns in the underlying dataset. Within the last several years, many clustering algorithms have been proposed in this area of research. Among all these proposed methods, density clustering methods are the most important due to their high ability to detect arbitrary shaped clusters. Moreover these methods often show good noise-handling capabilities, where clusters are defined as regions of typical densities separated by low or no density regions. In this paper, we aim at enhancing the well-known algorithm DBSCAN, to make it scalable and able to discover clusters from uneven datasets in which clusters are regions of homogenous densities. We achieved the scalability of the proposed algorithm by using the k-means algorithm to get initial partition of the dataset, applying the enhanced DBSCAN on each partition, and then using a merging process to get the actual natural number of clusters in the underlying dataset. This means the proposed algorithm consists of three stages. Experimental results using synthetic datasets show that the proposed clustering algorithm is faster and more scalable than the enhanced DBSCAN counterpart
Abstract-Data clustering is very active and attractive research area in data mining; there are dozens of clustering algorithms that have been published. Any clustering algorithm aims to classify data points according to some criteria. DBSCAN is the most famous and well-studied algorithm. Clusters are recorded as dense regions separated from each other by spars regions. It is based on enumerating the points in Epsneighborhood of each point. This paper proposes a clustering method based on k-nearest neighbors and local density of objects in data; that is computed as the total of distances to the most near points that affected on it. Cluster is defined as a continuous region that has points within local densities fall between minimum local density and maximum local density. The proposed method identifies clusters of different shapes, sizes, and densities. It requires only three parameters; these parameters take only integer values. So it is easy to determine. The experimental results demonstrate the superior of the proposed method in identifying varied density clusters.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.