Clustering is a well-known technique in data mining, and K-means is one of the most widely used clustering algorithms. It is popular because it is conceptually simple, computationally fast, and memory efficient. This paper examines how noise points limit the efficacy of the K-means algorithm by analyzing them in terms of the sum-of-squared error (SSE), which remains the most popular validation measure for K-means. Experiments were conducted on synthetic data sets of multiple dimensions and cluster sizes. Numerous noise points were added to the K clusters, and the effect of noise distance on SSE was examined. From the results, we infer that the distance of a noise point from the cluster center influences SSE. This correlative study is significant because the K-means algorithm assumes that the number of clusters in the database is known in advance, which is not necessarily true in real-world applications. The study probes the characteristic role of noise points in the clustering outcome, with the aim of obtaining better results in real-world applications.
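The relationship between noise distance and SSE described in the abstract above can be illustrated with a minimal sketch. This is not the authors' experimental setup: the data points, the fixed cluster centers, and the helper names `sse` and `sse_with_noise` are all assumptions made for illustration. In particular, the centers are held fixed here, whereas a full K-means run would recompute them after the noise point is added.

```python
def sse(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest center."""
    total = 0.0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total


# Two small, well-separated 2-D clusters (hypothetical data, not from the paper).
clusters = [
    (0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0),
    (10.0, 10.0), (11.0, 10.0), (10.0, 11.0), (11.0, 11.0),
]
# Assumption: centers are fixed at the cluster means and are not re-estimated.
centers = [(0.5, 0.5), (10.5, 10.5)]

baseline = sse(clusters, centers)


def sse_with_noise(distance):
    """SSE after adding one noise point `distance` units to the left of the first center."""
    noise = (centers[0][0] - distance, centers[0][1])
    return sse(clusters + [noise], centers)


if __name__ == "__main__":
    for d in (1.0, 2.0, 4.0):
        # Report how much the single noise point adds to the baseline SSE.
        print(d, sse_with_noise(d) - baseline)
```

Under these assumptions, a single noise point at distance d from its nearest center adds d squared to the SSE, so the contribution grows quadratically as the noise point moves away from the cluster.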
Abstract: Clustering is an inherently unsupervised data mining problem. This paper presents an estimation model for the case where new, unclustered information is fed to an already clustered system. The aim of this paper is to test the accuracy of the resulting Inter Cluster Movement Estimation (ICME) model with multi-dimensional clusters. Clusters of varying sizes and dimensions were constructed from synthetic data and from real data sets taken from the UCI repository. Experimental analysis shows that the accuracy of the approximation model increases with cluster size for clusters of multiple dimensions.