This paper presents an efficient approach to distance estimation and centroid selection based on k-means clustering for small and large datasets. The dataset was first pre-processed. The PIMA Indian diabetes dataset was used for the complete study and analysis. After pre-processing, distance and centroid estimation was performed: initial centroids were selected at random and then updated until the specified number of iterations (epochs) was reached. The distance measures used are Euclidean distance (Ed), Pearson coefficient distance (PCd), Chebyshev distance (Csd), and Canberra distance (Cad). The results indicate that all of the distance measures performed reasonably well for clustering, but Cad outperformed the others in terms of computation time.
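The procedure described in the abstract (random initial centroids, then repeated assignment and centroid updates for a fixed number of epochs, with an interchangeable distance measure) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `epochs` and `seed` parameters, the mean-based centroid update for every measure, and the handling of zero-variance vectors in the Pearson distance are all assumptions made for illustration.

```python
import math
import random


def euclidean(a, b):
    # Ed: square root of the summed squared coordinate differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def chebyshev(a, b):
    # Csd: largest absolute coordinate difference.
    return max(abs(x - y) for x, y in zip(a, b))


def canberra(a, b):
    # Cad: sum of |x - y| / (|x| + |y|); terms with a zero denominator are skipped.
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom > 0:
            total += abs(x - y) / denom
    return total


def pearson_distance(a, b):
    # PCd: 1 - r, where r is the Pearson correlation coefficient.
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # Assumption: treat an undefined correlation as "uncorrelated".
        return 1.0
    return 1.0 - cov / math.sqrt(var_a * var_b)


def kmeans(points, k, distance, epochs=10, seed=0):
    # Hypothetical helper: random initial centroids drawn from the data,
    # then a fixed number of assignment/update passes.
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(epochs):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: distance(p, centroids[j]))
            for p in points
        ]
        # Update step: centroid becomes the coordinate-wise mean of its members.
        # Assumption: the mean is used for every measure, although it is only
        # the exact minimizer for squared Euclidean distance.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels
```

Passing a different function as `distance` (for example `canberra` instead of `euclidean`) swaps the measure without changing the loop, which is one way to compare the four measures on the same data and epoch budget.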
A World Health Organization report projects that diabetes will be the seventh leading cause of death worldwide by 2030. Researchers around the world have investigated the disease with respect to different parameters, and work on early-stage detection is ongoing. The main objective of this paper is to present a detailed study and to explain a practical and potential framework for forecasting diabetes based on the dataset presented. The study helps identify research gaps so that future research can provide an efficient method for diagnosing diabetes at an early stage with the help of data mining. The analysis also provides an investigation of constraints, along with knowledge of the distinctive ways of employing the categorization framework.