Cluster validation aims both to evaluate the results of clustering algorithms and to predict the number of clusters. It is usually achieved using several indexes. Traditional internal clustering validation indexes (CVIs) are mainly based on computing pairwise distances, which gives the related algorithms quadratic complexity. As a result, the existing CVIs cannot handle large data sets properly and need to be revisited to take account of ever-increasing data set volumes. Parallel and distributed implementations of these indexes are therefore required. To address this issue, the authors propose two parallel and distributed models for internal CVIs, namely the Silhouette and Dunn indexes, using the MapReduce framework under Hadoop. The proposed models, termed MR_Silhouette and MR_Dunn, have been tested on both tasks: evaluating clustering results and identifying the optimal number of clusters. The experimental results are promising and show that the proposed parallel and distributed models accomplish both tasks successfully.
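To illustrate why pairwise-distance CVIs scale quadratically, the following is a minimal single-machine sketch of the two indexes in Python. It is not the authors' MR_Silhouette or MR_Dunn models. It assumes Euclidean distance, the common definition of the Silhouette index as the mean of (b - a) / max(a, b) over all points, and the Dunn index as the smallest between-cluster distance divided by the largest cluster diameter (other variants exist). The function names `silhouette` and `dunn` are illustrative.

```python
import math
from itertools import combinations


def silhouette(points, labels):
    """Mean silhouette width over all points (assumed Euclidean distance).

    Every point is compared with every other point, so the number of
    distance evaluations grows as O(n^2).
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes s(i) = 0
        # a: mean distance from point i to the other members of its cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance from point i to any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


def dunn(points, labels):
    """Smallest between-cluster distance over largest cluster diameter."""
    min_between = math.inf
    max_within = 0.0
    # One pass over all n * (n - 1) / 2 pairs: again quadratic in n.
    for i, j in combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])
        if labels[i] == labels[j]:
            max_within = max(max_within, d)
        else:
            min_between = min(min_between, d)
    if max_within == 0:
        return math.inf  # all clusters are single points or exact duplicates
    return min_between / max_within


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
    labs = [0, 0, 1, 1]
    print("silhouette:", silhouette(pts, labs))
    print("dunn:", dunn(pts, labs))
```

Both functions touch every pair of points, which is the cost that motivates distributing the computation. For both indexes, higher values indicate more compact and better separated clusters, so one way to estimate the number of clusters is to compute an index for several candidate values and keep the best-scoring one.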