Abstract. Clustering very large datasets is a challenging problem in data mining and processing. MapReduce is a powerful programming framework that significantly reduces execution time by dividing a job into several tasks and executing them in a distributed environment. K-Means is one of the most widely used clustering methods, and K-Means based on MapReduce is considered an advanced solution for clustering very large datasets. However, execution time remains an obstacle because the number of iterations grows as the dataset size and the number of clusters increase. This paper presents a new approach for reducing the number of iterations of the K-Means algorithm that can be applied to very large dataset clustering. The new method reduces the number of iterations by up to 30 percent while maintaining up to 98 percent accuracy when tested on several very large datasets with real-valued attributes. Based on these experimental results, this paper proposes a new fast K-Means clustering method for very large datasets based on MapReduce combined with a new cutting method (abbreviated to FMR.K-Means).
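To make the iteration cost concrete, the following is a minimal single-process sketch of how one K-Means pass is commonly split into map and reduce steps. It is not the FMR.K-Means method and does not include the paper's cutting method, which the abstract does not describe. The function names (`map_point`, `combine`, `reduce_centroids`, `kmeans`) and the stopping tolerance `tol` are assumptions made for illustration only.

```python
from math import dist  # Euclidean distance, Python 3.8+


def map_point(point, centroids):
    """Map step: emit (index of nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def combine(pairs):
    """Shuffle/combine step: sum coordinates and counts per centroid index."""
    sums = {}
    for idx, (point, count) in pairs:
        if idx in sums:
            acc, n = sums[idx]
            sums[idx] = ([a + p for a, p in zip(acc, point)], n + count)
        else:
            sums[idx] = (list(point), count)
    return sums


def reduce_centroids(sums, centroids):
    """Reduce step: new centroid is the mean; an empty cluster keeps its old centroid."""
    return [
        tuple(s / sums[i][1] for s in sums[i][0]) if i in sums else centroids[i]
        for i in range(len(centroids))
    ]


def kmeans(points, centroids, max_iter=100, tol=1e-9):
    """Repeat map/reduce passes until no centroid moves more than tol (assumed criterion)."""
    iteration = 0
    for iteration in range(1, max_iter + 1):
        pairs = [map_point(p, centroids) for p in points]
        new_centroids = reduce_centroids(combine(pairs), centroids)
        shift = max(dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, iteration


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
final_centroids, iterations = kmeans(points, [(0, 0), (10, 0)])
```

Each pass of the loop corresponds to one full MapReduce job over the dataset, which is why reducing the iteration count translates directly into shorter execution time.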
This paper presents a new clustering algorithm, called the Cell-MST-Based Method, that combines a cell-based method with Minimum Spanning Tree based (MST-based) methods. The algorithm is designed for big datasets on a computer with limited memory, especially thin big datasets, which have a small number of attributes but a very large number of instances. First, a cell-based method converts a big dataset into a small grid of cells such that the memory required to store an edge-weighted graph created from the grid is less than the available memory of the computer. Then MST-based methods obtain an optimal threshold, estimate the number of clusters, and determine the initial centroids. The proposed Cell-MST-based method can reduce the required memory by more than 99% compared with previous similarity-based and MST-based cluster number estimation methods. Moreover, it outperforms the quantization error modeling method in terms of execution time and estimation accuracy.
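The general idea of binning points into cells and then cutting long MST edges can be sketched as follows. This is an illustrative approximation under stated assumptions, not the paper's algorithm: the cell width `cell_size` and the cut `threshold` are supplied by the caller here, whereas the abstract says the method derives an optimal threshold by a procedure it does not describe. The function names are hypothetical, and distances are measured in cell-index units.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def to_cells(points, cell_size):
    """Count points per grid cell; a cell is identified by its integer grid indices."""
    return Counter(tuple(int(x // cell_size) for x in p) for p in points)


def mst_edges(nodes):
    """Prim's algorithm on the complete graph over nodes, weighted by Euclidean distance."""
    nodes = list(nodes)
    if not nodes:
        return []
    # best[i] = (cheapest known edge weight into the tree, tree endpoint of that edge)
    best = {i: (dist(nodes[0], nodes[i]), 0) for i in range(1, len(nodes))}
    edges = []
    while best:
        j = min(best, key=lambda i: best[i][0])
        weight, parent = best.pop(j)
        edges.append((weight, parent, j))
        for i in best:
            d = dist(nodes[j], nodes[i])
            if d < best[i][0]:
                best[i] = (d, j)
    return edges


def estimate_clusters(points, cell_size, threshold):
    """Estimate the cluster count as 1 + number of MST edges longer than threshold."""
    cells = to_cells(points, cell_size)
    if not cells:
        return 0
    return 1 + sum(1 for weight, _, _ in mst_edges(cells) if weight > threshold)


points = [(0.1, 0.1), (0.9, 0.2), (1.5, 0.5), (10.2, 10.1), (11.3, 10.7)]
cells = to_cells(points, cell_size=1.0)
estimate = estimate_clusters(points, cell_size=1.0, threshold=2.0)
```

Because the graph is built over occupied cells rather than individual instances, its size depends on the grid resolution and not on the number of points, which is the source of the memory saving the abstract describes.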