Ensemble clustering is an efficient unsupervised learning technique that has attracted considerable attention. Its purpose is to aggregate the results of several basic clustering algorithms in order to produce a better clustering, and many techniques for doing so have been developed in recent years. However, challenges remain, such as the choice of similarity measure, the agreement function, and how the clustering algorithms are implemented. To address these challenges, this article proposes an ensemble clustering algorithm based on the consistency cluster consensus approach and the MapReduce model. In addition, the proposed algorithm is equipped with a new membership similarity measure. The consistency cluster consensus is defined as a new agreement function for combining the results of the basic clustering methods. The proposed similarity measure consists of two factors: cluster similarity and membership similarity. The proposed ensemble clustering method proceeds in four steps. In the first step, partitions are generated by applying a number of individual clustering algorithms to the data. In the second step, the primary clusters from the partitions are converted to a binary representation. In the third step, the candidate primary clusters for the consensus are identified, with emphasis on the subset of clusters that have the highest similarity to one another. In the fourth step, the consistency cluster consensus is performed through the new agreement function. We use the MapReduce model to implement the basic clustering algorithms in order to reduce computational complexity. Our method is evaluated on several real-world datasets, and its performance is compared with that of other clustering ensemble methods, such as TWCE and RDPC-DSS. Experiments show that our method is more efficient and more accurate, outperforming the best of the compared methods by 7% to 15%.
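As an illustration of the first steps summarized above, the following is a minimal sketch, not the proposed algorithm itself. It assumes the partitions are already available as label vectors, converts every primary cluster to a binary membership row, and compares clusters with the Jaccard index. The Jaccard index is only an assumed stand-in for the proposed similarity measure, and the names `partitions_to_binary` and `jaccard_similarity` are hypothetical.

```python
import numpy as np

def partitions_to_binary(partitions):
    """Stack every primary cluster of every partition as one binary row.

    Row i has a 1 in column j when data point j belongs to cluster i.
    """
    rows = []
    for labels in partitions:
        labels = np.asarray(labels)
        for c in np.unique(labels):
            rows.append((labels == c).astype(int))
    return np.vstack(rows)

def jaccard_similarity(binary):
    """Pairwise Jaccard similarity between binary cluster rows.

    Assumption: Jaccard is used here only for illustration; it is not
    the similarity measure proposed in the article.
    """
    inter = binary @ binary.T                      # shared members
    sizes = binary.sum(axis=1)                     # cluster sizes
    union = sizes[:, None] + sizes[None, :] - inter
    return inter / union

# Two toy partitions of six data points (hypothetical data).
partitions = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
]
B = partitions_to_binary(partitions)   # 4 primary clusters x 6 points
S = jaccard_similarity(B)              # 4 x 4 cluster similarity matrix
```

In this toy case, pairs of clusters with high entries in `S` would be the natural candidates for the consensus step, while disjoint clusters have similarity zero.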
INTRODUCTION
Data mining is the process of discovering regular and hidden patterns and processes in big and distributed data using a wide range of algorithms based on mathematical and statistical sciences. 1,2 These algorithms are usually applied to numeric and nontextual values, while text-specific algorithms are used for textual data. Data mining uses sciences such as artificial intelligence, machine learning, statistics, operational research and database management to build models