Parallel K-Means Clustering Algorithm on DNA Dataset

Othman, Fadil; Abdullah, Rosni; Rashid, Nur’Aini Abdul; Salam, Rosalina Abdul

doi:10.1007/978-3-540-30501-9_54

Cited by 10 publications

(4 citation statements)

References 3 publications


“…A parallel implementation of the k-means clustering algorithm on a cluster of personal computers (PCs) was described in [27]. The proposed algorithm is parallelised based on the inherent data-parallelism especially in the distance calculation and centroid update operations for DNA dataset.…”

Section: Related Work (mentioning)

confidence: 99%
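The data-parallel scheme described in the statement above (distance calculation and centroid update split across machines) can be sketched roughly as follows. This is a minimal illustration, not the cited paper's implementation: threads stand in for the PCs of the cluster, and the names `partial_stats` and `parallel_kmeans`, the strided partitioning, and the stopping rule are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def partial_stats(chunk, centroids):
    """Worker step (hypothetical helper): assign each point of the chunk to
    its nearest centroid and return per-cluster coordinate sums and counts."""
    k, m = len(centroids), len(centroids[0])
    sums = [[0.0] * m for _ in range(k)]
    counts = [0] * k
    for p in chunk:
        j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
        counts[j] += 1
        for d in range(m):
            sums[j][d] += p[d]
    return sums, counts


def parallel_kmeans(points, centroids, workers=2, max_iter=100):
    """Data-parallel Lloyd iterations: each worker processes its own chunk,
    then the partial sums and counts are reduced into new centroids."""
    # Assumption: a simple strided split; a real cluster would distribute
    # the data once and only exchange sums and counts per iteration.
    chunks = [points[i::workers] for i in range(workers)]
    k, m = len(centroids), len(centroids[0])
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_iter):
            results = list(pool.map(partial_stats, chunks,
                                    [centroids] * len(chunks)))
            new = []
            for j in range(k):
                count = sum(r[1][j] for r in results)
                if count == 0:
                    # Empty cluster: keep the previous centroid.
                    new.append(list(centroids[j]))
                    continue
                total = [sum(r[0][j][d] for r in results) for d in range(m)]
                new.append([t / count for t in total])
            if new == centroids:  # centroids stopped moving
                break
            centroids = new
    return centroids
```

Only the per-cluster sums and counts cross the worker boundary, which is what makes the distance and update steps straightforward to distribute.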

Efficient algorithm for big data clustering on single machine

Alguliyev

Alıguliyev

Sukhostat

2020

CAAI trans. intell. technol.


Big data analysis requires the presence of large computing powers, which is not always feasible. And so, it became necessary to develop new clustering algorithms capable of such data processing. This study proposes a new parallel clustering algorithm based on the k‐means algorithm. It significantly reduces the exponential growth of computations. The proposed algorithm splits a dataset into batches while preserving the characteristics of the initial dataset and increasing the clustering speed. The idea is to define cluster centroids, which are also clustered, for each batch. According to the obtained centroids, the data points belong to the cluster with the nearest centroid. Real large datasets are used to conduct the experiments to evaluate the effectiveness of the proposed approach. The proposed approach is compared with k‐means and its modification. The experiments show that the proposed algorithm is a promising tool for clustering large datasets in comparison with the k‐means algorithm.
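The batch idea in the abstract above (cluster each batch, cluster the resulting centroids, then assign every point to its nearest final centroid) can be sketched as below. This is an illustrative reading under stated assumptions, not the authors' algorithm: the strided batch split, the first-k-points initialisation, and the names `kmeans` and `batched_kmeans` are hypothetical, and every batch is assumed to hold at least k points.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(p, centroids):
    """Index of the centroid closest to point p (lowest index wins ties)."""
    return min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))


def kmeans(points, k, max_iter=100):
    """Plain serial Lloyd's algorithm, initialised with the first k points
    (an assumption made to keep the sketch deterministic)."""
    centroids = [[float(x) for x in p] for p in points[:k]]
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        new = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[j]
            for j, g in enumerate(groups)
        ]
        if new == centroids:  # converged
            break
        centroids = new
    return centroids


def batched_kmeans(points, k, n_batches):
    """Two-level clustering: k-means per batch, then k-means over the
    collected batch centroids, then a final nearest-centroid assignment."""
    # Assumption: a strided split as a crude way to keep each batch
    # representative of the whole dataset.
    batches = [points[i::n_batches] for i in range(n_batches)]
    local = [c for b in batches for c in kmeans(b, k)]
    final = kmeans(local, k)
    labels = [nearest(p, final) for p in points]
    return final, labels
```

For example, `batched_kmeans(points, k=2, n_batches=2)` returns the final centroids together with one cluster label per input point.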



“…Many researchers tried to parallelize the k-means algorithm using either parallel or distributed systems. In [5], the authors used the MapReduce framework and the Hadoop Distributed File System (HDFS) in order to distribute the computation workload among the nodes of the system. The proposed techniques applied the k-means locally on the nodes of the system and the results produced were used by the master node in order to produce global centroids applying again the k-means on them.…”

Section: Review of Literature (mentioning)

confidence: 99%

Enhancement of Parallel K-Means algorithm

Mathew

Vijayakumar

2015

2015 International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS)


This paper identifies the limitations of the K-Means algorithm and proposes a parallelization of K-Means using a Firefly-based clustering method. The new parallel architecture can handle a large number of clusters. A modified Firefly algorithm finds the initial optimal cluster centroids, and K-Means then refines them to improve clustering accuracy. The final convergence issue is also addressed and solved to a great extent. The design methodology is explained in the subsequent sections. Finally, the modified algorithm is compared with Parallel K-Means; experiments show that it performs better than the existing algorithm. Four typical benchmark data sets from the UCI machine learning repository are used to demonstrate the results. Index Terms: Clustering, Parallel K-Means, Firefly Algorithm, large data


“…Although the complexity of k-means is not high, O(Iknm), for I iterations, k clusters, n instances, and m features, this implementation may require a long time if the number of iterations for convergence is large. Othman et al. [101] developed a similar solution for clustering DNA data. There are many other parallel versions of k-means based on the principle of data distribution [73].…”

Section: Parallelization (mentioning)

confidence: 99%
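The cost quoted in the statement above can be unpacked as a short worked estimate. The symbols I, k, n, and m are those of the quote; the worker count p and the communication term are assumptions added here for illustration and depend on how the reduction is implemented.

```latex
% Serial cost: each iteration computes k distances of m terms for n points.
T_{\text{serial}} = O(I \cdot k \cdot n \cdot m)

% Example: I = 100,\ k = 10,\ n = 10^{6},\ m = 20
% gives 100 \cdot 10 \cdot 10^{6} \cdot 20 = 2 \times 10^{10} distance terms.

% Data distribution over p workers (assumed): each worker holds n/p points
% and exchanges k sums of m coordinates plus k counts per iteration.
T_{\text{parallel}} = O\!\left(\frac{I \cdot k \cdot n \cdot m}{p}\right)
                    + O\!\left(I \cdot k \cdot m \cdot \log p\right)
```

The second term assumes a tree-style reduction across workers; since it does not grow with n, splitting the data pays off mainly when n is large relative to p.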

Scaling up data mining algorithms: review and taxonomy

García-Pedrajas

Haro-García

2012

Prog Artif Intell


The overwhelming amount of data that are now available in any field of research poses new problems for data mining and knowledge discovery methods. Due to this huge amount of data, most of the current data mining algorithms are inapplicable to many real-world problems. Data mining algorithms become ineffective when the problem size becomes very large. In many cases, the demands of the algorithm in terms of the running time are very large, and mining methods cannot be applied when the problem grows. This aspect is closely related to the time complexity of the method. A second problem is linked with performance; although the method might be applicable, the size of the search space prevents an efficient execution, and the resulting solutions are unsatisfactory. Two approaches have been used to deal with this problem: scaling up data mining algorithms and data reduction. However, because data reduction is a data mining task itself, this technique also suffers from scalability problems. Thus, for many problems, especially when dealing with very large datasets, the only way to deal with the aforementioned problems is to scale up the data mining algorithm. Many efforts have been made to obtain methods that can be used to scale up existing data mining algorithms. In this paper, we review the methods that have been used to address the problem of scalability. We focus on general ideas, rather than specific implementations, that can be used to provide a general view of the current approaches for scaling up data mining methods. A taxonomy of the algorithms is proposed, and many examples of different tasks are presented. Among the different techniques used for data mining, we will pay special attention to evolutionary methods, because these methods have been used very successfully in many data mining tasks.

