2020
DOI: 10.1007/s10618-020-00678-9

An efficient K-means clustering algorithm for tall data

Abstract: The analysis of continuously larger datasets is a task of major importance in a wide variety of scientific fields. Therefore, the development of efficient and parallel algorithms to perform such an analysis is a crucial topic in unsupervised learning. Cluster analysis algorithms are a key element of exploratory data analysis and, among them, the K-means algorithm stands out as the most popular approach due to its ease of implementation, straightforward parallelizability and relatively low computational…

Cited by 55 publications (26 citation statements). References 31 publications.
“…The input dataset is normalized first and then, within the normalized range (0, 1), initial centroids are selected at random. The distances are computed using a min-max similarity measure over (0, 1) [39]. Ran Vijay and Bhatia [40] introduced a K-means algorithm based on the items having the least frequency.…”
Section: Literature Review
confidence: 99%
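A minimal sketch of the preprocessing this excerpt describes, under the assumption of standard min-max rescaling followed by uniform random seeding; the names minmax_normalize and random_centroids are illustrative and not taken from [39] or [40].

```python
import numpy as np

def minmax_normalize(X):
    """Rescale every feature of X into the (0, 1) range."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)  # guard constant columns
    return (X - mins) / span

def random_centroids(X, k, seed=0):
    """Select k initial centroids uniformly at random from the rows of X."""
    rng = np.random.default_rng(seed)
    return X[rng.choice(len(X), size=k, replace=False)].copy()

X = minmax_normalize(np.random.default_rng(1).normal(size=(100, 4)))
centroids = random_centroids(X, k=3)
```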
“…They implement the parallelized simple K-means clustering algorithm for general laboratory use. ParaKMeans provides an easy and manageable client-server application, written in C# [39].…”
Section: Literature Review
confidence: 99%
“…Among its different advantages, such as the ease of its implementation, it must be remarked that both phases of Lloyd's algorithm (the assignment and update steps) can be easily parallelized [12], which is key to the scalability of the algorithm [5]. Additionally, there exists a wide variety of speed-ups/approximations to the K-means algorithm, such as different distance pruning approaches [1], [13]-[15], Minibatch K-means [16], Boundary Weighted K-means [17] and several coreset techniques [18]-[23].…”
Section: B. Lloyd's Algorithm
confidence: 99%
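The excerpt's central point is that the assignment step is independent per point and the update step is independent per cluster, which is why both parallelize cleanly. A minimal serial sketch of Lloyd's iterations, with comments marking where a parallel implementation would split the work:

```python
import numpy as np

def lloyd(X, centroids, n_iter=20):
    """Plain Lloyd iterations over a fixed number of rounds."""
    centroids = centroids.copy()
    for _ in range(n_iter):
        # Assignment step: each point picks its nearest centroid.
        # Independent per point, hence trivially parallelizable.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points.
        # Independent per cluster.
        for j in range(len(centroids)):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, labels

X = np.random.default_rng(0).normal(size=(500, 2))
centroids, labels = lloyd(X, X[:3].copy())
```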
“…One of the main motivations of this work is to improve the scalability of Lloyd's algorithm with respect to the dimensionality of the K-means problem. Moreover, in the literature there are different competitive approximations to the K-means algorithm, such as [17], [20], [22], [48], that do not scale well with respect to this factor. The proposed feature selection algorithm will allow the use of these techniques to approximate the solution of the clustering problem on a tractable number of dimensions for them.…”
Section: Distributed Feature Selection Algorithm for K-means: KMR
confidence: 99%
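The quote motivates selecting a subset of features before running K-means variants that scale poorly with dimensionality. The sketch below only illustrates the shape of that pipeline: it keeps the highest-variance features, a generic stand-in rule that is explicitly not the paper's KMR criterion.

```python
import numpy as np

def top_variance_features(X, m):
    """Placeholder selection rule: keep the m highest-variance features.
    NOT the paper's KMR criterion; only illustrates the reduce-then-cluster
    pipeline the excerpt describes."""
    order = np.argsort(X.var(axis=0))[::-1]
    return np.sort(order[:m])

X = np.random.default_rng(2).normal(size=(1000, 50))
cols = top_variance_features(X, m=10)
X_reduced = X[:, cols]  # feed X_reduced to any K-means approximation
```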
“…Nevertheless, K-means is completely dependent on the initial centroids, whose selection causes wide differences in the execution time of repeated clustering runs and in the clustering results. In the K-means investigated in existing studies, the user arbitrarily determines the number of clusters to categorize the initial data, which leads to classification costs [26,27,28,29]. As data become more diverse in form and bigger in size, the fast execution that is the advantage of K-means cannot be maximized in the era of big data [30,31,32,33,34].…”
Section: Introduction
confidence: 99%
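A standard mitigation for the initialization sensitivity this excerpt describes is k-means++ seeding (Arthur and Vassilvitskii), which spreads the initial centroids apart; it is sketched below for reference and is not claimed to be the method of the cited works.

```python
import numpy as np

def kmeans_pp(X, k, seed=0):
    """k-means++ seeding: sample each new centroid with probability
    proportional to its squared distance from the nearest chosen one."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = ((X[:, None, :] - np.asarray(centroids)[None]) ** 2).sum(-1).min(axis=1)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.asarray(centroids)

X = np.random.default_rng(3).normal(size=(300, 2))
seeds = kmeans_pp(X, k=4)
```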