2020
DOI: 10.1007/s10618-020-00678-9

An efficient K-means clustering algorithm for tall data

Abstract: The analysis of continuously larger datasets is a task of major importance in a wide variety of scientific fields. Therefore, the development of efficient and parallel algorithms to perform such an analysis is a crucial topic in unsupervised learning. Cluster analysis algorithms are a key element of exploratory data analysis and, among them, the K-means algorithm stands out as the most popular approach due to its ease of implementation, straightforward parallelizability and relatively low computational…

Cited by 55 publications (26 citation statements). References 31 publications.
“…The input dataset is normalized first and then, within the normalized range (0, 1), initial centroids are selected at random. The distances are computed using a min-max similarity measure over (0, 1) [39]. Ran Vijay and Bhatia [40] introduced a K-means algorithm based on the items having the least frequency.…”
Section: Literature Review
confidence: 99%
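A minimal sketch of the preprocessing this excerpt describes, under the assumption of standard min-max rescaling followed by uniform random seeding; the names minmax_normalize and random_centroids are illustrative and not taken from [39] or [40].

```python
import numpy as np

def minmax_normalize(X):
    """Rescale every feature of X into the (0, 1) range."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)  # guard constant columns
    return (X - mins) / span

def random_centroids(X, k, seed=0):
    """Select k initial centroids uniformly at random from the rows of X."""
    rng = np.random.default_rng(seed)
    return X[rng.choice(len(X), size=k, replace=False)].copy()

X = minmax_normalize(np.random.default_rng(1).normal(size=(100, 4)))
centroids = random_centroids(X, k=3)
```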
“…They implement the parallelized simple K-means clustering algorithm for general laboratory use. ParaKMeans provides an easy and manageable client-server application, written in C# [39].…”
Section: Literature Review
confidence: 99%
“…Among its different advantages, such as the ease of its implementation, it must be remarked that both phases of Lloyd's algorithm (the assignment and update steps) can be easily parallelized [12], which is key to the scalability of the algorithm [5]. Additionally, there exists a wide variety of speed-ups/approximations to the K-means algorithm, such as different distance pruning approaches [1], [13]-[15], Minibatch K-means [16], Boundary Weighted K-means [17] and several coreset techniques [18]-[23].…”
Section: B. Lloyd's Algorithm
confidence: 99%
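The excerpt's central point is that the assignment step is independent per point and the update step is independent per cluster, which is why both parallelize cleanly. A minimal serial sketch of Lloyd's iterations, with comments marking where a parallel implementation would split the work:

```python
import numpy as np

def lloyd(X, centroids, n_iter=20):
    """Plain Lloyd iterations over a fixed number of rounds."""
    centroids = centroids.copy()
    for _ in range(n_iter):
        # Assignment step: each point picks its nearest centroid.
        # Independent per point, hence trivially parallelizable.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points.
        # Independent per cluster.
        for j in range(len(centroids)):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, labels

X = np.random.default_rng(0).normal(size=(500, 2))
centroids, labels = lloyd(X, X[:3].copy())
```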
“…One of the main motivations of this work is to improve the scalability of Lloyd's algorithm with respect to the dimensionality of the K-means problem. Moreover, in the literature there are different competitive approximations to the K-means algorithm, such as [17], [20], [22], [48], that do not scale well with respect to this factor. The proposed feature selection algorithm will allow the use of these techniques to approximate the solution of the clustering problem on a tractable number of dimensions for them.…”
Section: Distributed Feature Selection Algorithm for K-means: KMR
confidence: 99%
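The quote motivates selecting a subset of features before running K-means variants that scale poorly with dimensionality. The sketch below only illustrates the shape of that pipeline: it keeps the highest-variance features, a generic stand-in rule that is explicitly not the paper's KMR criterion.

```python
import numpy as np

def top_variance_features(X, m):
    """Placeholder selection rule: keep the m highest-variance features.
    NOT the paper's KMR criterion; only illustrates the reduce-then-cluster
    pipeline the excerpt describes."""
    order = np.argsort(X.var(axis=0))[::-1]
    return np.sort(order[:m])

X = np.random.default_rng(2).normal(size=(1000, 50))
cols = top_variance_features(X, m=10)
X_reduced = X[:, cols]  # feed X_reduced to any K-means approximation
```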
“…Nevertheless, K-means is completely dependent on the initial centroids, whose selection causes wide differences in the execution time of repeated clustering runs and in the clustering results. In the K-means investigated in existing studies, the user arbitrarily determines the number of clusters to categorize the initial data, which leads to classification costs [26,27,28,29]. As data become more diverse in form and bigger in size, the fast execution that is the advantage of K-means cannot be maximized in the era of big data [30,31,32,33,34].…”
Section: Introduction
confidence: 99%
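A standard mitigation for the initialization sensitivity this excerpt describes is k-means++ seeding (Arthur and Vassilvitskii), which spreads the initial centroids apart; it is sketched below for reference and is not claimed to be the method of the cited works.

```python
import numpy as np

def kmeans_pp(X, k, seed=0):
    """k-means++ seeding: sample each new centroid with probability
    proportional to its squared distance from the nearest chosen one."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = ((X[:, None, :] - np.asarray(centroids)[None]) ** 2).sum(-1).min(axis=1)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.asarray(centroids)

X = np.random.default_rng(3).normal(size=(300, 2))
seeds = kmeans_pp(X, k=4)
```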