2020
DOI: 10.1155/2020/8884926
A Survey of Parallel Clustering Algorithms Based on Spark

Abstract: Clustering is one of the most important unsupervised machine learning tasks, which is widely used in information retrieval, social network analysis, image processing, and other fields. With the explosive growth of data, the classical clustering algorithms cannot meet the requirements of clustering for big data. Spark is one of the most popular parallel processing platforms for big data, and many researchers have proposed many parallel clustering algorithms based on Spark. In this paper, the existing parallel c…

Cited by 11 publications (2 citation statements)
References 69 publications
“…Clustering is divided into a plethora of types, each of which necessitates an iterative procedure, making it unsuitable for large-scale data processing. As a result, the single-trafficscale evolving clustering method (ECM) had to be transformed into a parallel clustering methodology (PECM) capable of handling large amounts of data [17]. PECM (parallel evolving clustering method) is a statistics evaluation technique that runs in the Apache Spark framework and leverages HDFS (Hadoop distributed file system) for statistics storage [18].…”
Section: Related Work (mentioning)
confidence: 99%
“…Such a notably efficient KMeans-based is demonstrated in [21], whereas in [22] a highly efficient parallelization of the hierarchical agglomerative clustering method in Spark is also presented. A more detailed review on efficient parallel clustering algorithms for big data in Spark framework can be found in [29].…”
Section: Introduction (mentioning)
confidence: 99%