2018
DOI: 10.1145/3132088

Systematic Review of Clustering High-Dimensional and Large Datasets

Abstract: Technological advancement has enabled us to store and process huge amounts of data in relatively short spans of time. The nature of data is rapidly changing; in particular, data are increasingly multi- and high-dimensional. There is an immediate need to expand our focus to include the analysis of high-dimensional and large datasets. Data analysis is becoming a mammoth task, due to the incremental increase in data volume and to complexity in terms of the heterogeneity of data. It is due to this dynamic computing envir…


Cited by 60 publications (30 citation statements)
References: 133 publications
“…SubCLU was proposed by Kailing et al. in 2004, and it remains one of the best subspace clustering algorithms to date [10]. Its subspace search is bottom-up: it starts from one-dimensional subspaces, gradually extends to multidimensional subspaces, and finds density-based clusters in all subspaces.…”
Section: B. SubCLU
confidence: 99%
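A minimal sketch of the bottom-up search described in the citation above, assuming scikit-learn is available. It is not the full SubCLU algorithm: it runs DBSCAN in every one-dimensional subspace and then, Apriori-style, extends only the subspaces that still contain density-based clusters. The helper name and all parameter values are illustrative assumptions.

# Bottom-up subspace clustering sketch (simplified stand-in for SubCLU).
from itertools import combinations
import numpy as np
from sklearn.cluster import DBSCAN

def bottom_up_subspace_clusters(X, eps=0.3, min_samples=5, max_dim=3):
    """Return {subspace (tuple of feature indices): DBSCAN labels} for every
    candidate subspace that contains at least one density-based cluster."""
    results = {}
    # Start with all one-dimensional subspaces.
    current = []
    for d in range(X.shape[1]):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X[:, [d]])
        if (labels != -1).any():            # at least one cluster found
            results[(d,)] = labels
            current.append((d,))
    # Extend dimension by dimension, keeping only promising candidates.
    for k in range(2, max_dim + 1):
        dims = sorted({d for sub in current for d in sub})
        candidates = [c for c in combinations(dims, k)
                      if all(s in results for s in combinations(c, k - 1))]
        current = []
        for c in candidates:
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X[:, list(c)])
            if (labels != -1).any():
                results[c] = labels
                current.append(c)
        if not current:
            break
    return results

# Example on synthetic data: two Gaussian clusters in a 4-dimensional space.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (100, 4)), rng.normal(1.0, 0.1, (100, 4))])
print(sorted(bottom_up_subspace_clusters(X).keys()))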
“…Researchers have proposed many clustering algorithms in recent decades [5][6][7][8][9], and these algorithms can be roughly divided into five categories [10]: (1) partition-based algorithms such as K-means [11] and k-medoids [12]; (2) hierarchical algorithms such as BIRCH [13], CURE [14], and CHAMELEON [15]; (3) density-based algorithms such as DBSCAN [16], OPTICS [17], and DENCLUE [18]; (4) grid-based algorithms such as STING [19] and OptiGrid [20]; (5) model-based algorithms such as EM [21] and COBWEB [22]. The algorithms mentioned above can meet the needs of clustering small, low-dimensional datasets.…”
Section: Introduction
confidence: 99%
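Four of the five families listed in this citation have representative implementations in scikit-learn (grid-based methods such as STING and OptiGrid do not). A small comparison sketch on synthetic data, with illustrative parameter choices:

# One representative per family: partitioning, hierarchical, density-based, model-based.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, Birch, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

X, y_true = make_blobs(n_samples=500, centers=4, cluster_std=0.6, random_state=42)

models = {
    "partitioning (k-means)": KMeans(n_clusters=4, n_init=10, random_state=42),
    "hierarchical (BIRCH)": Birch(n_clusters=4),
    "density-based (DBSCAN)": DBSCAN(eps=0.8, min_samples=10),
    "model-based (EM / GMM)": GaussianMixture(n_components=4, random_state=42),
}

for name, model in models.items():
    labels = model.fit_predict(X)                   # cluster assignments
    ari = adjusted_rand_score(y_true, labels)       # agreement with ground truth
    print(f"{name:24s} ARI = {ari:.2f}")

On small, low-dimensional, well-separated data all four families perform well, which is exactly the regime the citation says these classical algorithms were designed for.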
“…The main characteristics of a clustering algorithm include: (1) scalability, i.e., the ability to manage a growing number of individuals in a limited period of time; (2) adaptability, i.e., the ability to identify clusters of different kinds; (3) self-driven operation, i.e., it should require no knowledge of the problem domain; (4) stability, meaning the algorithm is not affected by the presence of noise and/or outliers; and (5) data-independency, i.e., the algorithm should not be affected by the organization of individuals in the dataset [43].…”
Section: Parallel Clustering Algorithms
confidence: 99%
“…This paper selects K-Means (KM) to study cluster quality, execution time, speed-up, memory utilization, and scalability under a big-data mining setup, considering initial centroid initialization. KM clustering is widely adopted for segmentation, text mining, bioinformatics, wireless sensor networks, the financial discipline, data compression, texture segmentation, computer vision, vector quantization, etc. (Pandove et al., 2018; Xie et al., 2019).…”
Section: Introduction
confidence: 99%
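A hedged sketch of the kind of experiment this citation describes: compare K-Means under random versus k-means++ centroid initialization and report cluster quality and execution time. The dataset size and all parameter values are illustrative assumptions, not figures from the paper.

# Timing and quality comparison of K-Means centroid initialization strategies.
import time
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=100_000, centers=8, n_features=16, random_state=0)

for init in ("random", "k-means++"):
    start = time.perf_counter()
    km = KMeans(n_clusters=8, init=init, n_init=5, random_state=0).fit(X)
    elapsed = time.perf_counter() - start
    # Silhouette is estimated on a subsample to keep the metric itself cheap.
    sil = silhouette_score(X, km.labels_, sample_size=5_000, random_state=0)
    print(f"init={init:10s} time={elapsed:6.2f}s "
          f"inertia={km.inertia_:.3e} silhouette={sil:.3f}")

Memory utilization and speed-up across multiple workers would need additional tooling (e.g. a memory profiler and a distributed K-Means implementation), which is outside the scope of this sketch.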