A MapReduce-based improvement algorithm for DBSCAN

Hu, Xiaojuan; Liu, Lei; Qiu, Ningjia; Yang, Di; Li, Meng

doi:10.1177/1748301817735665

Cited by 22 publications

(8 citation statements)

References 14 publications

“…It also aids computational biologists who are testing and benchmarking new clustering algorithms, evaluation metrics and pre- or post-processing steps [10]. Future iterations of hypercluster could include further cutting-edge clustering techniques, including those designed for larger data sets [31,32] or account for multiple types of data [48]. Hypercluster streamlines comparative unsupervised clustering, allowing the prioritization of both convenience and rigor.…”

Section: Discussion (mentioning)

confidence: 99%

“…Typically, the effect of hyperparameter choice on the quality of clustering results cannot be described with a convex function, meaning that hyperparameters should be chosen through exhaustive grid search [29], a slow and cumbersome process. Software packages for automatic hyperparameter tuning and model selection for regression and classification exist, notably auto-sklearn from AutoML [30], and some groups have made excellent tools for distributing a single clustering calculation for huge datasets [31,32], but to the best of our knowledge, there is no package for comparing several clustering algorithms and hyperparameters.…”

Section: Introduction (mentioning)

confidence: 99%

Hypercluster: a flexible tool for parallelized unsupervised clustering optimization

Blumenberg

Ruggles

2020

BMC Bioinformatics

Background: Unsupervised clustering is a common and exceptionally useful tool for large biological datasets. However, clustering requires upfront algorithm and hyperparameter selection, which can introduce bias into the final clustering labels. It is therefore advisable to obtain a range of clustering results from multiple models and hyperparameters, which can be cumbersome and slow. Results: We present hypercluster, a Python package and SnakeMake pipeline for flexible and parallelized clustering evaluation and selection. Users can efficiently evaluate a huge range of clustering results from multiple models and hyperparameters to identify an optimal model. Conclusions: Hypercluster improves ease of use, robustness and reproducibility for unsupervised clustering application for high throughput biology. Hypercluster is available on pip and bioconda; installation, documentation and example workflows can be found at: https://github.com/ruggleslab/hypercluster.

“…Götz et al. [43] present HPDBSCAN, an algorithm for both shared-memory and distributed-memory based on partitioning the data among processors, running DBSCAN locally on each partition, and then merging the clusters together. Exact and approximate distributed DBSCAN algorithms have been designed using the MapReduce [7,34,39,51,53,63,90,92] and Spark [32,49,54,68,69,82] paradigms. RP-DBSCAN [82], which is an approximate DBSCAN algorithm, has been shown to be the state-of-the-art for MapReduce and Spark.…”

Section: Related Work (mentioning)

confidence: 99%

Theoretically-Efficient and Practical Parallel DBSCAN

Wang

Gu

Shun

2019

Preprint

The DBSCAN method for spatial clustering has received significant attention due to its applicability in a variety of data analysis tasks. There are fast sequential algorithms for DBSCAN in Euclidean space that take O(n log n) work for two dimensions, sub-quadratic work for three or more dimensions, and can be computed approximately in linear work for any constant number of dimensions. However, existing parallel DBSCAN algorithms require quadratic work in the worst case, making them inefficient for large datasets. This paper bridges the gap between theory and practice of parallel DBSCAN by presenting new parallel algorithms for Euclidean exact DBSCAN and approximate DBSCAN that match the work bounds of their sequential counterparts, and are highly parallel (polylogarithmic depth). We present implementations of our algorithms along with optimizations that improve their practical performance. We perform a comprehensive experimental evaluation of our algorithms on a variety of datasets and parameter settings. Our experiments on a 36-core machine with hyper-threading show that we outperform existing parallel DBSCAN implementations by up to several orders of magnitude, and achieve speedups by up to 33x over the best sequential algorithms.

“…The expression levels of a gene across multiple experimental settings are referred to as a gene expression profile, whereas the expression levels of all genes in a sample are referred to as a sample expression profile. Researchers can evaluate the expression levels of a large number of genes in a variety of samples and settings by using microarrays [3]. The information gathered from them is referred to as gene expression data.…”

Section: Introduction (mentioning)

confidence: 99%

Gene Expression Analysis via Spatial Clustering and Evaluation Indexing

2022

ijcsm

Density-based spatial clustering of applications with noise (DBSCAN) is one of the most popular clustering methods in data mining, and it is used to identify useful patterns and interesting distributions in the underlying data. Aggregation methods are used to classify nonlinearly aggregated data, in particular DNA methylation and gene expression data, which are differentially skewed by distance sites and grouped nonlinearly by cancer diseases and by changes in gene expression. Under these conditions, DBSCAN is expected to have a desirable clustering feature that can be used to show the results of the changes. This research reviews DBSCAN and compares its performance with other algorithms, such as the traditional number of clustering, K-means particle swarm optimization (PSO), and Grey–Wolf optimization (GWO). This method offers high performance for improvement. The DBSCAN algorithm also produces better clusters and gives a better performance assessment according to the results shown in this study.
