2014 IEEE 5th International Conference on Software Engineering and Service Science
DOI: 10.1109/icsess.2014.6933652
Parallel K-Medoids clustering algorithm based on Hadoop

Abstract: The K-Medoids clustering algorithm addresses the K-Means algorithm's sensitivity to outlier samples, but it cannot process big data because of its time complexity [1]. MapReduce is a parallel programming model for processing big data, and it has been implemented in Hadoop. To overcome the big-data limitation, the parallel K-Medoids algorithm HK-Medoids, based on Hadoop, was proposed. Every submitted job runs many iterative MapReduce procedures: in the map phase, each sample was assigned to on…
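The truncated abstract describes an iterative assign-in-map, update-in-reduce scheme. A minimal single-machine sketch of one such iteration, with plain Python functions standing in for Hadoop's map and reduce phases (all function names here are hypothetical illustrations, not the paper's actual implementation):

```python
import math

def dist(a, b):
    # Euclidean distance between two points (tuples of floats)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def map_phase(points, medoids):
    # Map: assign each sample to its nearest medoid
    # (key = medoid index, value = the sample itself)
    clusters = {i: [] for i in range(len(medoids))}
    for p in points:
        i = min(range(len(medoids)), key=lambda i: dist(p, medoids[i]))
        clusters[i].append(p)
    return clusters

def reduce_phase(clusters):
    # Reduce: within each cluster, pick as the new medoid the member
    # that minimizes the total intra-cluster distance
    return [min(c, key=lambda m: sum(dist(m, p) for p in c))
            for c in clusters.values() if c]

def hk_medoids_sketch(points, medoids, max_iters=10):
    # Iterate map/reduce until the medoids stop changing
    for _ in range(max_iters):
        new = reduce_phase(map_phase(points, medoids))
        if sorted(new) == sorted(medoids):
            break
        medoids = new
    return medoids
```

On a Hadoop cluster the map phase would be distributed across data splits, with medoids broadcast to every mapper at the start of each job; this sketch only illustrates the per-iteration logic.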

Cited by 16 publications (4 citation statements)
References 4 publications
“…We are, to the best of our knowledge, the first team to scale a semi-metric K-medoids algorithm to 1 billion points (with 6 attributes) (see BillionOne dataset appendix F.2.1). PAMAE [37] does run on a dataset of around 4 billion, but the data is restricted to Euclidean space; other distributed semi-metric K-medoids algorithms [19,22,28,40,41] are not run at this scale. To compare to other non-Euclidean algorithms, HPDBSCAN [16] demonstrates runs up to 82 million points with 4 attributes.…”
Section: Results and Comparisons
confidence: 99%
“…In [14], the optimal search for medoids is performed based on the basic properties of triangular geometry. The speed of k-medoids clustering is improved while the validity of the clustering result is maintained [18]. Parallel k-medoids clustering can also be implemented on a Graphics Processing Unit (GPU).…”
Section: Article Info
confidence: 99%
“…[14][15][16][17], especially k-medoids clustering, which will be the focus of this research. One of the technologies used to develop parallel k-medoids clustering is Hadoop MapReduce [18][19][20][21][22]…”
confidence: 99%
“…Based on those improvements, and because the image set in the assistant platform grows constantly, the dynamically built tree of BIRCH suits the constantly growing image set without additional clustering passes; Jiang [7] therefore applied BIRCH to ancient characters together with an improved k-medoids algorithm, but the threshold is constant. The BIRCH clustering is based on subclusters, so many isolated data points are eliminated and the clustering speed is improved.…”
Section: Fig. 1 The Classical CF Tree
confidence: 99%