A collaborative divide-and-conquer K-means clustering algorithm for processing large data

Cui, Huimin; Ruan, Gong; Xue, Jingling; Xie, Rui; Wang, Lei; Feng, Xiaobing

doi:10.1145/2597917.2597918

Cited by 13 publications

(10 citation statements)

References 28 publications

“…More recently, [63] proposed a new efficient way to deal with large distributed datasets. The method is based on a collaborative divide-and-conquer algorithm using k-means as base clustering algorithm.…”

Section: Distributed Data (mentioning)

confidence: 99%

Collaborative clustering: Why, when, what and how

et al. 2018

Highlights

• A didactic presentation of issues raised by clustering is given.

• An up-to-date review and classification of collaborative clustering methods is presented.

• The questions about why, when and how collaborative clustering can help are addressed.

Abstract

Clustering is one type of unsupervised learning where the goal is to partition the set of objects into groups called clusters. Faced to the difficulty to design a general purpose clustering algorithm and to choose a good, let alone perfect, set of criteria for clustering a data set, one solution is to resort to a variety of clustering procedures based on different techniques, parameters and/or initializations, in order to construct one (or several) final clustering(s). The hope is that by combining several clustering solutions, each one with its own bias and imperfections, one will get a better overall solution.

In the cooperative clustering model, as Ensemble Clustering, a set of clustering algorithms are used in parallel on a given data set: the local results are combined to get an hopefully better overall clustering. In the collaborative framework, the goal is that each local computation, quite possibly applied to distinct data sets, benefit from the work done by the other collaborators. This paper is dedicated to collaborative clustering. In particular, after a brief overview of clustering and the major issues linked to, it presents main challenges related to organize and control the collaborative process.

“…Kmeans partitions a number of objects into k clusters such that similar objects belong to the same cluster [12].…”

Section: Platform and Benchmark (mentioning)

confidence: 99%

Hadoop+

Cui, Lü, et al. 2015

Proceedings of the 29th ACM on International Conference on Supercomputing

Self Cite

Despite the widespread adoption of heterogeneous clusters in modern data centers, modeling heterogeneity is still a big challenge, especially for large-scale MapReduce applications. In a CPU/GPU hybrid heterogeneous cluster, allocating more computing resources to a MapReduce application does not always mean better performance, since simultaneously running CPU and GPU tasks will contend for shared resources.

This paper proposes a heterogeneity model to predict the shared resource contention between the simultaneously running tasks of a MapReduce application when heterogeneous computing resources (e.g. CPUs and GPUs) are allocated. To support the approach, we present a heterogeneous MapReduce framework, Hadoop+, which enables CPUs and GPUs to process big data coordinately, and leverages the heterogeneity model to assist users in selecting the computing resources for different purposes.

Our experimental results show three benefits. First, Hadoop+ exploits GPU capability, and achieves 1.4x to 16.1x speedups over Hadoop for 5 real applications when running individually. Second, the heterogeneity model can be used to allocate GPUs among multiple simultaneously running MapReduce applications, bringing up to 36.9% (17.6% in average) speedup when multiple applications are running simultaneously. Third, the model is verified to be able to select the optimal or most cost-effective resource consumption.

“…The scores in the rating matrix represent the significant features for the users and items, but the rating matrix commonly consists of unknown rating scores (data sparsity) which lower the quality of the predicted scores' accuracy. However, during the streaming of rating scores into the rating matrix, some rating scores deviate from its accurate places (Cui et al, 2014). Usually, the deviation is caused by the streaming of the huge amount of rating scores in the rating matrix without care for sorting and managing these scores to extract the accurate latent feedback.…”

Section: Introduction (mentioning)

confidence: 99%

“…The DFC algorithm randomly divides the large-scale matrix factorization task into smaller sub-problems and solve those subproblems in parallel and then combine them using ensemble methods based on low-rank approximations (Mackey et al, 2011). Cui et al (2014) have proposed the state-of-the-art divide and conquer k-means clustering algorithm to reduce the imprecision in rearranging the streaming data. Mackey et al (2011) have rearranged the matrix factorization based on the ensemble method and Cui et al (2014) have identified the data places based on the clustering method and its relations.…”

Section: Introduction (mentioning)

confidence: 99%

“…Cui et al (2014) have proposed the state-of-the-art divide and conquer k-means clustering algorithm to reduce the imprecision in rearranging the streaming data. Mackey et al (2011) have rearranged the matrix factorization based on the ensemble method and Cui et al (2014) have identified the data places based on the clustering method and its relations. However, none of these methods have focused on the similarity of users (sim u ) and the similarity of items (sim i ).…”

Section: Introduction (mentioning)

confidence: 99%
