GDCluster: A General Decentralized Clustering Algorithm

Mashayekhi, Hoda; Habibi, Jafar; Khalafbeigi, Tania; Voulgaris, Spyros; Steen, Maarten van

doi:10.1109/tkde.2015.2391123

Cited by 24 publications

(7 citation statements)

References 28 publications

“…Decentralized clustering on distributed data using P2P has been studied recently. Some solutions introduce distributed k-means algorithms that construct a global set of artificial points to act as a proxy for the entire dataset [1,16]. There are some solutions that consider P2P random networks and work in static settings, however they are aimed at computing basic average of centroids.…”

Section: Related Work (mentioning)

confidence: 99%

“…The third method is agml, a decentralized version of P2P k-means that allows nodes to exchange summaries representing their local data, then apply clustering using generated summaries [5]. Also, we implemented gdc as a P2P k-means algorithm that allows nodes to generate and exchange a global set of artificial points to act as a proxy for the entire dataset [16]. Finally, we use golf that is a P2P k-means implemented using gossip protocol [2].…”

Section: K-Means Clustering Algorithms (mentioning)

confidence: 99%

Decentralized and Adaptive K-Means Clustering for Non-IID Data Using HyperLogLog Counters

Soliman, Girdzijauskas, Bouguelia, et al. 2020

Advances in Knowledge Discovery and Data Mining

The data shared over the Internet tends to originate from ubiquitous and autonomous sources such as mobile phones, fitness trackers, and IoT devices. Centralized and federated machine learning solutions represent the predominant way of providing smart services for users. However, moving data to a central location for analysis causes not only many privacy concerns, but also communication overhead. Therefore, in certain situations machine learning models need to be trained in a collaborative and decentralized manner, similar to the way the data is originally generated, without requiring any central authority for data or model aggregation. This paper presents a decentralized and adaptive k-means algorithm that clusters data from multiple sources organized in peer-to-peer networks. Our algorithm allows peers to reach an approximation of the global model without sharing any raw data. Most importantly, we address the challenge of decentralized clustering with skewed non-IID data and asynchronous computations by integrating HyperLogLog counters with the k-means algorithm. Furthermore, our clustering algorithm allows nodes to individually determine the number of clusters that fits their local data. Results using synthetic and real-world datasets show that our algorithm outperforms state-of-the-art decentralized k-means algorithms, achieving an accuracy gain of up to 36%.

“…Recently, two K‐means‐based models, distributed PCA and K‐means and KPCA+ K‐means clustering, were developed based on the PCA concept and kernel PCA concept. Mashayekhi et al proposed GDCluster, a general fully decentralized clustering method, which is capable of clustering dynamic and distributed datasets. In GDCluster, nodes continuously cooperate through decentralized gossip‐based communication to maintain summarized views of the dataset.…”

Section: Data Mining Techniques In Distributed Environment (mentioning)

confidence: 99%

“…Mashayekhi et al proposed GDCluster, a general fully decentralized clustering method, which is capable of clustering dynamic and distributed datasets [126]. In GDCluster, nodes continuously cooperate through decentralized gossip-based communication to maintain summarized views of the dataset. Other approaches for DC are still in progress.…”

Section: Distributed Clustering (mentioning)

confidence: 99%
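
The citation statements above describe nodes cooperating through gossip-based communication to maintain summaries of the dataset. As a rough illustration of the general gossip idea only, and not of GDCluster itself, the following minimal sketch shows pairwise gossip averaging: each node repeatedly contacts a random peer and both replace their estimates with the mean. The function name `gossip_average`, its parameters, and the sample values are hypothetical and chosen for illustration.

```python
import random


def gossip_average(values, rounds=50, seed=0):
    """Pairwise gossip averaging (illustrative sketch, not GDCluster).

    Each node holds one local number. In every round, each node picks a
    random other node and both replace their estimates with their mean.
    The sum of all estimates is preserved, so every estimate drifts
    toward the global average. Requires at least two nodes.
    """
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    estimates = list(values)
    n = len(estimates)
    for _ in range(rounds):
        for i in range(n):
            # Pick a random peer j different from i.
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1
            mean = (estimates[i] + estimates[j]) / 2.0
            estimates[i] = estimates[j] = mean
    return estimates


# Hypothetical local values held by four nodes; their true mean is 7.0.
local_values = [1.0, 5.0, 9.0, 13.0]
estimates = gossip_average(local_values)
```

In this sketch no node ever sees the full list of values, yet after enough exchanges every local estimate sits close to the global mean. A decentralized clustering method would exchange richer summaries than a single number, but the communication pattern is similar.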

Data mining in distributed environment: a survey

Gan, Lin, Chao, et al. 2017

WIREs Data Min & Knowl

Due to the rapid growth of resource sharing, distributed systems are developed, which can be used to utilize the computations. Data mining (DM) provides powerful techniques for finding meaningful and useful information from a very large amount of data, and has a wide range of real‐world applications. However, traditional DM algorithms assume that the data is centrally collected, memory‐resident, and static. It is challenging to manage the large‐scale data and process them with very limited resources. For example, large amounts of data are quickly produced and stored at multiple locations. It becomes increasingly expensive to centralize them in a single place. Moreover, traditional DM algorithms generally have some problems and challenges, such as memory limits, low processing ability, and inadequate hard disk, and so on. To solve the above problems, DM on distributed computing environment [also called distributed data mining (DDM)] has been emerging as a valuable alternative in many applications. In this study, a survey of state‐of‐the‐art DDM techniques is provided, including distributed frequent itemset mining, distributed frequent sequence mining, distributed frequent graph mining, distributed clustering, and privacy preserving of distributed data mining. We finally summarize the opportunities of data mining tasks in distributed environment. WIREs Data Mining Knowl Discov 2017, 7:e1216. doi: 10.1002/widm.1216. This article is categorized under: Application Areas > Business and Industry; Fundamental Concepts of Data and Knowledge > Motivation and Emergence of Data Mining; Technologies > Computer Architectures for Data Mining.

“…A number of applications based on Epidemic protocols have been proposed to serve different purposes in different environments. For example, Epidemic protocols have been employed to implement applications for Peer-to-Peer (P2P) overlay networks [1,2,3], distributed computing [4], mobile ad hoc networks (MANET) [5], wireless sensor networks (WSN) [6,7,8,9,10], failure detection [11], distributed data mining [12,13,14] and exascale high performance computing [15,16,17].…”

Section: Introduction (mentioning)

confidence: 99%

Robust and efficient membership management in large-scale dynamic networks

Poonpakdee, Fatta 2017

Future Generation Computer Systems

Epidemic protocols are a bio-inspired communication and computation paradigm for large-scale network system based on randomised communication. These protocols rely on a membership service to build decentralised and random overlay topologies. In large-scale, dynamic network environments, node churn and failures may have a detrimental effect on the structure of the overlay topologies with negative impact on the efficiency and the accuracy of applications. Most importantly, there exists the risk of a permanent loss of global connectivity that would prevent the correct convergence of applications. This work investigates to what extent a dynamic network environment may negatively affect the performance of Epidemic membership protocols. A novel Enhanced Expander Membership Protocol (EMP+) based on the expansion properties of graphs is presented. The proposed protocol is evaluated against other membership protocols and the comparative analysis shows that EMP+ can support faster application convergence and is the first membership protocol to provide robustness against global network connectivity problems.
