Distributed Decision‐Tree Induction in Peer‐to‐Peer Systems

Bhaduri, Kanishka; Wolff, Ran; Giannella, Chris; Kargupta, Hillol

doi:10.1002/sam.10006

Cited by 64 publications

(57 citation statements)

References 46 publications


“…Parallel programming is incomplete without discussing the most recent approach called MAP Reduce. It can process large sized data in highly parallel manner [8]. Map Reduce was introduced by Google in 2004.…”

Section: Parallel Approaches For Data Mining (mentioning)

confidence: 99%

Data Mining Techniques in Parallel Environment- A Comprehensive Survey

Shah, Chauhan, Potdar et al. 2014

IJCA


Data mining is the process of discovering interesting and useful patterns and relationships in large volumes of data. The knowledge discovered in this way can then be used for further analysis and prediction. Data mining techniques include clustering, classification, and association. Classification is one of the major techniques for discovering patterns in large amounts of data and is widely used in many fields. Extracting patterns from a large volume of data sequentially takes a lot of time; extracting them in parallel can reduce that time. Parallel techniques are appropriate when the data volume is large and results are needed within a few seconds. These techniques can be implemented using different approaches such as MPI, OpenMP, CUDA, or MapReduce. This paper discusses two classification techniques, decision tree induction and k-nearest neighbors, using both sequential and parallel approaches.


“…Very few works address issues related to concept drift in a P2P network. A fully distributed decision tree induction method was proposed by Bhaduri et al [10]. The proposal involves drift detection, that triggers a tree update.…”

Section: Handling Concept Drift In Fully Distributed Environments (mentioning)

confidence: 99%

Massively Distributed Concept Drift Handling in Large Networks

Hegedűs, Ormándi, Jelasity 2013

Advs. Complex Syst.


Massively distributed data mining in large networks such as smart device platforms and peer-to-peer systems is a rapidly developing research area. One important problem here is concept drift, where global data patterns (movement, preferences, activities, etc.) change according to the actual set of participating users, the weather, the time of day, or as a result of events such as accidents or even natural catastrophes. In an important case, when the network is very large but only a few training samples can be obtained locally at each node, no distributed solution is known that could follow concept drift efficiently. This case is characteristic of smart device platforms where each device stores only one local observation or data record related to a learning problem. Here we present two algorithms to handle concept drift. Neither algorithm collects data at a central location; instead, models of the data perform random walks in the network while being improved using an online learning algorithm. The first algorithm achieves adaptivity by maintaining young as well as old models in the network according to a fixed age distribution. The second one measures the performance of models locally, and discards them if they are judged outdated. We demonstrate through a thorough experimental analysis that our algorithms outperform the known competing methods if the number of independent local samples is limited relative to the speed of drift: a typical scenario in our targeted application domains. The two algorithms have different strengths: while the age distribution approach is very simple and efficient, explicit drift detection can be useful in monitoring applications to trigger control action.


“…Bawa et al [4] developed an approach based on probabilistic counting. In addition, techniques have been developed for addressing more complex data mining/data problems over large-scale dynamic networks: association rule mining [28], facility location [24], outlier detection [9], decision tree induction [7], ensemble classification [25], support vector machine-based classification [1], K-means clustering [11], top-K query processing [3]. A related line of research concerns the monitoring of various kinds of data models over large numbers of data streams.…”

Section: Data Analysis In Large Dynamic Network (mentioning)

confidence: 99%

Scalable Distributed Change Detection from Astronomy Data Streams using Local, Asynchronous Eigen Monitoring Algorithms

Das, Bhaduri, Arora et al. 2009

Proceedings of the 2009 SIAM International Conference on Data Mining

Self Cite


This paper considers the problem of change detection using local distributed eigen monitoring algorithms for the next generation of petascale astronomy data pipelines such as the Large Synoptic Survey Telescope (LSST). This telescope will take repeat images of the night sky every 20 seconds, thereby generating 30 terabytes of calibrated imagery every night that will need to be coanalyzed with other astronomical data stored at different locations around the world. Change point detection and event classification in such data sets may provide useful insights into unique astronomical phenomena displaying astrophysically significant variations: quasars, supernovae, variable stars, and potentially hazardous asteroids. However, performing such data mining tasks is a challenging problem for such high-throughput distributed data streams. In this paper we propose a highly scalable and distributed asynchronous algorithm for monitoring the principal components (PC) of such dynamic data streams. We demonstrate the algorithm on a large set of distributed astronomical data to accomplish well-known astronomy tasks such as measuring variations in the fundamental plane of galaxy parameters. The proposed algorithm is provably correct (i.e. converges to the correct PCs without centralizing any data) and can seamlessly handle changes to the data or the network. Real experiments performed on Sloan Digital Sky Survey (SDSS) catalogue data show the effectiveness of the algorithm.
