Parallel Implementation of Decision Tree Learning Algorithms

Amado, Nuno; Gama, João; Silva, Fernando M. A.

doi:10.1007/3-540-45329-6_4

Cited by 18 publications

(9 citation statements)

References 1 publication

Supporting

Mentioning

Contrasting

Order By: Relevance

“…For example, Ben-Haim et al proposed a new algorithm called streaming parallel decision tree (SPDT). It builds the decision tree using horizontal parallelism for large datasets [24,25].…”

Section: Related Workmentioning

confidence: 99%

A novel parallel learning algorithm for pattern classification

Wang

Wei

2019

SN Appl. Sci.

View full text Add to dashboard Cite

In today's data-intensive applications, machine learning constructs algorithms that are capable of learning and making predications on the data. Margin setting algorithm (MSA) is a novel machine learning algorithm for pattern classification. It employs an artificial immune system approach to generates prototype regions as the classification boundaries. However, its computation time limited its applications in real-world application. When the datasets grow in size and algorithm complexity increases, it is necessary to spread the work among multiple cores and processors. To reduce the execution time during classification, a parallel implementation of MSA, called PMSA is proposed for multicore and multiprocessor system. It is the first work to scale up the classification time of MSA using parallel implementation. To evaluate the proposed PMSA algorithm, we used standard image datasets of 512 × 512 pixels and 321 × 481 pixels. Besides, benchmark datasets from University of California, Irvine Machine Learning Repository are also used. They are 768 data samples from dataset Pima Indian Diabetes, 683 data samples from dataset Wisconsin Breast Cancer, 690 data samples from dataset Australian Credit Approval, 178 data samples from dataset Wine and 391 data samples from dataset Svmguide2. The classification performance is compared with another two state-of-the-art classification algorithms: the artificial neural network and the support vector machine. The results show the proposed PMSA gains significant improvements in terms of execution time, with a promising speedup compared to the single-threaded CPU counterpart.

show abstract

“…For example, Ben-Haim et al proposed a new algorithm called streaming parallel decision tree (SPDT). It builds the decision tree using horizontal parallelism for large datasets [24,25].…”

Section: Related Workmentioning

confidence: 99%

A novel parallel learning algorithm for pattern classification

Wang

Wei

2019

SN Appl. Sci.

View full text Add to dashboard Cite

show abstract

“…We can categorize big data approaches to decision tree induction as follows: building one big tree (Andrzejak et al, 2013;Panda et al, 2009;Ntoutsi et al, 2008;Zhang and Jiang, 2012;Pawlik and Augsten, 2011;Narlikar, 1998;Sreenivas et al, 2000;Goil and Choudhary, 2001;Amado et al, 2001;Domingos and Hulten, 2000;Dai and Ji, 2014), transferring all decision trees into one rule base and back into a decision tree, ensemble approaches (Louppe and Geurts, 2012;Hansen and Salamon, 1990;Sollich and Krogh, 1996;Breiman, 1999), and others (e.g., Kargupta and Park, 2004) that do not build a new tree and use a combination of tree results. According to Ben-Haim and Tom-Tov (2010), another way to categorize the different types of algorithms for handling large datasets is to divide them into the following two groups: pre-sorting of data and using approximate representations of data.…”

Section: Background and Related Workmentioning

confidence: 99%

Interpretable decision-tree induction in a big data parallel framework

Weinberg

2017

International Journal of Applied Mathematics and Computer Science

View full text Add to dashboard Cite

When running data-mining algorithms on big data platforms, a parallel, distributed framework, such as MAPREDUCE, may be used. However, in a parallel framework, each individual model fits the data allocated to its own computing node without necessarily fitting the entire dataset. In order to induce a single consistent model, ensemble algorithms such as majority voting, aggregate the local models, rather than analyzing the entire dataset directly. Our goal is to develop an efficient algorithm for choosing one representative model from multiple, locally induced decision-tree models. The proposed SySM (syntactic similarity method) algorithm computes the similarity between the models produced by parallel nodes and chooses the model which is most similar to others as the best representative of the entire dataset. In 18.75% of 48 experiments on four big datasets, SySM accuracy is significantly higher than that of the ensemble; in about 43.75% of the experiments, SySM accuracy is significantly lower; in one case, the results are identical; and in the remaining 35.41% of cases the difference is not statistically significant. Compared with ensemble methods, the representative tree models selected by the proposed methodology are more compact and interpretable, their induction consumes less memory, and, as confirmed by the empirical results, they allow faster classification of new records.

show abstract

“…In this setting, the increasingly available parallel computing resources can be leveraged in two ways. First, proposals have been made to use these resources to improve the runtime performance of greedy search algorithms, i.e., to find the same results faster [2,6]. A second use of parallel computing resources that has been studied is to improve the task-based performance of greedy algorithms by addressing their shortcomings, namely their tendency to get caught in local extrema.…”

Section: Introductionmentioning

confidence: 99%

Bucket Selection: A Model-Independent Diverse Selection Strategy for Widening

Fillbrunn

Wörteler

Grossniklaus

et al. 2017

Advances in Intelligent Data Analysis XVI

View full text Add to dashboard Cite

When using a greedy algorithm for finding a model, as is the case in many data mining algorithms, there is a risk of getting caught in local extrema, i.e., suboptimal solutions. Widening is a technique for enhancing greedy algorithms by using parallel resources to broaden the search in the model space. The most important component of widening is the selector, a function that chooses the next models to refine. This selector ideally enforces diversity within the selected set of models in order to ensure that parallel workers explore sufficiently different parts of the model space and do not end up mimicking a simple beam search. Previous publications have shown that this works well for problems with a suitable distance measure for the models, but if no such measure is available, applying widening is challenging. In addition these approaches require extensive, sequential computations for diverse subset selection, making the entire process much slower than the original greedy algorithm. In this paper we propose the bucket selector, a model-independent randomized selection strategy. We find that (a) the bucket selector is a lot faster and not significantly worse when a diversity measure exists and (b) it performs better than existing selection strategies in cases without a diversity measure.

show abstract

Parallel Implementation of Decision Tree Learning Algorithms

Cited by 18 publications

References 1 publication

A novel parallel learning algorithm for pattern classification

A novel parallel learning algorithm for pattern classification

Interpretable decision-tree induction in a big data parallel framework

Bucket Selection: A Model-Independent Diverse Selection Strategy for Widening

Contact Info

Product

Resources

About