2012
DOI: 10.1007/978-3-642-33460-3_28

Ensembles on Random Patches

Abstract: In this paper, we consider supervised learning under the assumption that the available memory is small compared to the dataset size. This general framework is relevant in the context of big data, distributed databases and embedded systems. We investigate a very simple, yet effective, ensemble framework that builds each individual model of the ensemble from a random patch of data obtained by drawing random subsets of both instances and features from the whole dataset. We carry out an extensive and systematic…
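The scheme the abstract describes can be reproduced with scikit-learn's BaggingClassifier, whose sampling options cover exactly this combination of instance and feature subsampling. The sketch below is illustrative only; the synthetic dataset, the 0.5/0.5 patch sizes, and the estimator count are arbitrary choices, not values from the paper.

```python
# A minimal sketch of the Random Patches scheme: every base
# estimator is fit on a random subset of both the instances (rows)
# and the features (columns) of the training set.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=40, random_state=0)

random_patches = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.5,           # each tree sees 50% of the instances...
    max_features=0.5,          # ...and 50% of the features
    bootstrap=False,           # instances drawn without replacement
    bootstrap_features=False,  # features drawn without replacement
    random_state=0,
).fit(X, y)

print(random_patches.score(X, y))
```

Because every tree touches only its own patch, the per-estimator memory footprint is bounded by the patch size rather than by the full dataset, which is the memory-constrained setting the abstract targets.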

Cited by 111 publications (79 citation statements)
References 7 publications
“…To create the different random subsamples we used three different methods: bagging [33], pasting [34] and random patches [8]. Bagging [33] consists in drawing random bootstrap subsets of the original data. Pasting [34] is a similar method in which the random samples are drawn without replacement.…”
Section: Ensembles of Cost-Sensitive Decision Trees
confidence: 99%
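To make the three subsampling schemes named in the quoted statement concrete, here is a minimal NumPy sketch; the function names and the sample sizes are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 1000, 20

def bagging_indices(n, size):
    # Bagging: bootstrap sample, i.e. instances drawn WITH replacement.
    return rng.integers(0, n, size=size)

def pasting_indices(n, size):
    # Pasting: same idea, but instances drawn WITHOUT replacement.
    return rng.choice(n, size=size, replace=False)

def random_patch(n, p, n_sub, p_sub):
    # Random patches: subsample instances AND features jointly.
    rows = rng.choice(n, size=n_sub, replace=False)
    cols = rng.choice(p, size=p_sub, replace=False)
    return rows, cols

rows, cols = random_patch(n_samples, n_features, 500, 10)
# A patch of a matrix X would then be X[np.ix_(rows, cols)].
```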
“…The CSDT algorithm only creates one tree in order to make a classification; however, individual decision trees typically suffer from high variance [8]. A very efficient and simple way to address this flaw is to use them in the context of ensemble methods.…”
Section: Introduction
confidence: 99%
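A quick way to see the variance argument in this quote is to compare the spread of test scores of a single tree against a bagged ensemble over repeated random splits. The sketch below uses a synthetic dataset and arbitrary sizes, purely for illustration; the ensemble's scores typically vary far less across seeds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

single, ensemble = [], []
for seed in range(10):
    Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=seed)
    # One fully grown tree: low bias, high variance across splits.
    single.append(
        DecisionTreeClassifier(random_state=seed).fit(Xtr, ytr).score(Xte, yte)
    )
    # Averaging 50 trees damps that variance.
    ensemble.append(
        BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                          random_state=seed).fit(Xtr, ytr).score(Xte, yte)
    )

print("single tree: mean %.3f, std %.3f" % (np.mean(single), np.std(single)))
print("ensemble:    mean %.3f, std %.3f" % (np.mean(ensemble), np.std(ensemble)))
```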
“…Another line of research considers variants that are tailored towards special cases. For instance, Louppe and Geurts [18] consider small subsets of the data, called patches. Each patch is based on a different subset of features and the overall ensemble consists of trees built independently on the patches.…”
Section: Large-Scale Construction
confidence: 99%
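The independent, per-patch construction this quote highlights can be sketched directly: each tree is fit on its own rows and columns, remembers its column subset, and predictions are combined by majority vote. Everything below (function names, patch sizes) is a hypothetical illustration under those assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_patch_ensemble(X, y, n_trees=25, n_rows=200, n_cols=5, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        # Each tree is built independently on its own patch.
        rows = rng.choice(len(X), size=n_rows, replace=False)
        cols = rng.choice(X.shape[1], size=n_cols, replace=False)
        tree = DecisionTreeClassifier().fit(X[np.ix_(rows, cols)], y[rows])
        trees.append((tree, cols))  # remember each tree's feature subset
    return trees

def predict(trees, X):
    # Each tree votes using only its own columns; labels are assumed
    # to be non-negative integers so that bincount works.
    votes = np.stack([t.predict(X[:, cols]) for t, cols in trees]).astype(int)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```

Since each tree needs only its patch in memory and the trees never interact, the training loop parallelizes trivially, which is what makes the construction attractive at scale.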
“…We can categorize big data approaches to decision tree induction as follows: building one big tree (Andrzejak et al, 2013;Panda et al, 2009;Ntoutsi et al, 2008;Zhang and Jiang, 2012;Pawlik and Augsten, 2011;Narlikar, 1998;Sreenivas et al, 2000;Goil and Choudhary, 2001;Amado et al, 2001;Domingos and Hulten, 2000;Dai and Ji, 2014), transferring all decision trees into one rule base and back into a decision tree, ensemble approaches (Louppe and Geurts, 2012;Hansen and Salamon, 1990;Sollich and Krogh, 1996;Breiman, 1999), and others (e.g., Kargupta and Park, 2004) that do not build a new tree and use a combination of tree results. According to Ben-Haim and Tom-Tov (2010), another way to categorize the different types of algorithms for handling large datasets is to divide them into the following two groups: pre-sorting of data and using approximate representations of data.…”
Section: Background and Related Work
confidence: 99%