2011 IEEE 11th International Conference on Data Mining
DOI: 10.1109/icdm.2011.39
COMET: A Recipe for Learning and Using Large Ensembles on Massive Data

Abstract: COMET is a single-pass MapReduce algorithm for learning on large-scale data. It builds multiple random forest ensembles on distributed blocks of data and merges them into a mega-ensemble. This approach is appropriate when learning from massive-scale data that is too large to fit on a single machine. To get the best accuracy, IVoting should be used instead of bagging to generate the training subset for each decision tree in the random forest. Experiments with two large datasets (5GB and 50GB compressed) show th…
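To make the recipe concrete, the following is a minimal, illustrative sketch in Python (function names, the sampling loop, and the error estimate are assumptions for illustration, not the authors' implementation): each mapper trains a small randomized-tree ensemble on its local data block, drawing each tree's training subset with IVoting rather than bagging, and the reducer simply concatenates the per-block ensembles into one mega-ensemble.

```python
# A minimal sketch of the COMET recipe (illustrative only; names and the
# error estimate are assumptions, not the paper's implementation).
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def majority_vote(ensemble, X):
    """Combine tree predictions by unweighted majority vote."""
    votes = np.stack([tree.predict(X) for tree in ensemble])
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)


def ivote_bite(X, y, ensemble, bite_size, rng):
    """Draw one IVoting training subset ('bite'): misclassified examples are
    always accepted; correctly classified ones are accepted with probability
    e / (1 - e), where e is the ensemble's current error (a simplified
    in-sample estimate here)."""
    if not ensemble:
        idx = rng.choice(len(X), size=bite_size, replace=True)
        return X[idx], y[idx]
    preds = majority_vote(ensemble, X)
    e = float(np.mean(preds != y))
    if e <= 0.0 or e >= 1.0:
        idx = rng.choice(len(X), size=bite_size, replace=True)
        return X[idx], y[idx]
    accept_correct = e / (1.0 - e)
    chosen = []
    while len(chosen) < bite_size:
        i = int(rng.integers(len(X)))
        if preds[i] != y[i] or rng.random() < accept_correct:
            chosen.append(i)
    return X[chosen], y[chosen]


def map_train_block(X_block, y_block, n_trees=10, bite_size=256, seed=0):
    """Mapper: learn a small randomized-tree ensemble on one data block."""
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(n_trees):
        Xb, yb = ivote_bite(X_block, y_block, ensemble, bite_size, rng)
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(2**31)))
        ensemble.append(tree.fit(Xb, yb))
    return ensemble


def reduce_merge(per_block_ensembles):
    """Reducer: the mega-ensemble is just the union of all block ensembles."""
    return [tree for block in per_block_ensembles for tree in block]
```

Because prediction is a plain majority vote over all trees, the reduce step needs no retraining: merging ensembles is just concatenation, which is what keeps the algorithm single-pass.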

Cited by 31 publications (19 citation statements)
References 24 publications
“…On all datasets, this setting yields very poor performance when M_max is low. Building base estimators on re-sampled random patches thus brings a clear advantage to RP, RS and P, and hence confirms the conclusions of Basilico et al., who showed in [8] that using more data indeed produces more accurate models than learning from a single subsample. This latter experiment furthermore shows that the good performance of RP cannot be trivially attributed to the fact that our datasets contain so many instances that only processing a subsample of them would be enough.…”
Section: Memory Reduction With Loss (supporting)
confidence: 70%
“…The first benefit of RP is that it generalizes both the Pasting Rvotes (P) method [1] (and its extensions [7,8]) and the Random Subspace (RS) algorithm [2]. Both are indeed merely particular cases of RP: setting p_s = 1.0 yields RS, while setting p_f = 1.0 yields P. As such, it is expected that when both hyper-parameters p_s and p_f are tuned, RP should be at least as good as the best of the two methods, provided there is no overfitting associated with this tuning.…”
Section: Related Work (mentioning)
confidence: 99%
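The relationship described in the passage above can be seen directly in how a single "patch" is drawn. The sketch below is a hypothetical illustration (the helper name and the choice of base estimator are assumptions, not the cited paper's code): a fraction p_s of the instances and a fraction p_f of the features are sampled, and the two special cases fall out of the sampling fractions.

```python
# Illustrative sketch of drawing one "random patch" (hypothetical helper,
# not the cited paper's code): sample a fraction p_s of the instances and a
# fraction p_f of the features, then fit a base estimator on that patch.
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def fit_on_patch(X, y, p_s, p_f, rng):
    n, d = X.shape
    rows = rng.choice(n, size=max(1, int(p_s * n)), replace=False)
    cols = rng.choice(d, size=max(1, int(p_f * d)), replace=False)
    tree = DecisionTreeClassifier().fit(X[np.ix_(rows, cols)], y[rows])
    return tree, cols  # cols must be kept to project data at predict time

# Special cases noted in the quoted passage:
#   p_s = 1.0  ->  Random Subspaces (all instances, a random feature subset)
#   p_f = 1.0  ->  Pasting Rvotes   (a random instance subset, all features)
```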
“…[2,26] communicate the histograms re-built for each layer of tree nodes to a master worker for tree induction. [1] is a MapReduce algorithm which builds multiple random forest ensembles on distributed blocks of data and merges them into a mega-ensemble. In [11] ScalParC employs a distributed hash table to implement the splitting phase for classification problems.…”
Section: Related Work (mentioning)
confidence: 99%
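As a rough illustration of the histogram-based approach mentioned first in the passage above (an assumption-laden sketch, not the cited systems' code): each worker summarizes a feature of its local block as fixed-bin, per-class counts, and the master sums the workers' histograms and scans the bin boundaries for the lowest-impurity split.

```python
# Sketch of histogram-based distributed split search (illustrative assumptions
# throughout: shared bin edges, Gini impurity, a single feature).
import numpy as np


def local_histogram(x, y, bin_edges, classes):
    """Worker side: per-class counts of feature x over shared bin edges."""
    return np.stack([np.histogram(x[y == c], bins=bin_edges)[0]
                     for c in classes])


def best_split_from_histograms(worker_hists, bin_edges):
    """Master side: sum worker histograms, pick the Gini-best bin boundary."""
    total = np.sum(worker_hists, axis=0)          # shape: (n_classes, n_bins)
    best_thr, best_score = None, np.inf
    for b in range(1, total.shape[1]):
        left, right = total[:, :b].sum(axis=1), total[:, b:].sum(axis=1)
        score = sum(side.sum() * (1.0 - ((side / max(side.sum(), 1)) ** 2).sum())
                    for side in (left, right))    # size-weighted Gini impurity
        if score < best_score:
            best_thr, best_score = bin_edges[b], score
    return best_thr
```

Only the histograms, a handful of counts per worker and feature, cross the network for each layer of tree nodes, which is what makes this communication pattern attractive at scale.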
“…Their popularity stems from the ability to (a) select, from the set of all attributes, a subset that is most relevant for the regression and classification problem at hand; (b) identify complex, non-linear correlations between attributes; and (c) provide highly interpretable and human-readable models [7,17,19,25]. Recently, due to the increasing amount of available data and the ubiquity of distributed computation platforms and clouds, there is a rapidly growing interest in designing distributed versions of regression and classification trees [1,2,17,21,26,28], for instance, the decision/regression tree in the Apache Spark MLlib machine learning package. Meanwhile, since many of the large datasets are from observations and measurements of physical entities and events, such data is inevitably noisy and skewed, in part due to equipment malfunctions or abnormal events [10,12,27].…”
Section: Introduction (mentioning)
confidence: 99%
“…[59]. Recent work in the literature has shown that MapReduce can be utilized to scale tasks in semantic classification [58] [60,61]. In this work, MapReduce is employed to speed up the IF-MCA algorithm.…”
Section: Feature Selection (mentioning)
confidence: 99%