2011 IEEE 11th International Conference on Data Mining
DOI: 10.1109/icdm.2011.39
COMET: A Recipe for Learning and Using Large Ensembles on Massive Data

Abstract: COMET is a single-pass MapReduce algorithm for learning on large-scale data. It builds multiple random forest ensembles on distributed blocks of data and merges them into a mega-ensemble. This approach is appropriate when learning from massive-scale data that is too large to fit on a single machine. To get the best accuracy, IVoting should be used instead of bagging to generate the training subset for each decision tree in the random forest. Experiments with two large datasets (5GB and 50GB compressed) show th…
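To make the recipe concrete, the following is a minimal, illustrative sketch in Python (function names, the sampling loop, and the error estimate are assumptions for illustration, not the authors' implementation): each mapper trains a small randomized-tree ensemble on its local data block, drawing each tree's training subset with IVoting rather than bagging, and the reducer simply concatenates the per-block ensembles into one mega-ensemble.

```python
# A minimal sketch of the COMET recipe (illustrative only; names and the
# error estimate are assumptions, not the paper's implementation).
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def majority_vote(ensemble, X):
    """Combine tree predictions by unweighted majority vote."""
    votes = np.stack([tree.predict(X) for tree in ensemble])
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)


def ivote_bite(X, y, ensemble, bite_size, rng):
    """Draw one IVoting training subset ('bite'): misclassified examples are
    always accepted; correctly classified ones are accepted with probability
    e / (1 - e), where e is the ensemble's current error (a simplified
    in-sample estimate here)."""
    if not ensemble:
        idx = rng.choice(len(X), size=bite_size, replace=True)
        return X[idx], y[idx]
    preds = majority_vote(ensemble, X)
    e = float(np.mean(preds != y))
    if e <= 0.0 or e >= 1.0:
        idx = rng.choice(len(X), size=bite_size, replace=True)
        return X[idx], y[idx]
    accept_correct = e / (1.0 - e)
    chosen = []
    while len(chosen) < bite_size:
        i = int(rng.integers(len(X)))
        if preds[i] != y[i] or rng.random() < accept_correct:
            chosen.append(i)
    return X[chosen], y[chosen]


def map_train_block(X_block, y_block, n_trees=10, bite_size=256, seed=0):
    """Mapper: learn a small randomized-tree ensemble on one data block."""
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(n_trees):
        Xb, yb = ivote_bite(X_block, y_block, ensemble, bite_size, rng)
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(2**31)))
        ensemble.append(tree.fit(Xb, yb))
    return ensemble


def reduce_merge(per_block_ensembles):
    """Reducer: the mega-ensemble is just the union of all block ensembles."""
    return [tree for block in per_block_ensembles for tree in block]
```

Because prediction is a plain majority vote over all trees, the reduce step needs no retraining: merging ensembles is just concatenation, which is what keeps the algorithm single-pass.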

Cited by 31 publications (19 citation statements)
References 24 publications
“…On all datasets, this setting yields very poor performance when M_max is low. Building base estimators on re-sampled random patches thus brings a clear advantage to RP, RS and P, and hence confirms the conclusions of Basilico et al., who showed in [8] that using more data indeed produces more accurate models than learning from a single subsample. This latter experiment furthermore shows that the good performance of RP cannot be trivially attributed to the fact that our datasets contain so many instances that only processing a subsample of them would be enough.…”
Section: Memory Reduction With Loss (supporting)
confidence: 70%
“…The first benefit of RP is that it generalizes both the Pasting Rvotes (P) method [1] (and its extensions [7,8]) and the Random Subspace (RS) algorithm [2]. Both are indeed merely particular cases of RP: setting p_s = 1.0 yields RS, while setting p_f = 1.0 yields P. As such, it is expected that when both hyper-parameters p_s and p_f are tuned, RP should be at least as good as the best of the two methods, provided there is no overfitting associated with this tuning.…”
Section: Related Work (mentioning)
confidence: 99%
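The relationship described in the passage above can be seen directly in how a single "patch" is drawn. The sketch below is a hypothetical illustration (the helper name and the choice of base estimator are assumptions, not the cited paper's code): a fraction p_s of the instances and a fraction p_f of the features are sampled, and the two special cases fall out of the sampling fractions.

```python
# Illustrative sketch of drawing one "random patch" (hypothetical helper,
# not the cited paper's code): sample a fraction p_s of the instances and a
# fraction p_f of the features, then fit a base estimator on that patch.
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def fit_on_patch(X, y, p_s, p_f, rng):
    n, d = X.shape
    rows = rng.choice(n, size=max(1, int(p_s * n)), replace=False)
    cols = rng.choice(d, size=max(1, int(p_f * d)), replace=False)
    tree = DecisionTreeClassifier().fit(X[np.ix_(rows, cols)], y[rows])
    return tree, cols  # cols must be kept to project data at predict time

# Special cases noted in the quoted passage:
#   p_s = 1.0  ->  Random Subspaces (all instances, a random feature subset)
#   p_f = 1.0  ->  Pasting Rvotes   (a random instance subset, all features)
```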
“…[2,26] communicate the histograms re-built for each layer of tree nodes to a master worker for tree induction. [1] is a MapReduce algorithm which builds multiple random forest ensembles on distributed blocks of data and merges them into a mega-ensemble. In [11] ScalParC employs a distributed hash table to implement the splitting phase for classification problems.…”
Section: Related Work (mentioning)
confidence: 99%
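As a rough illustration of the histogram-based approach mentioned first in the passage above (an assumption-laden sketch, not the cited systems' code): each worker summarizes a feature of its local block as fixed-bin, per-class counts, and the master sums the workers' histograms and scans the bin boundaries for the lowest-impurity split.

```python
# Sketch of histogram-based distributed split search (illustrative assumptions
# throughout: shared bin edges, Gini impurity, a single feature).
import numpy as np


def local_histogram(x, y, bin_edges, classes):
    """Worker side: per-class counts of feature x over shared bin edges."""
    return np.stack([np.histogram(x[y == c], bins=bin_edges)[0]
                     for c in classes])


def best_split_from_histograms(worker_hists, bin_edges):
    """Master side: sum worker histograms, pick the Gini-best bin boundary."""
    total = np.sum(worker_hists, axis=0)          # shape: (n_classes, n_bins)
    best_thr, best_score = None, np.inf
    for b in range(1, total.shape[1]):
        left, right = total[:, :b].sum(axis=1), total[:, b:].sum(axis=1)
        score = sum(side.sum() * (1.0 - ((side / max(side.sum(), 1)) ** 2).sum())
                    for side in (left, right))    # size-weighted Gini impurity
        if score < best_score:
            best_thr, best_score = bin_edges[b], score
    return best_thr
```

Only the histograms, a handful of counts per worker and feature, cross the network for each layer of tree nodes, which is what makes this communication pattern attractive at scale.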
“…Their popularity stems from the ability to (a) select, from the set of all attributes, a subset that is most relevant for the regression and classification problem at hand; (b) identify complex, non-linear correlations between attributes; and (c) provide highly interpretable and human-readable models [7,17,19,25]. Recently, due to the increasing amount of available data and the ubiquity of distributed computation platforms and clouds, there is a rapidly growing interest in designing distributed versions of regression and classification trees [1,2,17,21,26,28], for instance, the decision/regression tree in the Apache Spark MLlib machine learning package. Meanwhile, since many of the large datasets are from observations and measurements of physical entities and events, such data is inevitably noisy and skewed, in part due to equipment malfunctions or abnormal events [10,12,27].…”
Section: Introduction (mentioning)
confidence: 99%
“…[59]. Recent work in the literature has shown that MapReduce can be utilized to scale tasks in semantic classification [58] [60,61]. In this work, MapReduce is employed to speed up the IF-MCA algorithm.…”
Section: Feature Selection (mentioning)
confidence: 99%