2010 IEEE International Conference on Granular Computing
DOI: 10.1109/grc.2010.54

Parallel Simultaneous Co-clustering and Learning with Map-Reduce

Abstract: Many data mining applications involve predictive modeling of very large, complex datasets. Such applications present a need for innovative algorithms and associated implementations that are not only effective in terms of prediction accuracy, but can also be efficiently run on distributed computational systems to yield results in reasonable time. This paper focuses on predictive modeling of multirelational data such as dyadic data with associated covariates or "side-information". We first give illustrative exam…

Cited by 11 publications (7 citation statements) | References 16 publications

Citation statements, ordered by relevance:
“…The complexities of data distribution, parallel computation, and resource scheduling are managed by the MapReduce framework [3]. In contrast to many previous uses of MapReduce to scale up machine learning that require multiple passes over the data [4]-[9], this approach requires only a single pass (single MapReduce step) to construct the entire ensemble. This minimizes disk I/O and the overhead of setting up and shutting down MapReduce jobs.…”
Section: Introduction
confidence: 99%
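The single-pass construction this quote describes is easy to picture in code. The sketch below is illustrative only, not the cited implementation: each map task trains one base learner on its local data shard, and a lone reduce step collects the fitted learners into an ensemble, so the training data is read exactly once. The map/reduce phases are simulated with plain Python functions, and the function names, the use of scikit-learn decision trees, and the averaging combiner are all assumptions made for the example; a real deployment would run these as Hadoop MapReduce tasks.

```python
# Minimal sketch of single-pass ensemble construction in MapReduce style.
# Hypothetical names; not the authors' code.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def map_task(partition):
    """Map phase: fit one base learner on the local data shard only."""
    X, y = partition
    model = DecisionTreeRegressor(max_depth=5).fit(X, y)
    # Emit every model under one key so a single reducer collects them all.
    return ("ensemble", model)

def reduce_task(key, models):
    """Reduce phase: the ensemble is simply the bag of fitted base models."""
    return list(models)

def ensemble_predict(models, X):
    """Combine the base models by averaging their predictions."""
    return np.mean([m.predict(X) for m in models], axis=0)

# Simulated driver: shard the data, run the maps, run the single reduce.
rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=3000)
shards = [(X[i::3], y[i::3]) for i in range(3)]    # three "map" inputs

mapped = [map_task(shard) for shard in shards]     # one pass over the data
ensemble = reduce_task("ensemble", (model for _, model in mapped))
print(ensemble_predict(ensemble, X[:5]))
```

Because each shard is visited by exactly one mapper and the reducer only aggregates already-fitted models, the entire ensemble is built in a single MapReduce step, which is the property the quoted passage highlights.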
“…A prediction model is fit to each co-cluster. Deodhar et al. [20] also present a parallel version of the SCOAL algorithm. Our paper is inspired by these works but additionally addresses fuzziness.…”
Section: Parallel Machine-learning Algorithms
confidence: 99%
“…In addition, we discuss parallelization of a pre-clustering technique called Canopy clustering [17]. We then cover MapReduce algorithms for hierarchical clustering [22], density-based clustering [11] and co-clustering [9,21].…”
Section: Data Mining
confidence: 99%