2014
DOI: 10.1016/j.ins.2014.03.043
On the use of MapReduce for imbalanced big data using Random Forest

Cited by 258 publications (78 citation statements)
References 43 publications
“…SMOTE-based oversampling methods applied in distributed environments such as MapReduce tend to fail [13]. This can be caused by the random partitioning of the data across mappers, which leads to artificial samples being generated from real objects that have no spatial relationship to one another.…”
Section: Imbalanced Big Data (mentioning)
Confidence: 99%
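A small sketch can make the failure mode in the excerpt above concrete. The snippet below is an illustration only (it is not code from the cited work, and the number of mappers is a hypothetical parameter): it randomly partitions an imbalanced dataset and then applies SMOTE independently inside each partition, so synthetic points are interpolated between minority objects that may not be true neighbours in the full dataset. It assumes scikit-learn and imbalanced-learn are installed.

# Sketch: per-partition SMOTE under random partitioning (assumed setup, not
# the cited paper's code). Requires scikit-learn and imbalanced-learn.
import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Imbalanced two-class dataset (about 5% minority).
X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=0)

n_mappers = 4                                    # hypothetical number of map tasks
rng = np.random.RandomState(0)
partitions = np.array_split(rng.permutation(len(X)), n_mappers)  # random split, as in MapReduce

for m, idx in enumerate(partitions):
    X_m, y_m = X[idx], y[idx]
    # Each "mapper" oversamples using only its local minority points, so the
    # synthetic samples interpolate between objects that may have no spatial
    # relationship in the complete dataset.
    X_res, y_res = SMOTE(k_neighbors=3, random_state=0).fit_resample(X_m, y_m)
    print(f"mapper {m}: {np.bincount(y_m)} -> {np.bincount(y_res)}")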
“…For this problem, two pre-processing algorithms were applied. First, the Random OverSampling (ROS) algorithm used in [18] was applied to replicate minority-class instances from the original dataset until the number of instances in both classes was equal, totaling 65 million instances. Finally, for the DITFS algorithm, the dataset was discretized using the Minimum Description Length Principle (MDLP) discretizer [19].…”
Section: Results (mentioning)
Confidence: 99%
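For readers unfamiliar with the ROS step in the excerpt, the following sketch shows the general idea under simple assumptions (a two-class NumPy dataset and plain replication with replacement; it is not the cited implementation): minority-class rows are drawn at random and appended until both classes contain the same number of instances.

# Sketch of Random OverSampling (ROS) for a two-class problem (assumed,
# simplified version; not the cited implementation).
import numpy as np

def random_oversample(X, y, seed=0):
    # Replicate minority-class rows at random until class counts are equal.
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    extra = rng.choice(np.flatnonzero(y == minority), size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

# Toy usage: 9 majority vs. 3 minority instances become 9 vs. 9.
X = np.arange(24).reshape(12, 2)
y = np.array([0] * 9 + [1] * 3)
X_bal, y_bal = random_oversample(X, y)
print(np.bincount(y_bal))   # [9 9]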
“…Using no reducer at all is also possible. For example, del Río et al. [63] used only multiple mappers (no reducer) for network intrusion detection with rForest. Since several decision trees are generated across the mappers, they considered using all of their outcomes (i.e.…”
Section: Implementation Examples (mentioning)
Confidence: 99%
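The map-only design described in the excerpt can be sketched as follows, assuming scikit-learn's RandomForestClassifier as a stand-in for the paper's Hadoop-based forest and a hypothetical number of map tasks: each mapper trains an independent forest on its own data partition, and since there is no reducer, classification simply aggregates the outputs of all local forests (here by averaging their class probabilities, a soft vote over every tree produced by every mapper).

# Sketch of a "map-only" Random Forest (assumed setup; not del Río et al.'s
# Hadoop implementation). Requires scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=6_000, weights=[0.9, 0.1], random_state=0)
X_train, y_train, X_test, y_test = X[:5000], y[:5000], X[5000:], y[5000:]

n_mappers = 4                                    # hypothetical number of map tasks
parts = np.array_split(np.random.RandomState(0).permutation(len(X_train)), n_mappers)

# "Map" phase: one independent forest per data partition.
forests = [
    RandomForestClassifier(n_estimators=25, random_state=m).fit(X_train[idx], y_train[idx])
    for m, idx in enumerate(parts)
]

# No reduce phase: average the class probabilities of all local forests,
# i.e. a soft vote over every tree generated by every mapper.
probs = np.mean([f.predict_proba(X_test) for f in forests], axis=0)
y_pred = probs.argmax(axis=1)
print("accuracy:", (y_pred == y_test).mean())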