SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big Data

Basgall, María José; Hasperué, Waldo; Naiouf, Marcelo; Fernández, Alberto; Herrera, Francisco

doi:10.24215/16666038.18.e23

Cited by 25 publications

(10 citation statements)

References 11 publications

“…Regarding the SMOTE algorithm, in [16] a global SMOTE fully scalable solution was described, called SMOTE-BD. In order to cope with the potential data partitioning problems, the whole neighborhood of each minority class instance is taken into account.…”

Section: Imbalanced Classification In Big Data (mentioning)

confidence: 99%

An Analysis of Local and Global Solutions to Address Big Data Imbalanced Classification: A Case Study with SMOTE Preprocessing

Basgall

Hasperué

Naiouf

et al. 2019

Communications in Computer and Information Science

Self Cite

Addressing the huge amount of data continuously generated is an important challenge in the Machine Learning field. The need to adapt the traditional techniques or create new ones is evident. To do so, distributed technologies have to be used to deal with the significant scalability constraints due to the Big Data context. In many Big Data applications for classification, there are some classes that are highly underrepresented, leading to what is known as the imbalanced classification problem. In this scenario, learning algorithms are often biased towards the majority classes, treating minority ones as outliers or noise. Consequently, preprocessing techniques to balance the class distribution were developed. This can be achieved by suppressing majority instances (undersampling) or by creating minority examples (oversampling). Regarding the oversampling methods, one of the most widespread is the SMOTE algorithm, which creates artificial examples according to the neighborhood of each minority class instance. In this work, our objective is to analyze the SMOTE behavior in Big Data as a function of some key aspects such as the oversampling degree, the neighborhood value and, specially, the type of distributed design (local vs. global).
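As a rough illustration of the SMOTE idea mentioned in the abstract above (artificial minority examples built from each instance's neighborhood), the following is a minimal single-machine sketch. It is not the distributed SMOTE-BD implementation; the function name `smote_sketch`, its parameters, and the toy data are assumptions made only for this example.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between a minority
    instance and one of its k nearest minority neighbours.

    Hypothetical helper for illustration only; not the SMOTE-BD code.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority instances")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Candidate neighbours: every other minority instance.
        others = [p for p in minority if p is not base]
        # Keep the k closest ones by Euclidean distance.
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Random position on the segment from base to the chosen neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with four 2-D instances (made-up values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_synthetic=6, k=2)
```

Here the oversampling degree corresponds to `n_synthetic` and the neighborhood value to `k`. Because the neighbour search scans the whole minority set, the sketch is closer in spirit to a global design than to a per-partition (local) one, though it makes no attempt at distribution.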

“…Their work was compared with various oversampling techniques on imbalanced low- and high-dimensional datasets, achieving a promising result to guarantee performance in constructing NLP application. Later, Maria et al. [21] proposed a SMOTE-BD method to tackle the problem of imbalanced classification in big data. Their proposed scalable approach for imbalanced classification in big data is constructed on the basis of SMOTE algorithm, which helps create new synthetic instances according to the neighborhood of minority class sample.…”

Section: Smote Methods (mentioning)

confidence: 99%

SMOTE-Boost-based sparse Bayesian model for flood prediction

Ding

Feng

2020

J Wireless Com Network

With a significant development of big data analysis and cloud-fog-edge computing, human-centered computing (HCC) has been a hot research topic worldwide. Essentially, HCC is a cross-disciplinary research domain, in which the core idea is to build an efficient interaction among persons, cyber space, and real world. Inspired by the improvement of HCC on big data analysis, we intend to involve related core and technologies to help solve one of the most important issues in the real world, i.e., flood prediction. To minimize the negative impacts brought by floods, researchers pay special attention to improve the accuracy of flood forecasting with quantity of technologies including HCC. However, historical flood data is essentially imbalanced. Imbalanced data causes machine learning classifiers to be more biased towards patterns with majority samples, resulting in poor classification of pattern with minority samples. In this paper, we propose a novel Synthetic Minority Over-sampling Technique (SMOTE)-Boost-based sparse Bayesian model to perform flood prediction with both high accuracy and robustness. The proposed model consists of three modules, namely, SMOTE-based data enhancement, AdaBoost training strategy, and sparse Bayes model construction. In SMOTE-based data enhancement, we adopt a SMOTE algorithm to effectively cover diverse data modes and generate more samples for prediction pattern with minority samples, which greatly alleviates the problem of imbalanced data by involving experts' analysis and users' intentions. During AdaBoost training strategy, we propose a specifically designed AdaBoost training strategy for sparse Bayesian model, which not only adaptively and inclemently increases prediction ability of Bayesian model, but also prevents its over-fitting performance. Essentially, the design of AdaBoost strategy helps keep balance between prediction ability and model complexity, which offers different but effective models over diverse rivers and users. Finally, we construct a sparse Bayesian model based on AdaBoost training strategy, which could offer flood prediction results with high rationality and robustness. We demonstrate the accuracy and effectiveness of the proposed model for flood prediction by conducting experiments on a collected dataset with several comparative methods.

“…The scalability constraints regarding the volume of data adjacent to Big Data (BD) security appliances, the inherent complexity of data centers workflows [36] and the properties of nonstructured information [37] are attainable by implementing appropriate preprocessing stages, especially if SL algorithms only consider overall accuracy without taking into account relative class distribution. Random Oversampling for Big Data (ROS-BigData), Random Undersampling for Big Data (RUS-BigData), and Map Reduce (MR) are some methods responsible for resampling extensive concentrations of evenly distributed data.…”

Section: Related Work (mentioning)

confidence: 99%

“…Random Oversampling for Big Data (ROS-BigData), Random Undersampling for Big Data (RUS-BigData), and Map Reduce (MR) are some methods responsible for resampling extensive concentrations of evenly distributed data. As the authors described in [37], such techniques were applied by unifying a SMOTE variation for BD, obtaining, at its best, a favorable number of synthetic samples, avoiding some overgeneralization shortcomings to which SL are susceptible when handling a vast number of observations. Analogously, in [38], SMOTE was optimized by three major adjustments:…”

Section: Related Work (mentioning)

confidence: 99%

Synthetic Minority Oversampling Technique for Optimizing Classification Tasks in Botnet and Intrusion-Detection-System Datasets

et al. 2020

Presently, security is a hot research topic due to the impact in daily information infrastructure. Machine-learning solutions have been improving classical detection practices, but detection tasks employ irregular amounts of data since the number of instances that represent one or several malicious samples can significantly vary. In highly unbalanced data, classification models regularly have high precision with respect to the majority class, while minority classes are considered noise due to the lack of information that they provide. Well-known datasets used for malware-based analyses like botnet attacks and Intrusion Detection Systems (IDS) mainly comprise logs, records, or network-traffic captures that do not provide an ideal source of evidence as a result of obtaining raw data. As an example, the numbers of abnormal and constant connections generated by either botnets or intruders within a network are considerably smaller than those from benign applications. In most cases, inadequate dataset design may lead to the downgrade of a learning algorithm, resulting in overfitting and poor classification rates. To address these problems, we propose a resampling method, the Synthetic Minority Oversampling Technique (SMOTE) with a grid-search algorithm optimization procedure. This work demonstrates classification-result improvements for botnet and IDS datasets by merging synthetically generated balanced data and tuning different supervised-learning algorithms.
