2010
DOI: 10.1007/978-3-642-13022-9_54
Exploring the Performance of Resampling Strategies for the Class Imbalance Problem

Abstract: The present paper studies the influence of two distinct factors on the performance of some resampling strategies for handling imbalanced data sets. In particular, we focus on the nature of the classifier used, along with the ratio between minority and majority classes. Experiments using eight different classifiers show that the most significant differences are for data sets with low or moderate imbalance: over-sampling clearly appears as better than under-sampling for local classifiers, whereas some …
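
A hedged sketch of the two resampling families the abstract compares, using NumPy; the class sizes and the 10:1 imbalance ratio are illustrative assumptions, not the paper's experimental setup.

import numpy as np

rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(1000, 2))   # majority class (illustrative)
X_min = rng.normal(2.0, 1.0, size=(100, 2))    # minority class, 10:1 ratio

# Random over-sampling: replicate minority samples until the classes are even.
idx = rng.integers(0, len(X_min), size=len(X_maj))
X_over = np.vstack([X_maj, X_min[idx]])
y_over = np.concatenate([np.zeros(len(X_maj)), np.ones(len(X_maj))])

# Random under-sampling: discard majority samples down to the minority size.
idx = rng.choice(len(X_maj), size=len(X_min), replace=False)
X_under = np.vstack([X_maj[idx], X_min])
y_under = np.concatenate([np.zeros(len(X_min)), np.ones(len(X_min))])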

Cited by 21 publications (10 citation statements) | References 19 publications
“…Data-level methods involve procedures applied to the training data to make the class distribution more balanced by reducing the number of samples in the majority classes or increasing the number of samples in the minority classes [18]. At present, data-level methods act mainly in the data preprocessing stage, using resampling to redistribute the training data of the different classes in the data space [19,20].…”
Section: Data-level Methods
confidence: 99%
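
A minimal sketch of this preprocessing-stage resampling, assuming the imbalanced-learn library (one common choice; it is not named in the excerpt) and a synthetic 95:5 data set:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Illustrative imbalanced data set (95% majority, 5% minority).
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print(Counter(y))

# Redistribute the training data of each class before any classifier is trained.
X_up, y_up = RandomOverSampler(random_state=0).fit_resample(X, y)       # grow minority
X_down, y_down = RandomUnderSampler(random_state=0).fit_resample(X, y)  # shrink majority
print(Counter(y_up), Counter(y_down))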
“…Although 118 primary studies were reviewed in this SLR, which were identified based on clear evidence regarding techniques applied to address key data-related issues, the number of SLR studies on data preprocessing identified in this review is discou...…”
[Interleaved table residue omitted: QA evaluation scores (QA1–QA4 and total score) for primary studies [169]–[208], including Table 25, QA evaluation part 7; e.g. [173]: N Y N Y → 2.0, [183]: N P N Y → 1.5, [189]: N Y P Y → 2.5.]
Section: Results
confidence: 99%
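
The total scores in the interleaved QA table are consistent with a simple per-question mapping of Y = 1.0, P = 0.5, N = 0.0 summed over QA1–QA4; this mapping is inferred from the listed rows, not stated in the excerpt. A small sketch:

# QA total score as the sum of per-question ratings (mapping inferred, not stated).
SCORES = {"Y": 1.0, "P": 0.5, "N": 0.0}

def qa_total(ratings):
    # ratings: e.g. ["N", "Y", "P", "Y"] for QA1..QA4
    return sum(SCORES[r] for r in ratings)

assert qa_total(["N", "Y", "N", "Y"]) == 2.0   # matches study [173]
assert qa_total(["N", "P", "N", "Y"]) == 1.5   # matches study [183]
assert qa_total(["N", "Y", "P", "Y"]) == 2.5   # matches study [189]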
“…The sequence of steps used to process each data set is listed in Algorithm 1 and described as follows: Data set balancing: as stated in [49], the samples within network captures are considerably smaller than those from benign applications, leading to the possibility of overfitting and classification downgrading. That being the case, algorithm estimations may always generalize the majority class features, overlapping the minority ones [50]; for example, [51] emphasized the importance of data set balancing for a cervical cancer prediction model (CCPM) that uses risk factors as inputs. In this case, the authors balanced their data set with the synthetic minority over-sampling technique (SMOTE), owing to their use of a Random Forest classifier.…”
Section: Proposed Framework
confidence: 99%
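
A minimal sketch of the SMOTE-plus-Random-Forest combination mentioned in the excerpt, assuming scikit-learn and imbalanced-learn; the synthetic data and parameter values are illustrative, not the cited authors' setup.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Illustrative imbalanced data set (90% majority, 10% minority).
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Balance only the training split with SMOTE, then fit the Random Forest.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_bal, y_bal)
print(clf.score(X_te, y_te))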
“…After the conversion, each series was replaced by values between 0 (representing the absence of addresses) and 1 (active values). Scaling: numerical features were standardized to guarantee equal weights during the learning process [50]. Specifically, standard scaling was used on each numerical feature, x, to center it on its mean, μ, and scale it with respect to the standard deviation, σ, as shown in Equation (1).…”
Section: Proposed Framework
confidence: 99%
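
The standard scaling referred to as Equation (1) in the excerpt is the usual z-score transform z = (x − μ) / σ; a minimal sketch with scikit-learn's StandardScaler, on an illustrative feature column:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])          # illustrative numerical feature
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)      # (x - mu) / sigma
z_sklearn = StandardScaler().fit_transform(X)        # same transform
assert np.allclose(z_manual, z_sklearn)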