2010
DOI: 10.1007/978-3-642-13022-9_54
Exploring the Performance of Resampling Strategies for the Class Imbalance Problem

Abstract: The present paper studies the influence of two distinct factors on the performance of some resampling strategies for handling imbalanced data sets. In particular, we focus on the nature of the classifier used, along with the ratio between minority and majority classes. Experiments using eight different classifiers show that the most significant differences are for data sets with low or moderate imbalance: over-sampling clearly appears as better than under-sampling for local classifiers, whereas some …
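
A hedged sketch of the two resampling families the abstract compares, using NumPy; the class sizes and the 10:1 imbalance ratio are illustrative assumptions, not the paper's experimental setup.

import numpy as np

rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(1000, 2))   # majority class (illustrative)
X_min = rng.normal(2.0, 1.0, size=(100, 2))    # minority class, 10:1 ratio

# Random over-sampling: replicate minority samples until the classes are even.
idx = rng.integers(0, len(X_min), size=len(X_maj))
X_over = np.vstack([X_maj, X_min[idx]])
y_over = np.concatenate([np.zeros(len(X_maj)), np.ones(len(X_maj))])

# Random under-sampling: discard majority samples down to the minority size.
idx = rng.choice(len(X_maj), size=len(X_min), replace=False)
X_under = np.vstack([X_maj[idx], X_min])
y_under = np.concatenate([np.zeros(len(X_min)), np.ones(len(X_min))])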

Cited by 21 publications (10 citation statements) | References 19 publications
“…Data-level methods involve procedures applied to the training data to make the class distribution more balanced by reducing the number of samples in the majority classes or increasing the number of samples in the minority classes [18]. At present, data-level methods act mainly in the data preprocessing stage, using resampling to redistribute the training data of the different classes in the data space [19,20].…”
Section: Data-level Methods
confidence: 99%
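
A minimal sketch of this preprocessing-stage resampling, assuming the imbalanced-learn library (one common choice; it is not named in the excerpt) and a synthetic 95:5 data set:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Illustrative imbalanced data set (95% majority, 5% minority).
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print(Counter(y))

# Redistribute the training data of each class before any classifier is trained.
X_up, y_up = RandomOverSampler(random_state=0).fit_resample(X, y)       # grow minority
X_down, y_down = RandomUnderSampler(random_state=0).fit_resample(X, y)  # shrink majority
print(Counter(y_up), Counter(y_down))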
“…Although 118 primary studies were reviewed in this SLR, which were identified based on clear evidence regarding techniques applied to address key data-related issues, the number of SLR studies on data preprocessing identified in this review is discou...…”
[Interleaved table residue omitted: QA evaluation scores (QA1–QA4 and total score) for primary studies [169]–[208], including Table 25, QA evaluation part 7; e.g. [173]: N Y N Y → 2.0, [183]: N P N Y → 1.5, [189]: N Y P Y → 2.5.]
Section: Results
confidence: 99%
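
The total scores in the interleaved QA table are consistent with a simple per-question mapping of Y = 1.0, P = 0.5, N = 0.0 summed over QA1–QA4; this mapping is inferred from the listed rows, not stated in the excerpt. A small sketch:

# QA total score as the sum of per-question ratings (mapping inferred, not stated).
SCORES = {"Y": 1.0, "P": 0.5, "N": 0.0}

def qa_total(ratings):
    # ratings: e.g. ["N", "Y", "P", "Y"] for QA1..QA4
    return sum(SCORES[r] for r in ratings)

assert qa_total(["N", "Y", "N", "Y"]) == 2.0   # matches study [173]
assert qa_total(["N", "P", "N", "Y"]) == 1.5   # matches study [183]
assert qa_total(["N", "Y", "P", "Y"]) == 2.5   # matches study [189]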
“…The sequence of steps used to process each data set is listed in Algorithm 1 and described as follows: Data set balancing: as stated in [49], the samples within network captures are considerably smaller than those from benign applications, leading to the possibility of overfitting and classification downgrading. That being the case, algorithm estimations may always generalize the majority class features, overlapping the minority ones [50]; for example, [51] emphasized the importance of data set balancing for a cervical cancer prediction model (CCPM) that uses risk factors as inputs. In this case, the authors balanced their data set with the synthetic minority over-sampling technique (SMOTE), owing to their use of a Random Forest classifier.…”
Section: Proposed Framework
confidence: 99%
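
A minimal sketch of the SMOTE-plus-Random-Forest combination mentioned in the excerpt, assuming scikit-learn and imbalanced-learn; the synthetic data and parameter values are illustrative, not the cited authors' setup.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Illustrative imbalanced data set (90% majority, 10% minority).
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Balance only the training split with SMOTE, then fit the Random Forest.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_bal, y_bal)
print(clf.score(X_te, y_te))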
“…After the conversion, each series was replaced by values between 0 (representing the absence of addresses) and 1 (active values). Scaling: numerical features were standardized to guarantee equal weights during the learning process [50]. Specifically, standard scaling was used on each numerical feature, x, to center it on its mean, μ, and scale it with respect to the standard deviation, σ, as shown in Equation (1).…”
Section: Proposed Framework
confidence: 99%
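
The standard scaling referred to as Equation (1) in the excerpt is the usual z-score transform z = (x − μ) / σ; a minimal sketch with scikit-learn's StandardScaler, on an illustrative feature column:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])          # illustrative numerical feature
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)      # (x - mu) / sigma
z_sklearn = StandardScaler().fit_transform(X)        # same transform
assert np.allclose(z_manual, z_sklearn)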