2017
DOI: 10.1515/fcds-2017-0007

Difficulty Factors and Preprocessing in Imbalanced Data Sets: An Experimental Study on Artificial Data

Abstract: In this paper we describe results of an experimental study where we checked the impact of various difficulty factors in imbalanced data sets on the performance of selected classifiers applied alone or combined with several preprocessing methods. In the study we used artificial data sets in order to systematically check factors such as dimensionality, class imbalance ratio or distribution of specific types of examples (safe, borderline, rare and outliers) in the minority class. The results revealed that the lat…
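The abstract names four types of minority examples but does not say how they are determined. One approach used in the imbalanced-learning literature labels each minority example by how many of its k nearest neighbours share its class. The sketch below is an illustration under assumptions, not the study's procedure: it assumes k = 5, Euclidean distance, and the thresholds 4-5 (safe), 2-3 (borderline), 1 (rare) and 0 (outlier). The function name `label_minority_types` is hypothetical.

```python
import math


def label_minority_types(points, labels, minority, k=5):
    """Tag each minority example as safe, borderline, rare or outlier.

    Assumed rule (not taken from the abstract): count how many of the
    k nearest neighbours carry the minority label. The thresholds below
    only make sense for k = 5.
    """
    types = {}
    for i, (p, y) in enumerate(zip(points, labels)):
        if y != minority:
            continue
        # Distances from example i to every other example, nearest first.
        others = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        same = sum(1 for _, j in others[:k] if labels[j] == minority)
        if same >= 4:
            types[i] = "safe"
        elif same >= 2:
            types[i] = "borderline"
        elif same == 1:
            types[i] = "rare"
        else:
            types[i] = "outlier"
    return types


# Toy data: a tight minority cluster, a majority cluster, and one
# minority example placed inside the majority cluster.
minority_cluster = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (0.05, 0.05), (0.2, 0.2)]
majority_cluster = [(10, 10), (10.1, 10), (10, 10.1), (10.1, 10.1), (10.05, 10.05), (9.9, 9.9)]
points = minority_cluster + majority_cluster + [(10.02, 10.02)]
labels = [1] * 6 + [0] * 6 + [1]

types = label_minority_types(points, labels, minority=1)
```

On this toy data the six clustered minority examples have only minority neighbours, while the last example has only majority neighbours. Counting the labels in `types` gives a simple description of the "distribution of example types" that the abstract treats as a difficulty factor.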

Cited by 21 publications (16 citation statements)
References 28 publications
“…Following the same line, Skryjomski & Krawczyk (2017) analyzed the structure of the minority class to transform the SMOTE algorithm into a selective over-sampling method focused on certain types of positive examples. Using two artificial data sets with different dimensions and imbalance ratios, Wojciechowski & Wilk (2017) found out that the critical factor affecting the true-positive rate was the distribution of sample types, while the impact of dimensionality and imbalance ratio was limited. Similarly, Stefanowski (2016) concluded that the performance of the most representative preprocessing approaches depends on the dominating type of minority examples.…”
Section: Distribution-based Data Irregularities
Citation type: mentioning
confidence: 99%
“…[4] has applied ML to some simple artificial datasets. For rather specific ML questions, artificial data have been used in [5, 6].…”
Section: State Of the Art
Citation type: mentioning
confidence: 99%
“…Data science comprises the preparation, analysis, and processing of both organized and unstructured massive information [16]. Evidence explaining real-world classification difficulties reveals imbalanced distribution in which one of the categories of decisions is underrepresented, often strongly compared to the other class [17]. Larger number of samples from one group would result in a classifier biased to the majority class [18].…”
Section: Data Imbalanced In the Research Scenario
Citation type: mentioning
confidence: 99%