2004
DOI: 10.1007/978-3-540-27868-9_88

The Imbalanced Training Sample Problem: Under or over Sampling?

Abstract: The problem of imbalanced training sets in supervised pattern recognition methods is receiving growing attention. An imbalanced training sample means that one class is represented by a large number of examples while the other is represented by only a few. It has been observed that this situation, which arises in several practical domains, may produce an important deterioration of the classification accuracy, in particular with patterns belonging to the less represented classes. In this paper we present …

Cited by 190 publications (120 citation statements)
References 17 publications
“…This can be done by either over-sampling the minority class or under-sampling the majority class until both classes are approximately equally represented. Despite their advantages and popularity, both strategies have shortcomings because they artificially alter the prior class probabilities (Barandela et al., 2004): under-sampling may throw away potentially useful information about the majority class and perturbs the a priori probability of the training set (Dal Pozzolo et al., 2015), while over-sampling worsens the computational burden of most classifiers, may increase the likelihood of overfitting, and can introduce noise that results in a loss of performance.…”
Section: Resampling Methods To Handle Class Imbalance
confidence: 99%
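As a concrete illustration of the two baseline strategies this statement contrasts, here is a minimal NumPy sketch of random under-sampling and random over-sampling for a binary problem. The function names are illustrative, not from the cited paper, and assume the majority class is at least as large as the minority class.

```python
import numpy as np

def random_undersample(X, y, majority_label, seed=None):
    """Drop random majority-class rows until both classes are the same size.
    Risks discarding useful majority-class information (see the quote above)."""
    rng = np.random.default_rng(seed)
    maj = np.flatnonzero(y == majority_label)
    mino = np.flatnonzero(y != majority_label)
    keep = rng.choice(maj, size=mino.size, replace=False)  # discard the rest
    idx = np.concatenate([keep, mino])
    return X[idx], y[idx]

def random_oversample(X, y, minority_label, seed=None):
    """Duplicate random minority-class rows until both classes are the same size.
    Exact duplicates raise the computational burden and the risk of overfitting."""
    rng = np.random.default_rng(seed)
    mino = np.flatnonzero(y == minority_label)
    maj = np.flatnonzero(y != minority_label)
    extra = rng.choice(mino, size=maj.size - mino.size, replace=True)
    idx = np.concatenate([maj, mino, extra])
    return X[idx], y[idx]
```

Both routines equalise the two class counts, which is exactly the alteration of the prior class probabilities the quoted passage warns about.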
“…In this case, if the label of the current example does not match the labels of its k-nearest neighbours, the current example is eliminated. It is worth noting that the majority class shrinks only slightly in its number of examples when the method searches for nearest neighbours within the majority class [38]. The experiments were carried out after this preprocessing had been applied to the data sets. Thus, Wilson editing eliminates the patterns that lie near the decision boundary.…”
Section: Preprocessing Methods
confidence: 99%
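The rule described in this statement is Wilson's editing (the edited nearest neighbour, ENN, rule). A minimal sketch, assuming a data set small enough for a dense pairwise-distance matrix; the function name is illustrative:

```python
import numpy as np

def wilson_editing(X, y, k=3):
    """Remove every example whose label disagrees with the majority label
    of its k nearest neighbours (the example itself excluded)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                         # a point is not its own neighbour
    nn = np.argsort(d2, axis=1)[:, :k]                   # k nearest neighbours per example
    keep = np.empty(len(y), dtype=bool)
    for i, nbrs in enumerate(nn):
        labels, counts = np.unique(y[nbrs], return_counts=True)
        keep[i] = labels[np.argmax(counts)] == y[i]      # keep only locally consistent points
    return X[keep], y[keep]
```

Because the points whose labels disagree with their neighbourhoods cluster near the class boundary, editing tends to smooth the decision boundary rather than rebalance the class sizes, which matches the quoted observation that the majority class shrinks only slightly.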
“…Several approaches have been proposed [26,71]. As described in section 3.2, the penalty for misclassified positive points should be increased to make false negative errors costlier than false positive ones. The implementation of this technique depends on the learning method.…”
Section: New Kernels For Support Vector Machines
confidence: 99%
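A common way to realise this asymmetric penalty without changing the kernel is per-class weighting of the SVM cost parameter C. The sketch below uses scikit-learn's class_weight argument; the weight of 10 for the positive class is an illustrative assumption that would normally be tuned, e.g. by cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Toy imbalanced problem: roughly 5% positives.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight multiplies C per class, making a false negative (missed
# positive) roughly ten times costlier than a false positive here.
clf = SVC(kernel="rbf", C=1.0, class_weight={0: 1.0, 1: 10.0})
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```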
“…Therefore the in silico prediction of drug metabolism profiles of CYP450s has become one of the key technologies in early drug discovery support [15-24]. The primary aim of this paper is to demonstrate how to build useful classification models out of unbalanced data sets [25-28]. We consider a data set to be unbalanced if either the sizes of the two classes differ significantly, or the costs for a false negative classification are very high whereas a false positive is acceptable, or if both conditions hold.…”
Section: Introduction
confidence: 99%