2004
DOI: 10.1007/978-3-540-27868-9_88

The Imbalanced Training Sample Problem: Under or over Sampling?

Abstract: The problem of imbalanced training sets in supervised pattern recognition methods is receiving growing attention. An imbalanced training sample means that one class is represented by a large number of examples while the other is represented by only a few. It has been observed that this situation, which arises in several practical domains, may produce an important deterioration of the classification accuracy, in particular with patterns belonging to the less represented classes. In this paper we present …

Cited by 190 publications (120 citation statements)
References 17 publications
“…This can be done by either over-sampling the minority class or under-sampling the majority class until both classes are approximately equally represented. Despite their advantages and popularity, both strategies have shortcomings because they artificially alter the prior class probabilities (Barandela et al., 2004): under-sampling may throw away potentially useful information about the majority class and perturbs the a priori probability of the training set (Dal Pozzolo et al., 2015), while over-sampling worsens the computational burden of most classifiers, may increase the likelihood of overfitting, and can introduce noise that results in a loss of performance.…”
Section: Resampling Methods To Handle Class Imbalance
confidence: 99%
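As a concrete illustration of the two baseline strategies this statement contrasts, here is a minimal NumPy sketch of random under-sampling and random over-sampling for a binary problem. The function names are illustrative, not from the cited paper, and assume the majority class is at least as large as the minority class.

```python
import numpy as np

def random_undersample(X, y, majority_label, seed=None):
    """Drop random majority-class rows until both classes are the same size.
    Risks discarding useful majority-class information (see the quote above)."""
    rng = np.random.default_rng(seed)
    maj = np.flatnonzero(y == majority_label)
    mino = np.flatnonzero(y != majority_label)
    keep = rng.choice(maj, size=mino.size, replace=False)  # discard the rest
    idx = np.concatenate([keep, mino])
    return X[idx], y[idx]

def random_oversample(X, y, minority_label, seed=None):
    """Duplicate random minority-class rows until both classes are the same size.
    Exact duplicates raise the computational burden and the risk of overfitting."""
    rng = np.random.default_rng(seed)
    mino = np.flatnonzero(y == minority_label)
    maj = np.flatnonzero(y != minority_label)
    extra = rng.choice(mino, size=maj.size - mino.size, replace=True)
    idx = np.concatenate([maj, mino, extra])
    return X[idx], y[idx]
```

Both routines equalise the two class counts, which is exactly the alteration of the prior class probabilities the quoted passage warns about.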
“…In this case, if the label of the current example does not match the labels of its k-nearest neighbours, the current example is eliminated. It is worth noting that the majority class shrinks only slightly in its number of examples when the method searches for nearest neighbours within the majority class [38]. The experiments were carried out after this preprocessing had been applied to the data sets. Thus, Wilson editing eliminates the patterns that lie near the decision boundary.…”
Section: Preprocessing Methods
confidence: 99%
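The rule described in this statement is Wilson's editing (the edited nearest neighbour, ENN, rule). A minimal sketch, assuming a data set small enough for a dense pairwise-distance matrix; the function name is illustrative:

```python
import numpy as np

def wilson_editing(X, y, k=3):
    """Remove every example whose label disagrees with the majority label
    of its k nearest neighbours (the example itself excluded)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                         # a point is not its own neighbour
    nn = np.argsort(d2, axis=1)[:, :k]                   # k nearest neighbours per example
    keep = np.empty(len(y), dtype=bool)
    for i, nbrs in enumerate(nn):
        labels, counts = np.unique(y[nbrs], return_counts=True)
        keep[i] = labels[np.argmax(counts)] == y[i]      # keep only locally consistent points
    return X[keep], y[keep]
```

Because the points whose labels disagree with their neighbourhoods cluster near the class boundary, editing tends to smooth the decision boundary rather than rebalance the class sizes, which matches the quoted observation that the majority class shrinks only slightly.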
“…Several approaches have been proposed [26,71]. As described in section 3.2, the penalty for misclassified positive points should be increased to make false negative errors costlier than false positive ones. The implementation of this technique depends on the learning method.…”
Section: New Kernels For Support Vector Machines
confidence: 99%
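A common way to realise this asymmetric penalty without changing the kernel is per-class weighting of the SVM cost parameter C. The sketch below uses scikit-learn's class_weight argument; the weight of 10 for the positive class is an illustrative assumption that would normally be tuned, e.g. by cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Toy imbalanced problem: roughly 5% positives.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight multiplies C per class, making a false negative (missed
# positive) roughly ten times costlier than a false positive here.
clf = SVC(kernel="rbf", C=1.0, class_weight={0: 1.0, 1: 10.0})
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```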
“…Therefore the in silico prediction of drug metabolism profiles of CYP450s has become one of the key technologies in early drug discovery support [15-24]. The primary aim of this paper is to demonstrate how to build useful classification models out of unbalanced data sets [25-28]. We consider a data set to be unbalanced if either the sizes of the two classes differ significantly, or the costs for a false negative classification are very high whereas a false positive is acceptable, or if both conditions hold.…”
Section: Introduction
confidence: 99%