2011
DOI: 10.1504/ijkesdp.2011.039875

Borderline over-sampling for imbalanced data classification

Abstract: Traditional classification algorithms often perform poorly on imbalanced data sets, in which some classes are heavily outnumbered by the remaining classes. For this kind of data, minority class instances, which are usually of much greater interest, are often misclassified. The paper proposes a method to deal with such data sets by changing the class distribution through over-sampling at the borderline between the minority class and the majority class of the data set. A support vector machines (SVM) classifier…
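
To make the abstract's idea concrete, the sketch below illustrates borderline over-sampling guided by an SVM: minority-class support vectors are treated as the borderline instances, and synthetic points are interpolated from them toward other minority instances. This is a minimal illustration, not the authors' exact algorithm; the linear kernel, the use of scikit-learn's SVC, and the interpolation step are assumptions.

import numpy as np
from sklearn.svm import SVC

def borderline_oversample(X, y, minority_label, n_new, seed=0):
    # Fit an SVM; its minority-class support vectors lie near the
    # borderline between the classes (assumption: linear kernel).
    rng = np.random.default_rng(seed)
    svm = SVC(kernel="linear").fit(X, y)
    sv = svm.support_                         # indices of all support vectors
    border = sv[y[sv] == minority_label]      # borderline minority instances
    minority = np.where(y == minority_label)[0]

    synthetic = np.empty((n_new, X.shape[1]))
    for s in range(n_new):
        i = rng.choice(border)                # a borderline minority instance
        j = rng.choice(minority)              # a random minority instance
        lam = rng.random()                    # interpolation coefficient in [0, 1)
        synthetic[s] = X[i] + lam * (X[j] - X[i])

    X_new = np.vstack([X, synthetic])
    y_new = np.concatenate([y, np.full(n_new, minority_label)])
    return X_new, y_new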

Cited by 460 publications (236 citation statements)
References 26 publications
“…A more complex over-sampling technique interpolates synthetic minority instances between two existing ones [3]. Some studies have found that over-sampling of the minority class in borderline regions can provide better results [6], [10]. The simplest under-sampling technique is to randomly remove a number of majority instances, while a more intelligent technique [16] discards only those majority instances that are redundant, borderline, or noisy.…”
Section: Background and Related Work
confidence: 99%
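
The interpolation described in [3] (SMOTE) amounts to one line of arithmetic per synthetic point. A minimal sketch, assuming NumPy and two illustrative minority instances x_i and x_j:

import numpy as np

rng = np.random.default_rng(0)
x_i = np.array([1.0, 2.0])         # an existing minority instance
x_j = np.array([2.0, 3.5])         # a second, neighboring minority instance
lam = rng.random()                 # uniform in [0, 1)
x_new = x_i + lam * (x_j - x_i)    # synthetic point on the segment from x_i to x_j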
“…Standard learning algorithms are usually biased toward the majority class in order to increase overall accuracy, and therefore reduce predictive accuracy on the minority class. The most popular approach for handling the class imbalance problem is to rebalance the training set using sampling techniques [1]-[10]. An advantage of sampling is that a standard learning algorithm can simply be applied to the rebalanced training set, without any need to modify that algorithm.…”
Section: Introduction
confidence: 99%
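
As one concrete instance of this rebalance-then-train recipe, the sketch below randomly under-samples the majority class and then fits an unmodified standard learner. The binary setting, the toy data, and the choice of scikit-learn's LogisticRegression are assumptions for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression

def random_undersample(X, y, majority_label, seed=0):
    # Keep every minority instance plus an equal-sized random
    # subset of the majority instances.
    rng = np.random.default_rng(seed)
    maj = np.where(y == majority_label)[0]
    mino = np.where(y != majority_label)[0]
    keep = np.concatenate([mino, rng.choice(maj, size=len(mino), replace=False)])
    return X[keep], y[keep]

# Toy usage: 95 majority (label 0) vs 5 minority (label 1) instances.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (95, 2)), rng.normal(2, 1, (5, 2))])
y = np.array([0] * 95 + [1] * 5)
X_bal, y_bal = random_undersample(X, y, majority_label=0)
clf = LogisticRegression().fit(X_bal, y_bal)   # unmodified standard learner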
“…SMOTE introduces artificial instances into data sets by interpolating feature values based on neighbors. Several studies have shown that SMOTE performs better than plain under-sampling and over-sampling techniques [3][4][5][6][7]. Moreover, SMOTE does not cause any information loss and can potentially find hidden minority regions, because it identifies similar but more specific regions in the feature space as the decision region for the minority class.…”
Section: Introduction
confidence: 99%
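
Restricting interpolation to each instance's k nearest minority neighbors is what confines SMOTE to those "similar but more specific regions". A sketch of that neighbor-based generation, with scikit-learn's NearestNeighbors and the value of k as assumptions:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=0):
    # X_min holds minority-class instances only.
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)              # idx[:, 0] is the point itself

    synthetic = np.empty((n_new, X_min.shape[1]))
    for s in range(n_new):
        i = rng.integers(len(X_min))           # seed minority instance
        j = rng.choice(idx[i, 1:])             # one of its k minority neighbors
        synthetic[s] = X_min[i] + rng.random() * (X_min[j] - X_min[i])
    return synthetic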
“…However, the classes cannot in general be assumed to be convex, and hence SMOTE does not prevent synthetic patterns from falling inside majority regions; more careful techniques have therefore been developed to mitigate (though not solve) this issue. Adaptive synthetic [5]-[7] and cluster-based sampling methods [8], [9] are examples of more powerful techniques, based on extracting knowledge from the data to analyze which patterns and regions of the space are more suitable for oversampling. This will be referred to in this paper as preferential oversampling.…”
confidence: 99%
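
The adaptive idea can be sketched by weighting each minority instance by how many of its neighbors come from the majority class, so that harder, more borderline regions receive more synthetic points. This is in the spirit of adaptive synthetic sampling such as [5], not an implementation of it; the value of k and the helper name are assumptions.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def adaptive_weights(X, y, minority_label, k=5):
    # For each minority instance, measure the fraction of majority
    # instances among its k nearest neighbors in the full data set.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X[y == minority_label])   # idx[:, 0] is the point itself
    maj_frac = (y[idx[:, 1:]] != minority_label).mean(axis=1)
    if maj_frac.sum() == 0:                 # no minority instance near the border
        return np.full(len(maj_frac), 1.0 / len(maj_frac))
    return maj_frac / maj_frac.sum()        # larger weight -> more synthetic points

# The weights can then drive how many synthetic points to interpolate
# around each minority instance, e.g. counts = np.round(weights * n_new).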
“…Ideally, a better-fitted kernel will increase the class separability, providing a safer environment for the generation of synthetic patterns. The last part of this paper proposes a unified adaptive framework for preferential oversampling that generalizes several oversampling approaches in the literature [3], [5], [6]. The optimal SVM hyperplane and kernel-learning techniques are used to optimize the synthetically generated patterns.…”
confidence: 99%