Abstract. Support Vector Machines (SVMs) have been extensively studied and have shown remarkable success in many applications. However, the success of SVMs is very limited when they are applied to the problem of learning from imbalanced datasets in which negative instances heavily outnumber the positive instances (e.g., in gene profiling and credit card fraud detection). This paper discusses the factors behind this failure and explains why the common strategy of undersampling the training data may not be the best choice for SVMs. We then propose an algorithm for overcoming these problems, based on a variant of the SMOTE algorithm by Chawla et al. combined with the different error costs algorithm of Veropoulos et al. We compare the performance of our algorithm against these two algorithms, along with undersampling and regular SVM, and show that our algorithm outperforms all of them.
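The two ingredients named in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm (the abstract does not describe their SMOTE variant); it only shows the general ideas of SMOTE-style interpolation between minority points and of a soft-margin objective with separate error costs for the two classes. The function names `smote_like` and `dec_objective` and the parameters `c_pos` and `c_neg` are hypothetical, chosen for illustration.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority points (illustrative sketch only).

    Each synthetic point lies on the segment between a randomly chosen
    minority point and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Other minority points, sorted by squared Euclidean distance to x.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(x, neighbour)))
    return synthetic

def dec_objective(w, b, X, y, c_pos, c_neg):
    """Soft-margin objective with different error costs per class.

    Hinge losses of positive examples are weighted by c_pos and those of
    negative examples by c_neg (assumed names, not from the abstract).
    """
    reg = 0.5 * sum(wi * wi for wi in w)
    loss = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        hinge = max(0.0, 1.0 - margin)
        loss += (c_pos if yi == 1 else c_neg) * hinge
    return reg + loss

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority, n_new=6)

X = [(2.0, 0.0), (0.5, 0.0), (-0.5, 0.0)]
y = [1, 1, -1]
objective = dec_objective((1.0, 0.0), 0.0, X, y, c_pos=10.0, c_neg=1.0)
```

Setting `c_pos` larger than `c_neg` makes margin violations on the rare positive class more expensive, which is the intuition behind using different error costs on imbalanced data.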
A randomized learning algorithm POLLY is presented that efficiently learns intersections of s halfspaces in n dimensions, in time polynomial in both s and n. The learning protocol is the PAC (probably approximately correct) model of Valiant, augmented with membership queries. In particular, POLLY receives a set S of m = poly(n, s, 1/ε, 1/δ) randomly generated points from an arbitrary distribution over the unit hypercube, and is told exactly which points are contained in, and which points are not contained in, the convex polyhedron P defined by the halfspaces. POLLY may also obtain the same information about points of its own choosing. It is shown that after poly(n, s, 1/ε, 1/δ, log(1/d)) time, the probability that POLLY fails to output a collection of s halfspaces with classification error at most ε is at most δ. Here, d is the minimum distance between the boundary of the target and those examples in S that do not lie on the boundary. The parameter log(1/d) can be bounded by the number of bits needed to encode the coefficients of the bounding hyperplanes and the coordinates of the sampled examples S. Moreover, POLLY can be extended to learn unions of k disjoint polyhedra, each having at most s facets, in time poly(n, k, s, 1/ε, 1/δ, log(1/d), 1/γ), where γ is the minimum distance between any two distinct polyhedra.
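The setting described in this abstract can be made concrete with a small sketch. It does not implement POLLY; it only shows what a membership oracle for an intersection of halfspaces looks like and how the disagreement between a hypothesis and the target might be estimated on a sample. The representation of a halfspace as a pair `(w, b)` meaning w · x ≤ b, and the names `in_polyhedron` and `empirical_error`, are assumptions made for illustration.

```python
def in_polyhedron(x, halfspaces):
    """Membership oracle: True if x satisfies every halfspace w . x <= b."""
    return all(
        sum(wi * xi for wi, xi in zip(w, x)) <= b
        for w, b in halfspaces
    )

def empirical_error(hypothesis, target, sample):
    """Fraction of sample points on which hypothesis and target disagree.

    This is only an estimate of the classification error, which is
    defined with respect to the underlying distribution.
    """
    wrong = sum(
        in_polyhedron(x, hypothesis) != in_polyhedron(x, target)
        for x in sample
    )
    return wrong / len(sample)

# Target polyhedron inside the unit square: x1 <= 0.5 and x2 <= 0.5.
target = [((1.0, 0.0), 0.5), ((0.0, 1.0), 0.5)]
# A slightly wrong hypothesis: the second bound is too loose.
hypothesis = [((1.0, 0.0), 0.5), ((0.0, 1.0), 0.75)]

sample = [(0.25, 0.25), (0.25, 0.6), (0.75, 0.25), (0.25, 0.9)]
err = empirical_error(hypothesis, target, sample)
```

In this toy example the hypothesis and the target disagree only on the point `(0.25, 0.6)`, so the estimated error is one in four.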
Learning from imbalanced data occurs frequently in many machine learning applications. A ratio of one positive example to thousands of negative instances is common in scientific applications. Unfortunately, traditional machine learning techniques often treat rare instances as noise. One popular approach to this difficulty is to resample the training data. However, this results in many false positive predictions. Hence, we propose preprocessing the training data by partitioning it into clusters. This greatly reduces the imbalance between minority and majority instances in each cluster. For moderate imbalance ratios, our technique gives better prediction accuracy than other resampling methods. For extreme imbalance ratios, this technique serves as a good filter that reduces the amount of imbalance so that traditional classification techniques can be deployed. More importantly, we have successfully applied our techniques to splice site prediction and protein subcellular localization problems, with significant improvements over previous predictors.
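The effect of partitioning on the class ratio can be seen in a toy sketch. The abstract does not say which clustering method is used, so the sketch assumes fixed cluster centers (in practice these might come from a method such as k-means) and simply assigns each point to its nearest center. The names `assign_clusters` and `counts_by_cluster` are hypothetical.

```python
def assign_clusters(points, centers):
    """Assign each point to the index of its nearest center."""
    def dist2(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))
    return [
        min(range(len(centers)), key=lambda j: dist2(p, centers[j]))
        for p in points
    ]

def counts_by_cluster(labels, assignment, n_clusters):
    """Return a (negatives, positives) count pair for each cluster."""
    counts = [[0, 0] for _ in range(n_clusters)]
    for label, cluster in zip(labels, assignment):
        counts[cluster][1 if label == 1 else 0] += 1
    return [tuple(c) for c in counts]

# Toy 1-D data: 8 negatives (label 0) and 2 positives (label 1) overall.
points = [(0.0,), (0.1,), (0.2,), (0.3,), (0.4,), (0.5,),
          (5.0,), (5.1,), (5.2,), (5.3,)]
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]

centers = [(0.25,), (5.15,)]  # assumed centers, chosen by hand
assignment = assign_clusters(points, centers)
counts = counts_by_cluster(labels, assignment, n_clusters=2)
```

Here the overall ratio is 8 negatives to 2 positives, but the second cluster holds 2 negatives and 2 positives, while the first cluster contains negatives only and could be handled trivially.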