Stephen Kwek scite author profile

Abstract. Support Vector Machines (SVM) have been extensively studied and have shown remarkable success in many applications. However the success of SVM is very limited when it is applied to the problem of learning from imbalanced datasets in which negative instances heavily outnumber the positive instances (e.g. in gene profiling and detecting credit card fraud). This paper discusses the factors behind this failure and explains why the common strategy of undersampling the training data may not be the best choice for SVM. We then propose an algorithm for overcoming these problems which is based on a variant of the SMOTE algorithm by Chawla et al, combined with Veropoulos et al's different error costs algorithm. We compare the performance of our algorithm against these two algorithms, along with undersampling and regular SVM and show that our algorithm outperforms all of them.

show abstract

A boosting approach to remove class label noise

Karmaker

Kwek

2005

View full text Add to dashboard Cite

Optimal search for rationals

Kwek

Mehlhorn

2003

Information Processing Letters

View full text Add to dashboard Cite

PAC Learning Intersections of Halfspaces with Membership Queries

Kwek¹,

Pitt²

1998

Algorithmica

View full text Add to dashboard Cite

A randomized learning algorithm POLLY is presented that efficiently learns intersections of s halfspaces in n dimensions, in time polynomial in both s and n. The learning protocol is the PAC (probably approximately correct) model of Valiant, augmented with membership queries. In particular, POLLY receives a set S of m = poly(n, s, 1/ε, 1/δ) randomly generated points from an arbitrary distribution over the unit hypercube, and is told exactly which points are contained in, and which points are not contained in, the convex polyhedron P defined by the halfspaces. POLLY may also obtain the same information about points of its own choosing. It is shown that after poly(n, s, 1/ε, 1/δ, log(1/d)) time, the probability that POLLY fails to output a collection of s halfspaces with classification error at most ε, is at most δ. Here, d is the minimum distance between the boundary of the target and those examples in S that are not lying on the boundary. The parameter log(1/d) can be bounded by the number of bits needed to encode the coefficients of the bounding hyperplanes and the coordinates of the sampled examples S. Moreover, POLLY can be extended to learn unions of k disjoint polyhedra with each polyhedron having at most s facets, in time poly(n, k, s, 1/ε, 1/δ, log(1/d), 1/γ ) where γ is the minimum distance between any two distinct polyhedra.

show abstract

A data reduction approach for resolving the imbalanced data issue in functional genomics

Yoon

Kwek

2007

Neural Comput & Applic

View full text Add to dashboard Cite

Learning from imbalanced data occurs frequently in many machine learning applications. One positive example to thousands of negative instances is common in scientific applications. Unfortunately, traditional machine learning techniques often treat rare instances as noise. One popular approach for this difficulty is to resample the training data. However, this results in high false positive predictions. Hence, we propose preprocessing training data by partitioning them into clusters. This greatly reduces the imbalance between minority and majority instances in each cluster. For moderate imbalance ratio, our technique gives better prediction accuracy than other resampling method. For extreme imbalance ratio, this technique serves as a good filter that reduces the amount of imbalance so that traditional classification techniques can be deployed. More importantly, we have successfully applied our techniques to splice site prediction and protein subcellular localization problem, with significant improvements over previous predictors.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

customersupport@researchsolutions.com

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Stephen Kwek

Applying Support Vector Machines to Imbalanced Datasets

A boosting approach to remove class label noise

Optimal search for rationals

PAC Learning Intersections of Halfspaces with Membership Queries

A data reduction approach for resolving the imbalanced data issue in functional genomics

Contact Info

Product

Resources

About