Real-world data often exhibit skewed distributions with long tails, where certain target values have significantly fewer observations than others rather than following an ideal uniform distribution across categories, which substantially degrades model performance in classification problems. Moreover, parametric logistic regression is a fundamental and easily interpretable classification model; however, it is often doubtful that the logit is truly linear in the covariates. This research proposes a performance-based active learning (PbAL) scheme with nonparametric logistic regression to address the imbalance problem while accommodating a nonlinear decision boundary. PbAL sequentially selects the most informative samples from an imbalanced dataset by directly evaluating a performance metric on a pool set. A nonparametric logistic regression model with smoothing splines is used to achieve a flexible classification boundary. Experiments show that PbAL outperforms traditional active learning approaches based on D-optimality and A-optimality. The proposed method also provides superior outcomes compared to resampling techniques commonly used for imbalanced classification, such as Tomek links and SMOTE, even with a smaller sample size. This result suggests that PbAL effectively mitigates the bias that severely degrades model performance when only a small amount of initial training data is available.
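To make the pool-based, performance-driven selection idea concrete, the following is a minimal sketch of a greedy query loop: each candidate in the pool is tentatively added to the labeled set, the classifier is refit, and the candidate whose inclusion maximizes a performance metric on the pool is queried next. It is an illustration only, not the paper's algorithm: a plain `LogisticRegression` stands in for the smoothing-spline model, F1 is an assumed metric, the pool labels are used directly for scoring (in practice they would have to be estimated), and the dataset, budget, and seeds are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Imbalanced toy data: roughly 10% positives (illustrative assumption).
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, weights=[0.9, 0.1], random_state=0)

# Small initial labeled set containing both classes; the rest is the pool.
pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
labeled = list(rng.choice(pos, 2, replace=False)) + list(rng.choice(neg, 8, replace=False))
pool = [i for i in range(len(y)) if i not in labeled]

def fit(idx):
    # Stand-in classifier; the paper instead uses nonparametric
    # logistic regression with smoothing splines.
    return LogisticRegression().fit(X[idx], y[idx])

# Greedy performance-based selection over a small, arbitrary query budget.
for _ in range(10):
    best_i, best_score = None, -np.inf
    for i in pool:
        model = fit(labeled + [i])
        # Score the candidate by a performance metric evaluated on the pool.
        score = f1_score(y[pool], model.predict(X[pool]))
        if score > best_score:
            best_i, best_score = i, score
    labeled.append(best_i)
    pool.remove(best_i)

print("Pool F1 after selection:", f1_score(y[pool], fit(labeled).predict(X[pool])))
```

Under this sketch, the greedy criterion replaces classical design criteria such as D- or A-optimality with a direct estimate of classification performance, which is the distinction the abstract draws.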