Entropy and improved k‐nearest neighbor search based under‐sampling (ENU) method to handle class overlap in imbalanced datasets

Kumar, Anil; Singh, Dinesh; Yadav, Rama Shankar

doi:10.1002/cpe.7894

Cited by 8 publications

(1 citation statement)

References 79 publications

“…Previous studies have extensively investigated methods to improve model accuracy on imbalanced data, including data resampling (such as under-sampling majority classes [19][20][21][22] or over-sampling minority classes [23][24][25]), generating synthetic data [1,13,14] and cost-sensitive learning [26,27] that assigns higher weights to the loss of minority classes. However, these studies mostly focused on imbalanced classification, with few efforts dedicated to addressing imbalanced regression tasks.…”

Section: Imbalanced Learning (mentioning)

confidence: 99%

Boosting semi‐supervised learning under imbalanced regression via pseudo‐labeling

Zong, Su, Zhou

2024

Concurrency and Computation

Summary: Imbalanced samples are widespread, which impairs the generalization and fairness of models. Semi‐supervised learning can overcome the deficiency of rare labeled samples, but it is challenging to select high‐quality pseudo‐label data. Unlike discrete labels that can be matched one‐to‐one with points on a numerical axis, labels in regression tasks are consecutive and cannot be directly chosen. Besides, the distribution of unlabeled data is imbalanced, which easily leads to an imbalanced distribution of pseudo‐label data, exacerbating the imbalance in the semi‐supervised dataset. To solve this problem, this article proposes a semi‐supervised imbalanced regression network (SIRN), which consists of two components: A, designed to learn the relationship between features and labels (targets), and B, dedicated to learning the relationship between features and target deviations. To measure target deviations under imbalanced distribution, the target deviation function is introduced. To select continuous pseudo‐labels, the deviation matching strategy is designed. Furthermore, an adaptive selection function is developed to mitigate the risk of skewed distributions due to imbalanced pseudo‐label data. Finally, the effectiveness of the proposed method is validated through evaluations of two regression tasks. The results show a great reduction in predicted value error, particularly in few‐shot regions. This empirical evidence confirms the efficacy of our method in addressing the issue of imbalanced samples in regression tasks.
