Class imbalance learning (CIL) has become one of the most challenging research topics in machine learning. In this article, we propose a Boosted co-training method that modifies the class distribution so that traditional classifiers can be readily adapted to imbalanced datasets. This article is among the first to utilize the pseudo-labelled data produced by co-training to enlarge the training set of minority classes. Compared with existing oversampling methods, which generate synthetic minority samples from labelled data only, the proposed method can learn from unlabelled data and thereby reduces the risk of overfitting. Furthermore, we propose a boosting-style technique that implicitly modifies the class distribution, and we combine it with co-training to alleviate the bias towards majority classes. Finally, we collect the two series of classifiers generated during Boosted co-training to build an ensemble for classification, which further improves CIL performance by leveraging the strength of ensemble learning. By taking advantage of the diversity of co-training, we also contribute a new approach to generating base classifiers for ensemble learning. The proposed method is compared with eight state-of-the-art CIL methods on a variety of benchmark datasets. Measured by G-Mean, F-Measure, and AUC, Boosted co-training achieves the best performance and average ranks on 18 benchmark datasets. The experimental results demonstrate the significant superiority of Boosted co-training over other CIL methods.

KEYWORDS: boosting, class-imbalanced learning, co-training, over-sampling, pseudo-labelled data
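The abstract describes three ingredients: pseudo-labelling unlabelled minority candidates via co-training, boosting-style reweighting that implicitly rebalances the class distribution, and an ensemble built from the two series of classifiers. The following is a minimal Python sketch of such a loop; the two-view split, decision-tree base learners, confidence threshold, and the specific reweighting rule are illustrative assumptions, not the authors' exact algorithm.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def class_balance_weights(y, minority_label=1):
    """Boosting-style sample weights that implicitly rebalance the class distribution."""
    n_min = np.sum(y == minority_label)
    n_maj = len(y) - n_min
    w = np.where(y == minority_label, n_maj, n_min).astype(float)
    return w / w.sum()

def boosted_co_training(X1, X2, y, U1, U2, rounds=10, conf=0.9, minority_label=1):
    """X1/X2: two feature views of the labelled data; U1/U2: views of the unlabelled data.
    Assumes binary labels {0, 1} with the minority class encoded as 1."""
    view1, view2 = [], []
    for _ in range(rounds):
        # Train one classifier per view on the (re)weighted labelled data.
        w = class_balance_weights(y, minority_label)
        h1 = DecisionTreeClassifier(max_depth=3).fit(X1, y, sample_weight=w)
        h2 = DecisionTreeClassifier(max_depth=3).fit(X2, y, sample_weight=w)
        view1.append(h1)
        view2.append(h2)

        if len(U1) == 0:
            continue
        # Pseudo-label unlabelled samples that both views confidently predict as
        # minority; adding them enlarges the minority training set with real
        # (not synthetic) samples, unlike oversampling from labelled data alone.
        p1 = h1.predict_proba(U1)[:, minority_label]
        p2 = h2.predict_proba(U2)[:, minority_label]
        keep = (p1 >= conf) & (p2 >= conf)
        if keep.any():
            X1 = np.vstack([X1, U1[keep]])
            X2 = np.vstack([X2, U2[keep]])
            y = np.concatenate([y, np.full(keep.sum(), minority_label)])
            U1, U2 = U1[~keep], U2[~keep]
    return view1, view2

def predict(view1, view2, X1, X2):
    """Majority vote over both series of classifiers (the final ensemble)."""
    votes = np.array([h.predict(X1) for h in view1] + [h.predict(X2) for h in view2])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```

In this sketch, the diversity needed for the ensemble comes from the two feature views and from the changing training set across rounds, echoing the abstract's point that co-training itself supplies a source of base-classifier diversity.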
1 | INTRODUCTION

Many real-world classification tasks suffer from the class imbalance problem, where minority classes are heavily under-represented compared with majority classes. Traditional classifiers are designed to output the hypothesis that minimizes the overall prediction error. As a result, they tend to be biased towards majority classes and therefore perform poorly on minority classes (Kaur et al., 2019). However, minority classes are usually more valuable in real applications, such as fraud detection, medical diagnosis, spam classification, and many others. For example, in rare-disease diagnosis, a classifier that identifies all patients as normal cases is useless even if it achieves 99% accuracy. Therefore, learning from class-imbalanced data has become one of the most challenging topics in machine learning. Numerous CIL techniques have been proposed over the past decades; they can be roughly grouped into the following two categories:

i. Data-level methods preprocess a dataset (e.g., by oversampling (Chawla et al., 2002) or undersampling (Kubat & Matwin, 1997)) to make it suitable for standard classification algorithms. This approach is classifier-independent; however, generating a perfectly balanced distribution does not always yield an optimal result for classification tasks (Wu & Chang, 2003).