Abstract-Imbalanced classification deals with learning from data with a disproportional number of samples in its classes. Traditional classifiers exhibit poor behavior when facing this kind of data because they do not take into account the imbalanced class distribution. Four main kinds of solutions exist to solve this problem: modifying the data distribution, modifying the learning algorithm for considering the imbalance representation, including the use of costs for data samples, and ensemble methods. In this paper, we adopt the second type of solution and introduce a classification algorithm for imbalanced data that uses fuzzy rough set theory and ordered weighted average aggregation. The proposal considers different strategies to build a weight vector to take into account data imbalance. Our methods are validated by an extensive experimental study, showing statistically better results than 13 other state-of-the-art methods.Index Terms-Fuzzy rough sets, imbalanced classification, machine learning, ordered weighted average (OWA).
Classification techniques in the big data scenario are in high demand in a wide variety of applications. The huge increment of available data may limit the applicability of most of the standard techniques. This problem becomes even more difficult when the class distribution is skewed, the topic known as imbalanced big data classification. Evolutionary undersampling techniques have shown to be a very promising solution to deal with the class imbalance problem. However, their practical application is limited to problems with no more than tens of thousands of instances.In this contribution we design a parallel model to enable evolutionary undersampling methods to deal with large-scale problems. To do this, we rely on a MapReduce scheme that distributes the functioning of these kinds of algorithms in a cluster of computing elements. Moreover, we develop a windowing approach for class imbalance data in order to speed up the undersampling process without losing accuracy. In our experiments we test the capabilities of the proposed scheme with several data sets with up to 4 million instances. The results show promising scalability abilities for evolutionary undersampling within the proposed framework.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.