FOUNDATIONS OF IMBALANCED LEARNING
DRAFT, July 9, 2012, 11:10pm

Many important learning problems, from a wide variety of domains, involve learning from imbalanced data. Because this learning task is quite challenging, there has been a tremendous amount of research on this topic over the past fifteen years. However, much of this research has focused on methods for dealing with imbalanced data, without discussing exactly how or why such methods work, or what underlying issues they address. This is a significant oversight, which this chapter helps to address. The chapter begins by describing what is meant by imbalanced data and by showing the effects of such data on learning. It then describes the fundamental learning issues that arise when learning from imbalanced data, and categorizes these issues as either problem-definition-level issues, data-level issues, or algorithm-level issues. The chapter then describes the methods for addressing these issues and organizes these methods using the same three categories. As one example, the data-level issue of "absolute rarity" (i.e., not having sufficient numbers of minority-class examples to properly learn the decision boundaries for the minority class) can best be addressed using a data-level method that acquires additional minority-class training examples. But as we shall see in this chapter, sometimes such a direct solution is not available and less direct methods must be utilized. Common misconceptions are also discussed and explained.

Overall, this chapter provides an understanding of the foundations of imbalanced learning by providing a clear description of the relevant issues, and a clear mapping from these issues to the methods that can be used to address them.
INTRODUCTION

Many of the machine learning and data mining problems that we study, whether they are in business, science, medicine, or engineering, involve some form of data imbalance. The imbalance is often an integral part of the problem and in virtually every case the less frequently occurring entity is the one that we are most interested in. For example, those working on fraud detection will focus on identifying the fraudulent transactions rather than the more common legitimate transactions [1], a telecommunications engineer will be far more interested in identifying equipment about to fail than equipment that will remain operational [2], and an industrial engineer will be more likely to focus on weld flaws than on welds that are completed satisfactorily [3].

In all of these situations it is far more important to accurately predict or identify the rarer case than the more common case, and this is reflected in the costs associated with errors in the predictions and classifications. For example, if we predict that telecommunication equipment is going to fail and it does not, we may incur some modest inconvenience and cost if the equipment is ...

2.2.1 What is an Imbalanced Data Set and what is its Impact on Learning?

We be...