Types of minority class examples and their influence on learning classifiers from imbalanced data

Napierała, Krystyna; Stefanowski, Jerzy

doi:10.1007/s10844-015-0368-1

Cited by 236 publications

(211 citation statements)

References 40 publications

Supporting

Mentioning

200

Contrasting

Unclassified


“…For instance, if k = 5, the type of the example is assigned in the following way (Napierala and Stefanowski 2012; 2016): 5:0 or 4:1 - an example is labeled as a safe example; 3:2 or 2:3 - a borderline example; 1:4 - labeled as a rare example; 0:5 - example is labeled as an outlier. This rule can be generalized for higher k values, however, results of recent experiments (Napierala and Stefanowski 2016) show that they lead to a similar categorization of considered datasets. Therefore, in the following study we stay with k = 5.…”

Section: Methods (mentioning)

confidence: 96%
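The k = 5 rule quoted above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, Euclidean distance is an assumption (the quoted statement does not name a metric), and ties between equally distant neighbours are broken by index order.

```python
import math


def type_from_same_class_count(same: int) -> str:
    """Map the number of same-class neighbours (out of k = 5) to an example type."""
    if not 0 <= same <= 5:
        raise ValueError("same must be between 0 and 5 when k = 5")
    if same >= 4:
        return "safe"        # ratio 5:0 or 4:1
    if same >= 2:
        return "borderline"  # ratio 3:2 or 2:3
    if same == 1:
        return "rare"        # ratio 1:4
    return "outlier"         # ratio 0:5


def label_examples(points, labels, target_label, k=5):
    """Assign a type to every example of target_label from its k nearest neighbours.

    Assumption: Euclidean distance over numeric coordinates. Only k = 5 is
    supported, because the thresholds above are the ones quoted for k = 5.
    """
    if k != 5:
        raise ValueError("this sketch only encodes the thresholds quoted for k = 5")
    if len(points) <= k:
        raise ValueError("need more than k examples to find k neighbours")
    types = {}
    for i, (point, label) in enumerate(zip(points, labels)):
        if label != target_label:
            continue
        # Distances to every other example, nearest first.
        others = sorted(
            (math.dist(point, other), j)
            for j, other in enumerate(points)
            if j != i
        )
        same = sum(1 for _, j in others[:k] if labels[j] == label)
        types[i] = type_from_same_class_count(same)
    return types
```

Calling `label_examples(points, labels, target_label=1)` returns a dictionary from example index to one of the four type names, which can then be tallied to summarise how difficult the minority class distribution is.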

“…Recall that different difficulty factors could be considered: a fragmentation of the minority class into small disjuncts, overlapping of decision boundaries, presence of rare cases, outliers, noise (Stefanowski 2016a). Here we follow the methodology from Napierala and Stefanowski (2012, 2016), where most of these data difficulty factors can be modeled by distinguishing the following types of examples: safe examples (located in the homogeneous regions populated by examples from one class only); borderline (placed close to the decision boundary between classes); rare examples (isolated groups of few examples located deeper inside the opposite class), or outliers.…”

Section: Methods (mentioning)

confidence: 99%

“…For more details, the reader is referred to a recently published monograph (He and Ma 2013) covering the most representative issues and to the earlier systematic surveys, such as Chawla (2005), He and Garcia (2009), Sun et al (2009). A recent, comprehensive review of pre-processing methods could be found in Branco et al (2016) and their comparative studies are provided by Napierala and Stefanowski (2016), Van Hulse et al (2007).…”

Section: Preliminaries (mentioning)

confidence: 99%

“…Then, values start to fluctuate around certain levels. However, both datasets are the smallest ones as well as the distributions of the minority class are the most sparse and the most difficult ones (Napierala and Stefanowski 2016).…”

Section: The Influence Of the Number Of Component Classifiers (mentioning)

confidence: 99%


Multi-class and feature selection extensions of Roughly Balanced Bagging for imbalanced data

Lango

Stefanowski

2017

J Intell Inf Syst

Self Cite


Roughly Balanced Bagging is one of the most efficient ensembles specialized for class imbalanced data. In this paper, we study its basic properties that may influence its good classification performance. We experimentally analyze them with respect to bootstrap construction, deciding on the number of component classifiers, their diversity, and ability to deal with the most difficult types of the minority examples. Then, we introduce two generalizations of this ensemble for dealing with a higher number of attributes and for adapting it to handle multiple minority classes. Experiments with synthetic and real life data confirm usefulness of both proposals.

“…Also, even the more traditional learning tasks, such as training classifiers, are now reformulated in more demanding ways, for instance, by taking into account additional constraints or data properties, like unusual distributions of examples and/or imbalance of target classes (Fernández et al, 2017; Napierala and Stefanowski, 2016). Such enriched input to the induction process requires more advanced and complex algorithms.…”

Section: Types Of Complex and Big Data (mentioning)

confidence: 99%

Exploring complex and big data

Stefanowski

Krawiec

Wrembel

2017

International Journal of Applied Mathematics and Computer Science

Self Cite


This paper shows how big data analysis opens a range of research and technological problems and calls for new approaches. We start with defining the essential properties of big data and discussing the main types of data involved. We then survey the dedicated solutions for storing and processing big data, including a data lake, virtual integration, and a polystore architecture. Difficulties in managing data quality and provenance are also highlighted. The characteristics of big data imply also specific requirements and challenges for data mining algorithms, which we address as well. The links with related areas, including data streams and deep learning, are discussed. The common theme that naturally emerges from this characterization is complexity. All in all, we consider it to be the truly defining feature of big data (posing particular research and technological challenges), which ultimately seems to be of greater importance than the sheer data volume.


Particle swarm optimization–deep belief network–based rare class prediction model for highly class imbalance problem

Kim

Han

Lee

2017

Concurrency and Computation


Rare class imbalance problems, which involve the classification of minority or rare class, are difficult, because the size of the rare class is smaller than the majority class. Since majority class prediction is easy, its accuracy seems to be also high. However, the minority classes cannot be accurately predicted, and for this reason, when the prediction model performance is evaluated by considering only the accuracy, it does not indicate whether the model can predict the minority classes. Therefore, a rare class prediction technique is required. In this study, a rare class prediction model is proposed for minority class prediction. In addition, a dataset of a semiconductor manufacturing process with class imbalance problems was used to create a fault detection model. This prediction model uses data preprocessing to build the characteristics and data set required by the rare classes. To distinguish the rare classes related to the required characteristics, we used standard deviation and Euclidean distance to perform the feature selection. In addition, a particle swarm optimization-deep belief network was applied to create a classifier. The model proposed in this research presents outstanding performance and is appropriate for highly class imbalance problems.

KEYWORDS: class imbalance problem, deep belief network, feature selection, particle swarm optimization, rare class classification

INTRODUCTION

Because of the issues with big data and the development of deep learning techniques, the methods for building prediction models are in the spotlight. 1,2 Many AI-based prediction models, which use machine learning, data mining, databases, and statistical methods, are being proposed. Such prediction models based on state-of-the-art techniques are being applied in many fields, and there is a progressive increase in their industrial value. 3,4 For us to implement the prediction models accurately, it is necessary to analyze both domain knowledge and data. In addition, there is an increase in demand for obtaining useful knowledge from the collected data, and therefore, active research is being conducted on prediction models that are suitable for specific domains. 5,6 Thus, the importance of classification prediction techniques for class imbalance problems including class distribution, which is 1 of the main issues in the field of data mining, is increasing. 7-9 When the classes are balanced (balanced class), the ratios of the classes to be predicted are evenly distributed. Thus, by learning the data, a balanced predictive model that can predict all the classes can be generated. In the imbalance problem, the ratio of the category to be predicted is different. In this case, a classification prediction model that can predict only a specific class (rare class or majority class) is generated. For example, in the semiconductor manufacturing process, although most of the produced wafers are regular products, there is small probability for the production of irregular products. Therefore, a rare class prediction method is required to pre...
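The abstract's point that accuracy alone hides poor minority-class prediction can be illustrated with a small sketch. The numbers below are invented for illustration only (990 regular and 10 irregular wafers are an assumption, not figures from the paper), and the helper names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of all examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of truly positive examples that were predicted as positive."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not predictions_for_positives:
        raise ValueError("no examples of the positive class in y_true")
    hits = sum(p == positive for p in predictions_for_positives)
    return hits / len(predictions_for_positives)


# Hypothetical data: 0 = regular wafer (majority), 1 = irregular wafer (rare).
y_true = [0] * 990 + [1] * 10
# A trivial model that always predicts the majority class.
y_pred = [0] * 1000

majority_only_accuracy = accuracy(y_true, y_pred)
rare_class_recall = recall(y_true, y_pred, positive=1)
print(majority_only_accuracy)  # 0.99
print(rare_class_recall)       # 0.0
```

The trivial model scores 99% accuracy while detecting none of the rare cases, which is why class-specific measures such as recall on the rare class are usually reported alongside accuracy.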
