2003
DOI: 10.1007/978-3-540-39804-2_12

SMOTEBoost: Improving Prediction of the Minority Class in Boosting

Abstract: Many real-world data mining applications involve learning from imbalanced data sets. Learning from data sets that contain very few instances of the minority (or interesting) class usually produces biased classifiers that have a higher predictive accuracy over the majority class(es), but poorer predictive accuracy over the minority class. SMOTE (Synthetic Minority Over-sampling TEchnique) is specifically designed for learning from imbalanced data sets. This paper presents a novel approach for learning…
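The interpolation step at the heart of SMOTE is compact enough to sketch. The snippet below is a minimal illustration of the idea the abstract names, not the paper's own pseudocode; the function name `smote_sample` and its parameters are hypothetical.

```python
# Minimal SMOTE sketch: each synthetic point lies on the segment between a
# minority-class sample and one of its k nearest minority-class neighbors.
# Function and parameter names are illustrative, not from the paper.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, n_synthetic, k=5, seed=None):
    """Generate n_synthetic points from minority-class samples X_min."""
    rng = np.random.default_rng(seed)
    # +1 neighbors because each point is returned as its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    synthetic = np.empty((n_synthetic, X_min.shape[1]))
    for s in range(n_synthetic):
        i = rng.integers(len(X_min))   # pick a random minority sample
        j = rng.choice(idx[i, 1:])     # pick one of its k nearest neighbors
        gap = rng.random()             # interpolation factor in [0, 1)
        synthetic[s] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synthetic
```

For practical use, the imbalanced-learn library ships a full implementation as imblearn.over_sampling.SMOTE.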

Cited by 1,192 publications (708 citation statements)
References 14 publications
“…For example, Chawla et al (2003) and Chen et al (2004) found that imbalance between the proportion of presence and absence classes can cause bias in the prediction and model-fit. They found that when an imbalanced sample is present, the bootstrap of the data is biased towards the majority class, thus over-predicting the majority-class and under-predicting the minority.…”
Section: Introduction (mentioning)
confidence: 99%
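The over-prediction effect this excerpt describes is easy to reproduce. Below is a minimal sketch, not code from Chawla et al. (2003) or Chen et al. (2004); the dataset and model are arbitrary choices, used only to show that a classifier fit on a 95:5 sample recalls the majority class far better than the minority:

```python
# Illustrative only: a classifier trained on a 95:5 imbalanced sample
# tends to over-predict the majority class (class 0 here).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)

# Per-class recall: the minority class (1) is typically much lower.
for c in (0, 1):
    mask = y_te == c
    print(f"class {c}: n={mask.sum()}, recall={(pred[mask] == c).mean():.2f}")
```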
“…The model was evaluated using F-measure, G-mean and Accuracy using seventeen imbalanced datasets including: Ionosphere, Hepatitis, Abalone, Yeast, Oil spills and Breast Cancer datasets. For each datasets the model was compared with C4.5, AdaBoostM1, DataBoost, CSB2, AdaCost [30] and SMOTEBoost [14]. The proposed model scored high on highly imbalanced datasets in terms of the F-measure and is comparable (in some instances higher) with other models when it comes to G-mean and Accuracy.…”
Section: Results (mentioning)
confidence: 99%
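For reference, the three metrics this excerpt names can be computed as follows. The arrays below are toy placeholders, and G-mean is taken as the geometric mean of the per-class recalls, the usual definition in the imbalanced-learning literature:

```python
# Hedged sketch of F-measure, G-mean, and Accuracy for a binary problem;
# y_true/y_pred are placeholder toy data.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 1, 0, 1, 0, 0])

f_measure = f1_score(y_true, y_pred)                     # harmonic mean of precision and recall
sensitivity = recall_score(y_true, y_pred, pos_label=1)  # minority-class recall
specificity = recall_score(y_true, y_pred, pos_label=0)  # majority-class recall
g_mean = np.sqrt(sensitivity * specificity)              # geometric mean of the two recalls

print(f"F-measure={f_measure:.2f}, G-mean={g_mean:.2f}, "
      f"Accuracy={accuracy_score(y_true, y_pred):.2f}")
```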
“…Bagging [10] and boosting [30] are two popular methods for building ensembles of classifiers with a rich history of extensions [17,31,39,61,74,78]. In this section we outline various approaches which have been taken to make bagging and boosting methods overcome concept drift.…”
Section: Bagging and Boosting Based Methods (mentioning)
confidence: 99%
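As a point of reference for the two ensemble families this excerpt names, here is a minimal scikit-learn sketch; the synthetic dataset and hyperparameters are arbitrary choices for illustration:

```python
# Bagging vs. boosting in their stock scikit-learn forms, on toy data.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# Bagging: trains each tree on a bootstrap resample, then averages votes.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            random_state=0)
# Boosting: reweights examples so later learners focus on earlier mistakes.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.2f}")
```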