Smote vs. Random Undersampling for Imbalanced Data - Car Ownership Demand Model

Chaipanha, Wuttikrai; Kaewwichian, Patiphan

doi:10.26552/com.c.2022.3.d105-d115

Cited by 7 publications

(4 citation statements)

References 28 publications

“…Random Oversampling performs random replication on minority samples to balance the class distribution [7]. Meanwhile, Random Undersampling is used to balance the distribution of each class by randomly removing majority class samples [6]. SMOTE-NC is an oversampling technique that uses K-nearest neighbor characteristics in explanatory variables to produce synthetic data in the minority class [8].…”

Section: M-2024-495 (mentioning)

confidence: 99%
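The two random resampling strategies described in the statement above can be illustrated with a minimal, standard-library-only sketch. The function names, the toy rows, and the class labels are assumptions made for illustration; they are not taken from the cited papers, and a real study would more likely use a dedicated resampling library.

```python
import random
from collections import Counter, defaultdict


def _group_by_class(rows, labels):
    """Map each class label to the list of rows carrying it."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return groups


def random_oversample(rows, labels, seed=0):
    """Replicate randomly chosen rows until every class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = max(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Randomly keep only as many rows per class as the smallest class has."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = min(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        out_rows.extend(rng.sample(group, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Toy data (made up): 8 households without a car, 2 with one.
rows = [[i, i % 3] for i in range(10)]
labels = ["no_car"] * 8 + ["car"] * 2

_, over_labels = random_oversample(rows, labels)
_, under_labels = random_undersample(rows, labels)
print(Counter(over_labels))   # both classes at the majority size
print(Counter(under_labels))  # both classes at the minority size
```

Oversampling keeps every original row but repeats minority rows, while undersampling discards majority rows, so the second approach trades information for a smaller, balanced training set.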

“…Imbalanced data is data that has an unbalanced distribution of response variable classes, the number of one class is less or more than the number of other data classes [5]. Imbalanced class that is not resolved can affect the performance of the model used [6]. The data balancing methods used in this research are Random Oversampling, Random Undersampling, and Synthetic Minority Over-sampling Technique for Nominal and Continuous (SMOTE-NC).…”

Section: Introduction (mentioning)

confidence: 99%

Performance of Random Oversampling, Random Undersampling, and SMOTE-NC Methods in Handling Imbalanced Class in Classification Models

Ratnasari

2024

int.jour.sci.res.mana

One common challenge in classification modeling is the existence of imbalanced classes within the data. If the analysis continues with imbalanced classes, it is probable that the result will demonstrate inadequate performance when forecasting new data. Various approaches exist to rectify this class imbalance issue, such as random oversampling, random undersampling, and the Synthetic Minority Over-sampling Technique for Nominal and Continuous (SMOTE-NC). Each of these methods encompasses distinct techniques aimed at achieving balanced class distribution within the dataset. Comparison of classification performance on imbalanced classes handled by these three methods has never been carried out in previous research. Therefore, this study undertakes an evaluation of classification models (specifically Gradient Boosting, Random Forest, and Extremely Randomized Trees) in the context of imbalanced class data. The results of this research show that the random undersampling method used to balance the class distribution has the best performance on two classification models (Random Forest and Gradient Boosted Tree).
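As a rough illustration of the synthetic oversampling idea named in this abstract, the sketch below implements plain SMOTE-style interpolation for continuous features only. It is not SMOTE-NC: the handling of nominal features is deliberately left out, and the function name, the parameter `k`, and the toy numbers are assumptions for illustration rather than anything reported in the study.

```python
import random


def smote_continuous(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (plain SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority class (made-up numbers): two continuous features per row.
minority = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.0], [1.2, 3.0]]
new_points = smote_continuous(minority, n_new=4)
print(new_points)
```

Because each synthetic point lies on a segment between two existing minority points, the new rows stay inside the region already occupied by the minority class instead of being exact duplicates.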

“…DT is a tree-based classifier in which the attribute that produces the highest information gain, or minimum Gini index at each tree level is selected to partition the data into increasingly homogeneous subgroups whilst RF is a type of ensemble algorithm that constructs a specified number of trees, with each tree constructed by sampling a subset from the training set at random and with replacement (Chaipanha and Kaewwichian, 2022; Shah et al., 2020). Conversely, MLP is a fully-connected neural network with an input layer, hidden layer(s) and an output layer.…”

Section: Modelling (mentioning)

confidence: 99%
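The split criteria and the sampling step mentioned in the statement above can be sketched in a few lines. This is only an illustration under assumed names and toy labels: it scores one candidate split by information gain and by weighted Gini impurity, and draws one bootstrap sample of the kind a random forest would use per tree. It does not build a tree or a forest.

```python
import math
import random
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_scores(parent, left, right):
    """Return (information gain, weighted Gini impurity) for one candidate split."""
    n = len(parent)
    w_left, w_right = len(left) / n, len(right) / n
    info_gain = entropy(parent) - (w_left * entropy(left) + w_right * entropy(right))
    weighted_gini = w_left * gini(left) + w_right * gini(right)
    return info_gain, weighted_gini


def bootstrap_sample(rows, seed=0):
    """Draw len(rows) rows at random with replacement."""
    rng = random.Random(seed)
    return [rng.choice(rows) for _ in rows]


# Toy labels (made up): a perfectly separating split versus an uninformative one.
parent = ["fatal"] * 4 + ["non_fatal"] * 4
pure = split_scores(parent, parent[:4], parent[4:])
mixed = split_scores(parent, parent[::2], parent[1::2])
print(pure, mixed)
print(bootstrap_sample(list(range(8))))
```

A tree learner would prefer the first split, since it has the higher information gain and the lower weighted Gini impurity; the two criteria agree on this toy case.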

Fatal structure fire classification from building fire data using machine learning

Balakrishnan, Mohammed Hashim, Lee et al.

2023

IJICC

Purpose: This study aims to develop a machine learning model to detect structure fire fatalities using a dataset comprising 11,341 cases from 2011 to 2019.

Design/methodology/approach: Exploratory data analysis (EDA) was conducted prior to modelling, in which ten machine learning models were experimented with.

Findings: The main fatal structure fire risk factors were fires originating from bedrooms, living areas and the cooking/dining areas. The highest fatality rate (20.69%) was reported for fires ignited due to bedding (23.43%), despite a low fire incident rate (3.50%). Using 21 structure fire features, Random Forest (RF) yielded the best detection performance with 86% accuracy, followed by Decision Tree (DT) with bagging (accuracy = 84.7%).

Research limitations/practical implications: Limitations of the study are pertaining to data quality and grouping of categories in the data pre-processing stage, which could affect the performance of the models.

Originality/value: The study is the first of its kind to manipulate risk factors to detect fatal structure classification, particularly focussing on structure fire fatalities. Most of the previous studies examined the importance of fire risk factors and their relationship to the fire risk level.

“…Chaipanha and Kaewwichian [47] | To provide a way for balancing the data using over- and under-sampling strategies. | kNN, NB, DTs | No

Manjushree, GH, Swamy and Giridharan [6] | To apply ML models to forecast the household characteristics that influence car ownership.…”

Section: Study | Study Aim | Model(s) Used | Hyperparameter Optimization (mentioning)

confidence: 99%

Targeting Sustainable Transportation Development: The Support Vector Machine and the Bayesian Optimization Algorithm for Classifying Household Vehicle Ownership

Xu, Aghaabbasi, Ali et al.

2022

Sustainability

Predicting household vehicle ownership (HVO) is a crucial component of travel demand forecasting. Furthermore, reliable HVO prediction is critical for achieving sustainable transportation development objectives in an era of rapid urbanization. This research predicted the HVO using a support vector machine (SVM) model optimized using the Bayesian Optimization (BO) algorithm. BO is used to determine the optimal SVM parameter values. This hybrid model was applied to two datasets derived from the US National Household Travel Survey dataset. Thus, two optimized SVM models were developed, namely SVMBO#1 and SVMBO#2. Using the confusion matrix, accuracy, receiver operating characteristic (ROC), and area under the ROC, the outcomes of these two hybrid models were examined. Additionally, the results of hybrid SVM models were compared with those of other machine learning models. The results demonstrated that the BO algorithm enhanced the performance of the standard SVM model for predicting the HVO. The BO method determined the Gaussian kernel to be the optimal kernel function for both datasets. The performance of the SVM#1 model was improved by 4.27% and 5.16% for the training and testing phases, respectively. For SVM#2 model, the performance of this model was improved by 1.20% and 2.14% for the training and testing phases, respectively. Moreover, the BO method enhanced the AUC of the SVM models used to predict the HVO. The hybrid SVM models also outperformed other machine learning models developed in this study. The findings of this study showed that SVM models hybridized with the BO algorithm can effectively predict the HVO and can be employed in the process of travel demand forecasting.
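The evaluation measures named in this abstract (confusion matrix, accuracy, and area under the ROC curve) can be sketched for a binary case as follows. This is not the SVM or Bayesian Optimization pipeline from the paper; the labels, scores, and the 0.5 threshold are made-up assumptions, and the AUC is computed with the pairwise-ranking definition rather than by tracing the curve.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fn, fp, tn) for a binary problem."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fn, fp, tn


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores, positive=1):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (made up): 1 = household owns a vehicle, 0 = it does not.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(confusion_counts(y_true, y_pred))
print(accuracy(y_true, y_pred))
print(auc(y_true, scores))
```

Accuracy depends on the chosen threshold, whereas the AUC summarizes ranking quality across all thresholds, which is one reason both are often reported together.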
