Many studies compare classification methods to find the one that best predicts the class of an observation, and several of them report that random forest performs best. However, classifying data that contain outliers and imbalanced classes remains a difficult problem, and much research is devoted to it. In this study, we propose winsorizing to handle outliers, replacing outlier values with the upper and lower limits obtained from the interquartile range method, and random oversampling to balance the data. The Human Development Index (HDI) of regencies/cities in eastern Indonesia varies widely, so it can serve as a case study of data containing outliers and imbalanced classes. The purpose of this study was to compare the performance of random forest before and after applying winsorizing and random oversampling when predicting HDI in regencies/cities in eastern Indonesia. After outliers and class imbalance were handled, random forest performed better in terms of accuracy and kappa, which reached 96.43% and 93.41%, respectively. Expenditure per capita and mean years of schooling were the most important variables.
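The two preprocessing steps described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the fence multiplier k = 1.5 (the conventional interquartile range rule, which the abstract does not state), and the choice to oversample every class up to the size of the largest class are all assumptions made for illustration.

```python
import numpy as np

def winsorize_iqr(x, k=1.5):
    """Clip values outside the IQR fences to the fence values.

    k = 1.5 is an assumed multiplier; the abstract does not specify it.
    """
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Values below `lower` become `lower`; values above `upper` become `upper`.
    return np.clip(x, lower, upper)

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes (with replacement)
    until every class has as many rows as the largest class."""
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts = []
    for c, n in zip(classes, counts):
        idx = np.flatnonzero(y == c)
        extra = rng.choice(idx, size=target - n, replace=True)
        parts.append(np.concatenate([idx, extra]))
    keep = np.concatenate(parts)
    return X[keep], y[keep]

# Hypothetical toy data: one extreme value and a 3-to-1 class imbalance.
values = winsorize_iqr([1, 2, 3, 4, 100])
X_bal, y_bal = random_oversample(np.arange(4).reshape(-1, 1), [0, 0, 0, 1])
```

In a real analysis the winsorizing would be applied to each predictor separately, and the balanced data would then be passed to a random forest classifier; whether the authors oversampled before or after splitting the data is not stated in the abstract.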
In unit-level small area estimation (SAE), the commonly used nested error regression (NER) model assumes normality, which does not always hold. To handle non-normal data, researchers in statistics have developed a novel approach based on exchangeable and extendible copulas, called the multivariate exchangeable copula (MEC) model. This study compares the performance of the parametric MEC and NER models in estimating the sub-district average of per capita expenditure (PCE) in Pidie Regency, Aceh Province. The PCE data in this study are skewed and follow a three-parameter skew-normal distribution. The parametric MEC model uses a Gaussian copula from the elliptical family and an empirical best unbiased prediction (EBUP) estimator, while the NER model uses an empirical best linear unbiased prediction (EBLUP) estimator. The results reveal that, at a 95% confidence level, the parametric MEC model outperforms the NER model, with a smaller root mean squared error (RMSE) and a more precise estimate of the sub-district average of PCE. This study highlights the importance of considering the parametric MEC model as an alternative method for skewed data in unit-level SAE. By providing average PCE estimates at the sub-district level, the results have the potential to support the achievement of Goal 1 (to end poverty) and Goal 10 (to reduce inequality) of the Sustainable Development Goals (SDGs).
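For reference, the normality assumption mentioned above can be made explicit by writing the NER model in its standard textbook form. The notation below is an assumption, not taken from the abstract: i indexes sub-districts (areas), j indexes units within an area, and the comparison criterion is the square root of the mean squared error of the estimated area mean.

```latex
% Standard unit-level nested error regression (NER) model; notation assumed.
\begin{align}
  y_{ij} &= \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + v_i + e_{ij},
    \qquad i = 1,\dots,m, \quad j = 1,\dots,N_i, \\
  v_i &\sim \mathcal{N}(0, \sigma_v^2), \qquad
  e_{ij} \sim \mathcal{N}(0, \sigma_e^2), \qquad
  v_i \text{ independent of } e_{ij}, \\
  \operatorname{RMSE}\bigl(\hat{\bar{Y}}_i\bigr)
    &= \sqrt{\operatorname{MSE}\bigl(\hat{\bar{Y}}_i\bigr)} .
\end{align}
```

Both the area effect v_i and the unit error e_{ij} are taken to be normal here, which is the assumption that skewed PCE data violate and that motivates the copula-based alternative.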