Advances in technology have made predicting students' performance one of the most useful and important research topics. Data mining is particularly helpful in education, especially for analyzing students' performance. Predicting students' performance is challenging because the datasets in this field are imbalanced, and no comparison of different resampling methods is available. This paper compares several resampling techniques, namely Borderline SMOTE, Random Over Sampler, SMOTE, SMOTE-ENN, SVM-SMOTE, and SMOTE-Tomek, for handling the imbalanced data problem while predicting students' performance on two different datasets. It also examines the difference between multiclass and binary classification and the structure of the features. To assess how well the resampling methods address the imbalance, the paper uses several machine learning classifiers: Random Forest, K-Nearest Neighbor, Artificial Neural Network, XGBoost, Support Vector Machine (Radial Basis Function), Decision Tree, Logistic Regression, and Naïve Bayes. Random hold-out and shuffle 5-fold cross-validation are used as model validation techniques. The results across different evaluation metrics indicate that fewer classes and nominal features lead to better model performance. Classifiers do not perform well on imbalanced data, so addressing this problem is necessary, and their performance improves on balanced datasets. The results of the Friedman test, a statistical significance test, confirm that SVM-SMOTE is more efficient than the other resampling methods. The Random Forest classifier achieves the best result among all models when SVM-SMOTE is used as the resampling method.
INDEX TERMS Classification, data mining, educational data mining, imbalanced data problem, machine learning, resampling methods, statistical analysis.
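As an illustration of the idea behind the SMOTE family of resampling methods named in the abstract above, the following is a minimal sketch of plain SMOTE only: each synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest minority-class neighbours. It is not the paper's implementation, and it does not cover the Borderline, SVM, ENN, or Tomek variants. The function name `smote`, its parameters, and the toy data are assumptions made for this example.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours.
    (Hypothetical helper written for this illustration.)"""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Sort the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy imbalanced data (made-up values): 8 majority points, 3 minority points.
majority = [(float(x), float(y)) for x in range(4) for y in range(2)]
minority = [(5.0, 5.0), (6.0, 5.5), (5.5, 6.5)]

# Oversample the minority class until both classes have the same size.
new_points = smote(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the region already occupied by the minority class. In practice, resampling of this kind is normally applied to the training split only, so that the validation data keep their original class distribution.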
Increasing technological advances in all fields have led to the collection of a considerable amount of data to be processed for different purposes. Data mining is the process of discovering and analyzing hidden information from different perspectives to obtain useful knowledge. Data mining has many applications, one of which is medical diagnosis. Today, many diseases are regarded as dangerous and deadly; heart disease, breast cancer, and diabetes are among the most dangerous. This paper investigates 168 articles on the use of data mining for diagnosing such diseases. The study concentrates on 85 selected papers, published between 1997 and 2018, that have received the most attention. All algorithms, data mining models, and evaluation methods are thoroughly reviewed. The study attempts to determine the most efficient data mining methods used for medical diagnosis. Another significant result of this study is the identification of research gaps in the application of data mining in health care.
With the development of biomedical equipment and healthcare, especially in the Intensive Care Unit (ICU), a considerable amount of data has been collected for analysis. Mortality prediction in ICUs is considered one of the most important topics in healthcare data analysis. A precise prediction of the mortality risk for patients in the ICU could provide valuable information about patients' lives and reduce costs at the earliest possible stage. This paper introduces a new hybrid predictive model that uses the Genetic Algorithm as a feature selection method and a new ensemble classifier, based on a combination of the Stacking and Boosting ensemble methods, to create an early mortality prediction model on a highly imbalanced dataset. The SVM-SMOTE method is used to address the imbalanced data problem. The paper compares the new model with various machine learning models to validate its efficiency. The results obtained with shuffle 5-fold cross-validation and random hold-out indicate that the new hybrid model performs best among the classifiers. Additionally, the Friedman test is applied as a statistical significance test to examine the differences between classifiers. The results of the statistical analysis indicate that the proposed model is more effective than the other classifiers. Furthermore, the proposed model is compared with the APACHE and SAPS scoring systems and benchmarked against state-of-the-art predictive models applied to the MIMIC dataset for experimental validation; it achieves promising results and outperforms the state-of-the-art models.
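The Friedman test mentioned above compares several classifiers by ranking them within each evaluation block (for example, each cross-validation fold) and checking whether the rank sums differ more than chance would suggest. The sketch below computes only the basic Friedman chi-square statistic, without a tie correction or a p-value, and uses made-up accuracy values, not results from the paper. The function names and the score table are assumptions for this example.

```python
def average_ranks(row):
    """Rank the scores in one block, 1 = highest; ties share the mean rank."""
    order = sorted(range(len(row)), key=lambda j: -row[j])
    ranks = [0.0] * len(row)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next score equals the current one.
        while end + 1 < len(order) and row[order[end + 1]] == row[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for j in order[pos:end + 1]:
            ranks[j] = mean_rank
        pos = end + 1
    return ranks


def friedman_statistic(scores):
    """scores[i][j] = score of classifier j in block i.
    Returns (chi-square statistic, mean rank per classifier)."""
    n = len(scores)       # number of blocks (e.g. folds)
    k = len(scores[0])    # number of classifiers
    rank_sums = [0.0] * k
    for row in scores:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    chi2 = (12.0 / (n * k * (k + 1))) * sum(r * r for r in rank_sums) \
        - 3.0 * n * (k + 1)
    return chi2, [r / n for r in rank_sums]


# Hypothetical accuracies: 5 folds (rows) x 3 classifiers (columns).
fold_scores = [
    [0.90, 0.85, 0.80],
    [0.88, 0.86, 0.79],
    [0.91, 0.84, 0.82],
    [0.89, 0.87, 0.81],
    [0.92, 0.83, 0.80],
]
chi2, mean_ranks = friedman_statistic(fold_scores)
print(chi2, mean_ranks)
```

In this toy table the first classifier wins every fold, so the rankings agree perfectly and the statistic reaches its maximum value of n(k - 1). To obtain a p-value, the statistic is compared against a chi-square distribution with k - 1 degrees of freedom; a library routine such as `scipy.stats.friedmanchisquare` handles that step along with tie correction.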
Purpose: The purpose of this paper is to estimate the energy efficiency of 132 countries from 2007 to 2014 according to their performance and to categorize the nations into similar groups. Design/methodology/approach: A data envelopment analysis model based on goal programming is used to determine efficiency, and the K-Means clustering algorithm is then used to group the nations by their efficiency performance. Findings: The results of the study reveal that developing low-income countries can achieve high energy-efficiency scores, and that countries with different development and income levels can become efficient in energy consumption. Following the nations over a seven-year period also indicates that changes in energy-related indicators, such as renewable energy consumption and energy productivity, are the main drivers that move a country between clusters. Originality/value: The study investigates whether nations with similar energy-efficiency levels in a cluster are also similar in their development and income levels, and whether changing the energy consumption pattern during the seven-year period can move a country from one cluster to another.
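The clustering step described above can be pictured with a small K-Means sketch on one-dimensional efficiency scores. The data envelopment analysis step is not reproduced here: the scores below are invented placeholders standing in for its output, the country labels are hypothetical, and the choice of three clusters is an assumption made for this example only.

```python
def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm on scalar values.
    Returns (cluster label per value, final cluster centres)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centres across the sorted scores.
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centre.
        labels = [
            min(range(k), key=lambda c: abs(v - centres[c])) for v in values
        ]
        # Update step: each centre moves to the mean of its members.
        new_centres = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centres.append(
                sum(members) / len(members) if members else centres[c]
            )
        if new_centres == centres:
            break  # converged: assignments can no longer change
        centres = new_centres
    return labels, centres


# Hypothetical efficiency scores in [0, 1] for six placeholder countries.
efficiency = {"A": 0.95, "B": 0.91, "C": 0.60, "D": 0.55, "E": 0.20, "F": 0.25}

labels, centres = kmeans_1d(list(efficiency.values()), k=3)
clusters = dict(zip(efficiency, labels))
print(clusters)
print(centres)
```

Running the same clustering on each year's scores and comparing a country's label across years is one simple way to observe movement between clusters, although cluster labels would first need to be aligned between runs (for example, by sorting the centres).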