2021
DOI: 10.3390/sym13020194

A New Oversampling Method Based on the Classification Contribution Degree

Abstract: Data imbalance is a thorny issue in machine learning. SMOTE is a famous oversampling method of imbalanced learning. However, it has some disadvantages such as sample overlapping, noise interference, and blindness of neighbor selection. In order to address these problems, we present a new oversampling method, OS-CCD, based on a new concept, the classification contribution degree. The classification contribution degree determines the number of synthetic samples generated by SMOTE for each positive sample. OS-CCD…
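The abstract does not spell out how the classification contribution degree is computed, but the step it modulates is the standard SMOTE interpolation between a minority sample and one of its minority-class nearest neighbors. Below is a minimal Python sketch of that step, assuming the per-sample synthetic counts are supplied externally; the function name `smote_with_counts` and the `ccd_counts` argument are hypothetical stand-ins, not the paper's actual implementation.

```python
import numpy as np

def smote_with_counts(X_pos, ccd_counts, k=5, rng=None):
    """SMOTE-style synthesis where minority sample i spawns ccd_counts[i]
    synthetic points. ccd_counts is a hypothetical stand-in for the
    degree-based allocation the abstract describes."""
    rng = rng or np.random.default_rng(0)
    # Pairwise distances among the minority (positive) samples only
    d = np.linalg.norm(X_pos[:, None, :] - X_pos[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                   # exclude self-matches
    neighbors = np.argsort(d, axis=1)[:, :k]      # k nearest minority neighbors

    synthetic = []
    for i, n_i in enumerate(ccd_counts):
        for _ in range(int(n_i)):
            j = neighbors[i, rng.integers(k)]     # pick a random neighbor
            gap = rng.random()                    # interpolation factor in [0, 1)
            synthetic.append(X_pos[i] + gap * (X_pos[j] - X_pos[i]))
    return np.asarray(synthetic)
```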

Cited by 62 publications (26 citation statements) · References 35 publications
“…Chang et al. applied the oversampling method to molecular description data and reported that it could be used to reduce the overfitting problem [59]. However, oversampling has some disadvantages, such as sample overlapping, noise interference, and blindness of neighbor selection [60]. The main disadvantage of oversampling is that, by making copies of existing data, overfitting becomes likely; in contrast, the main disadvantage of undersampling is the discarding of potentially useful data [61].…”
Section: Discussion
confidence: 99%
“…Other approaches to oversampling include, but are not limited to, the work of [91,92,93,94,78,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119]. What all oversampling methods have in common is the validation process: the classifier used on the oversampled datasets is evaluated with one or more performance measures, such as Accuracy, Precision, Recall, F-measure, G-mean, Specificity, Kappa, Matthews correlation coefficient (MCC), Area under the ROC Curve (AUC), True positive rate, False negative (FN), False positive (FP), True positive (TP), True negative (TN), and the ROC curve. Table 1 lists 72 oversampling methods, including their known names, references, the number of datasets utilized, the number of classes in these datasets, the classifiers employed, and the performance metrics used to validate the classification results after oversampling.…”
Section: Literature Review of Oversampling Methods
confidence: 99%
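As an illustration of two of the measures in this list (not code from the cited survey), G-mean and MCC can be computed directly from confusion-matrix counts; the helper below is hypothetical.

```python
import math

def gmean_mcc(tp, fp, fn, tn):
    """G-mean and Matthews correlation coefficient from confusion-matrix
    counts (illustrative helper, not from the cited survey)."""
    recall = tp / (tp + fn)                # true positive rate
    specificity = tn / (tn + fp)           # true negative rate
    gmean = math.sqrt(recall * specificity)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return gmean, mcc

# Example: a classifier evaluated on an imbalanced test set
print(gmean_mcc(tp=20, fp=5, fn=10, tn=100))
```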
“…For this reason, the existing data set was enriched with more records using the standard SMOTE method [23]. SMOTE is a popular machine-learning method for oversampling [29], in which synthetic examples of the minority class are generated in the feature space by interpolating between a minority sample and one of its selected k-nearest neighbors (k-NN) within the minority class [21]. This practice has been adopted in several biomedical studies [4,30–36].…”
Section: Data Enrichment
confidence: 99%
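For reference, here is a minimal sketch of this kind of SMOTE-based data enrichment using the imbalanced-learn library; the dataset below is synthetic, not the one from the cited study.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic imbalanced dataset (roughly 10% minority class)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

# SMOTE interpolates between each minority sample and one of its
# k nearest minority-class neighbors to create synthetic records.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))
```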