2017
DOI: 10.1155/2017/1827016

A Novel Ensemble Method for Imbalanced Data Learning: Bagging of Extrapolation-SMOTE SVM

Abstract: Class imbalance exists ubiquitously in real life and has attracted much interest from various domains. Learning directly from an imbalanced dataset may yield unsatisfying results, over-focusing on overall identification accuracy and deriving a suboptimal model. Various methodologies have been developed to tackle this problem, including sampling, cost-sensitive, and hybrid approaches. However, the samples near the decision boundary, which contain more discriminative information, should be valued, and the skew of the bo…
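The extrapolation idea named in the title can be illustrated with a minimal sketch: a synthetic minority point is generated on the ray from a majority-class neighbour through a minority support vector, pushing it outward toward the boundary region. The neighbour choice, the step size `lam`, and the function name here are illustrative assumptions, not the authors' exact algorithm.

```python
# Hedged sketch of boundary-oriented extrapolation, assuming feature
# vectors are plain Python lists. Real methods select support vectors
# via a trained SVM and choose neighbours with k-NN.
def extrapolate(x_sv, x_maj, lam=0.5):
    """Synthetic point beyond x_sv, on the ray from x_maj through x_sv."""
    return [sv + lam * (sv - mj) for sv, mj in zip(x_sv, x_maj)]

x_sv = [1.0, 2.0]   # minority support vector (illustrative)
x_maj = [0.0, 0.0]  # nearest majority-class sample (illustrative)
print(extrapolate(x_sv, x_maj))  # [1.5, 3.0]
```

With `lam > 0` the synthetic point moves away from the majority neighbour, whereas classic SMOTE interpolates between two minority samples instead.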

Cited by 117 publications (50 citation statements)
References 35 publications
“…Low occurrence rates in relatively small datasets lead to large class-imbalances that are a significant challenge in medical machine learning. 22,23 To this end, we have trained several supervised machine learning classifiers to predict the probability of postoperative complications in a relatively small dataset (<15,000 patients) that can accurately learn complications with relatively low occurrence rates (<1%). We have rigorously developed and tested our models by employing the best practices in machine learning in this study by performing automated feature selection, L2 regularization, testing on blinded hold-out data sets, and comparing to a standard risk-scoring system to ensure a high standard that is necessary for implementation of machine learning in clinical settings.…”
Section: Discussion
confidence: 99%
“…A thing to take note when using supervised method for training is imbalanced data: The predictive models developed using conventional machine learning algorithms could be biased and inaccurate because the number of observations in one class of the dataset is significantly lower than the other. To handle imbalanced data, several methods can be used, including resampling, boosting, bagging [17][18][19][20].…”
Section: Supervised Model
confidence: 99%
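The resampling-plus-bagging combination mentioned in the statement above can be sketched as a balanced bootstrap: each bagging round draws an equal number of samples per class before training its base classifier. This is a generic illustration of the idea, not the cited papers' exact procedure; all names here are hypothetical.

```python
# Minimal sketch of one balanced bootstrap draw for a bagging round,
# undersampling the majority class down to the minority count.
import random
from collections import Counter

def balanced_bootstrap(X, y, rng):
    """Bootstrap sample with equal counts per class."""
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    n = min(len(v) for v in by_class.values())  # minority count
    Xb, yb = [], []
    for label, items in by_class.items():
        for _ in range(n):  # sample with replacement
            Xb.append(rng.choice(items))
            yb.append(label)
    return Xb, yb

rng = random.Random(0)
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2  # imbalanced: 8 majority vs 2 minority
Xb, yb = balanced_bootstrap(X, y, rng)
print(Counter(yb))  # both classes appear twice
```

In a full bagging ensemble this draw would be repeated once per base learner, and predictions would be combined by majority vote.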
“…ADASYN [He, Bai, Garcia et al (2008)] is an important improvement of SMOTE, which generates the synthetic examples by the proportion of the majority ratio. SVM-SMOTE [Nguyen, Cooper and Kamei (2011);Wang, Luo, Huang et al (2017)] generates artificial support vectors by SMOTE and gets good experimental results. Although these algorithms have different generating tricks, the core generating method is still the selected line segment way.…”
Section: Related Work
confidence: 99%
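The "selected line segment way" that this statement describes as the core of SMOTE, ADASYN, and SVM-SMOTE places a synthetic sample at a random position on the segment between two minority samples. A minimal sketch of that shared generator (the pairing strategy is what differs per algorithm):

```python
# SMOTE's core generator, assuming feature vectors as plain lists:
# pick a random gap lam in [0, 1) and interpolate along the segment.
import random

def smote_line_segment(x_i, x_j, rng):
    """Random point on the segment between minority samples x_i and x_j."""
    lam = rng.random()
    return [a + lam * (b - a) for a, b in zip(x_i, x_j)]

rng = random.Random(42)
s = smote_line_segment([0.0, 0.0], [2.0, 4.0], rng)
# each coordinate of s lies between the corresponding parent coordinates
print(all(0.0 <= v <= 4.0 for v in s))  # True
```

ADASYN reuses this generator but allocates more synthetic points to minority samples surrounded by majority neighbours, while SVM-SMOTE restricts the parents to support vectors near the decision boundary.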