Ensemble Learning with Active Example Selection for Imbalanced Biomedical Data Classification

Oh, Sangyoon; Lee, Min Su; Zhang, Byoung-Tak

doi:10.1109/tcbb.2010.96

Cited by 76 publications

(15 citation statements)

References 20 publications

“…In SVIS, the authors decreased the size of the classifier committee. Ensemble classifiers were utilized in other algorithms to tackle real-world problems, e.g., selecting refined training sets from biomedical data (Oh et al 2011). …”

Section: Neighborhood Analysis Methods (mentioning)

confidence: 99%

“…Therefore, applying an appropriate approach for selecting desired training sets is inevitable. Oh et al (2011) investigated their SVM training set selection using such imbalanced sets for various diseases (leukemia, diabetes, Parkinson's disease, hepatitis, breast cancer and cardiac diseases). These datasets included up to 800 vectors (Diabetes dataset), and the number of features was up to almost 7200 in the Leukemia dataset.…”

Section: Datasets and Practical Applications (mentioning)

confidence: 99%

Selecting training sets for support vector machines: a review

2018

Support vector machines (SVMs) are a supervised classifier successfully applied in a plethora of real-life applications. However, they suffer from the important shortcomings of their high time and memory training complexities, which depend on the training set size. This issue is especially challenging nowadays, since the amount of data generated every second becomes tremendously large in many domains. This review provides an extensive survey on existing methods for selecting SVM training data from large datasets. We divide the state-of-the-art techniques into several categories. They help understand the underlying ideas behind these algorithms, which may be useful in designing new methods to deal with this important problem. The review is complemented with the discussion on the future research pathways which can make SVMs easier to exploit in practice.

“…The reason we choose the ensemble learning method is because it is believed to perform well for imbalanced data [29, 30, 32]. We employ an ensemble of 1000 deep trees that have minimal leaf size of 5 with a learning rate 0.1 in RUBoost learning to attain a high ensemble accuracy.…”

Section: Results (mentioning)

confidence: 99%

Diagnostic biases in translational bioinformatics

Han

2015

BMC Med Genomics

Background: With the surge of translational medicine and computational omics research, complex disease diagnosis is more and more relying on massive omics data-driven molecular signature detection. However, how to detect and prevent possible diagnostic biases in translational bioinformatics remains an unsolved problem despite its importance in the coming era of personalized medicine.

Methods: In this study, we comprehensively investigate the diagnostic bias problem by analyzing benchmark gene array, protein array, RNA-Seq and miRNA-Seq data under the framework of support vector machines for different model selection methods. We further categorize the diagnostic biases into different types by conducting rigorous kernel matrix analysis and provide effective machine learning methods to conquer the diagnostic biases.

Results: In this study, we comprehensively investigate the diagnostic bias problem by analyzing benchmark gene array, protein array, RNA-Seq and miRNA-Seq data under the framework of support vector machines. We have found that the diagnostic biases happen for data with different distributions and SVM with different kernels. Moreover, we identify total three types of diagnostic biases: overfitting bias, label skewness bias, and underfitting bias in SVM diagnostics, and present corresponding reasons through rigorous analysis. Compared with the overfitting and underfitting biases, the label skewness bias is more challenging to detect and conquer because it can be easily confused as a normal diagnostic case from its deceptive accuracy. To tackle this problem, we propose a derivative component analysis based support vector machines to conquer the label skewness bias by achieving the rivaling clinical diagnostic results.

Conclusions: Our studies demonstrate that the diagnostic biases are mainly caused by the three major factors, i.e. kernel selection, signal amplification mechanism in high-throughput profiling, and training data label distribution. Moreover, the proposed DCA-SVM diagnosis provides a generic solution for the label skewness bias overcome due to the powerful feature extraction capability from derivative component analysis. Our work identifies and solves an important but less addressed problem in translational research. It also has a positive impact on machine learning for adding new results to kernel-based learning for omics data.

“…The popular method to solve imbalanced data problem is random re-sampling technique which balances the number of training examples among classes [6]. Common random resampling techniques include the random over sampling (ROS) and the random under sampling (RUS).…”

Section: Related Work (mentioning)

confidence: 99%
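
The two resampling schemes named in the statement above can be sketched as follows. This is a minimal illustration using only the Python standard library, not code from any of the cited papers; the function name `random_resample`, its `mode` parameter, and the toy data are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def random_resample(samples, labels, mode="under", seed=0):
    """Balance class counts by random under-sampling (RUS) or over-sampling (ROS).

    Hypothetical helper for illustration; not taken from the cited papers.
    """
    if mode not in ("under", "over"):
        raise ValueError("mode must be 'under' or 'over'")
    rng = random.Random(seed)

    # Group the examples by class label.
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    sizes = [len(xs) for xs in by_class.values()]
    # RUS shrinks every class to the smallest one; ROS grows every class to the largest.
    target = min(sizes) if mode == "under" else max(sizes)

    out_x, out_y = [], []
    for y, xs in by_class.items():
        if mode == "under":
            chosen = rng.sample(xs, target)  # draw without replacement
        else:
            # Keep all originals, then add random duplicates up to the target size.
            chosen = xs + [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(chosen)
        out_y.extend([y] * target)
    return out_x, out_y


# Toy data: 2 minority examples (label 1) and 10 majority examples (label 0).
X = list(range(12))
y = [1] * 2 + [0] * 10
x_rus, y_rus = random_resample(X, y, mode="under")
x_ros, y_ros = random_resample(X, y, mode="over")
print(Counter(y_rus), Counter(y_ros))
```

RUS discards majority examples and ROS duplicates minority examples, so the two trade information loss against a risk of overfitting to repeated points.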

“…In many cases, the user is more interested in minority class. Thus, addressing and solving imbalanced data problem is very critical for improving classification performance [6]. Random forest [7] is an ensemble classifier that consists of many decision trees and outputs the class that is the majority of the classes of all the individual trees. The method combines bootstrap and the node randomly split technical to train multiple trees, and the classification result is decided by majority voting.…”

Section: Introduction (mentioning)

confidence: 99%

An Improved Random Forest Algorithm for Class-Imbalanced Data Classification and its Application in PAD Risk Factors Analysis

Yao, Yang, Zhan

2013

TOEEJ

Classification is one of the important research subjects in machine learning. However, most machine learning algorithms train a classifier on the assumption that each class has roughly the same number of training examples. When a classifier is trained on imbalanced data, its performance declines markedly. To address the class-imbalance problem, an improved random forest algorithm based on sampling with replacement was proposed. Multiple example subsets are drawn randomly with replacement from the majority class, each with the same number of examples as the minority class. Each extracted majority subset is then combined with the minority class dataset to construct a new training dataset, and a random forest classifier is trained on each of these datasets. For a new example, the class is determined by majority voting among the random forest classifiers. Experimental results on five UCI datasets and a real clinical dataset show that the proposed method can handle class-imbalanced data and that the improved random forest algorithm outperformed the original random forest and other methods in the literature.
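
The sampling-and-voting scheme described in this abstract can be sketched roughly as below. This is an illustrative outline only, not the authors' implementation: to stay dependency-free, a simple nearest-centroid learner stands in for the random forest base classifier, and every function name, parameter, and data point here is an assumption made for the example.

```python
import random
from collections import Counter


def train_centroid(rows, labels):
    """Stand-in base learner (NOT a random forest): one mean vector per class."""
    counts = Counter(labels)
    sums = {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
    return {c: [s / counts[c] for s in total] for c, total in sums.items()}


def predict_centroid(model, row):
    """Return the class whose centroid is closest (squared Euclidean distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(model[c], row))
    return min(model, key=dist)


def fit_balanced_ensemble(X, y, minority, n_models=5, seed=0):
    """Train one base learner per balanced training set."""
    rng = random.Random(seed)
    minority_rows = [r for r, l in zip(X, y) if l == minority]
    majority = [(r, l) for r, l in zip(X, y) if l != minority]
    models = []
    for _ in range(n_models):
        # Draw a majority subset with replacement, the same size as the minority class.
        subset = [rng.choice(majority) for _ in minority_rows]
        rows = minority_rows + [r for r, _ in subset]
        labels = [minority] * len(minority_rows) + [l for _, l in subset]
        models.append(train_centroid(rows, labels))
    return models


def predict_ensemble(models, row):
    """Majority vote over the base learners' predictions."""
    votes = Counter(predict_centroid(m, row) for m in models)
    return votes.most_common(1)[0][0]


# Toy imbalanced data: six majority points near the origin, two minority points near (3, 3).
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2],
     [0.0, 0.4], [0.4, 0.1], [3.0, 3.1], [3.2, 2.9]]
y = [0] * 6 + [1] * 2
models = fit_balanced_ensemble(X, y, minority=1)
print(predict_ensemble(models, [3.1, 3.0]), predict_ensemble(models, [0.1, 0.1]))
```

Because every base learner sees all minority examples but a different majority subset, the ensemble as a whole can still cover most of the majority class while each member trains on balanced data.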
