2017
DOI: 10.1186/s12859-017-1578-z
CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests

Abstract: Background: The random forests algorithm is a classifier with prominent universality, a wide application range, and robustness against overfitting. However, random forests still have some drawbacks. To improve their performance, this paper addresses imbalanced data processing, feature selection, and parameter optimization. Results: We propose the CURE-SMOTE algorithm for the imbalanced data classification problem. Experiments on imbalanced UCI data reveal that the co…
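The abstract only names the approach, so the sketch below is a rough, hedged reading of the CURE-SMOTE idea: filter minority-class noise with a clustering-style step before SMOTE interpolation. It is not the authors' exact procedure; the function name, the k-distance outlier filter standing in for CURE clustering, and all parameter values are our assumptions.

```python
import numpy as np

def cure_smote_sketch(X_min, n_synth, k=5, outlier_quantile=0.9, seed=None):
    """X_min: (n, d) minority-class samples; returns (n_synth, d) synthetic samples."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    # Distance to the k-th nearest neighbour (column 0 is the sample itself).
    kth = np.sort(d, axis=1)[:, k]
    # Stand-in for CURE's noise removal: drop samples lying in sparse regions.
    keep = kth <= np.quantile(kth, outlier_quantile)
    X_clean, d_clean = X_min[keep], d[np.ix_(keep, keep)]
    # SMOTE step: interpolate between a random sample and one of its k neighbours.
    nn = np.argsort(d_clean, axis=1)[:, 1:k + 1]
    i = rng.integers(0, len(X_clean), size=n_synth)
    j = nn[i, rng.integers(0, k, size=n_synth)]
    gap = rng.random((n_synth, 1))
    return X_clean[i] + gap * (X_clean[j] - X_clean[i])
```

With the default k=5 the sketch needs at least seven or so minority samples to behave sensibly; it is meant only to make the two-stage structure concrete.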

Cited by 184 publications (77 citation statements); references 63 publications.

Citation statements (ordered by relevance):
“…The recently proposed new re-sample methods [44] can be considered in future work. Furthermore, the feature extraction functions and the pre-trained classifiers of this method can be easily embedded into the LC-MS based quantitative proteomics analysis pipeline.…”
Section: Results (citation type: mentioning; confidence: 99%)
“…Moreover, the algorithm often fails due to storage and computation defects [49]. Finally, in the RF model, each tree randomly selects some samples and some features to avoid overfitting; consequently, the model features a good anti-noise ability and stable performance [50,51]. Furthermore, the RF model can handle very high-dimensional data and omit the work associated with feature selection [52].…”
Section: Discussion (citation type: mentioning; confidence: 99%)
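The row- and column-subsampling behaviour described in that statement maps directly onto scikit-learn's RandomForestClassifier. A brief illustration; the toy dataset and the chosen parameter values are ours, not from the cited papers:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rf = RandomForestClassifier(
    n_estimators=100,    # number of trees in the forest
    bootstrap=True,      # each tree is fit on a random bootstrap sample of the rows
    max_features="sqrt", # each split considers a random subset of the features
    random_state=0,
).fit(X, y)
```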
“…The most well-known oversampling method is synthetic minority oversampling technique (SMOTE) proposed by Chawla et al [15]. The main idea of SMOTE [15][16][17][18][19] is to identify k minority class neighbors close to each minority class sample, then randomly select a point between the sample and its neighbors as the synthetic sample. But SMOTE produces new samples with certain blindness and may make class overlapping more serious.…”
Section: Related Work (citation type: mentioning; confidence: 99%)
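The SMOTE procedure quoted above is available off the shelf in imbalanced-learn. A small example of the mechanism; the SMOTE class and k_neighbors parameter are the library's, while the toy data and values are ours:

```python
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (90, 2)),   # majority class
               rng.normal(3.0, 1.0, (10, 2))])  # minority class
y = np.array([0] * 90 + [1] * 10)

# Each synthetic point lies on the segment between a minority sample
# and one of its k_neighbors nearest minority neighbours.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(np.bincount(y_res))  # -> [90 90]: minority class oversampled to parity
```

The "blindness" the citing authors mention is visible here: interpolation ignores the majority-class geometry entirely, which is precisely the overlap problem CURE-SMOTE's clustering filter is meant to mitigate.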