An ensemble classifier approach for microRNA precursor (pre-miRNA) classification was
proposed based upon combining a set of heterogeneous algorithms including support vector
machine (SVM), k-nearest neighbors (kNN) and random forest (RF), then aggregating their
predictions through a voting system. In addition to the ensemble itself,
classification performance was improved by using discriminative features,
self-containment and its derivatives, which capture the unique structural robustness
of pre-miRNAs and are applicable across different species. By applying
preprocessing methods, namely a correlation-based feature selection (CFS) with a genetic
algorithm (GA) search method and a modified Synthetic Minority Oversampling Technique
(SMOTE) bagging rebalancing method, the performance of this ensemble was further
improved. The overall prediction accuracy obtained via 10 runs of 5-fold cross
validation (CV) was 96.54%, with a sensitivity of 94.8% and a specificity of
98.3%, a better trade-off between sensitivity and specificity than
that of other state-of-the-art methods. The ensemble model was applied to animal, plant
and virus pre-miRNAs and achieved high accuracy (>93%). The
discriminative set of selected features also suggests that pre-miRNAs possess high
intrinsic structural robustness compared with other stem loops. Our heterogeneous
ensemble method gave relatively more reliable predictions than those obtained using
single classifiers. Our program is available at http://ncrna-pred.com/premiRNA.html.
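The voting step and the reported sensitivity and specificity can be illustrated with a minimal sketch. It assumes simple unweighted majority voting over binary labels (1 = pre-miRNA, 0 = other stem loop); the abstract does not specify the exact voting rule, and the function names and toy predictions below are hypothetical, not taken from the authors' program.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample.

    `predictions` is a list of equal-length label lists, one per classifier.
    Assumption: unweighted majority voting; with an odd number of binary
    classifiers no ties can occur.
    """
    n_samples = len(predictions[0])
    combined = []
    for i in range(n_samples):
        votes = Counter(p[i] for p in predictions)
        combined.append(votes.most_common(1)[0][0])
    return combined


def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical outputs of three base classifiers on six stem loops.
svm_pred = [1, 1, 0, 0, 1, 0]
knn_pred = [1, 0, 0, 1, 1, 0]
rf_pred = [1, 1, 1, 0, 0, 0]
truth = [1, 1, 0, 0, 1, 1]

ensemble_pred = majority_vote([svm_pred, knn_pred, rf_pred])
sens, spec = sensitivity_specificity(truth, ensemble_pred)
print(ensemble_pred, sens, spec)  # [1, 1, 0, 0, 1, 0] 0.75 1.0
```

In this toy case each base classifier makes at least one error that the other two outvote, which is the intuition behind preferring a heterogeneous ensemble to any single classifier.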
To identify non-coding RNA (ncRNA) signals within genomic regions, a classification tool was developed based on a hybrid random forest (RF) with a logistic regression model to efficiently discriminate short ncRNA sequences as well as long complex ncRNA sequences. This RF-based classifier was trained on a well-balanced dataset with a discriminative set of features and achieved an accuracy, sensitivity and specificity of 92.11%, 90.7% and 93.5%, respectively. The selected feature set includes a newly proposed feature, SCORE. This feature is generated by a logistic regression function that combines five significant features (structure, sequence, modularity, structural robustness and coding potential) to enable improved characterization of long ncRNA (lncRNA) elements. The use of SCORE improved the performance of the RF-based classifier in the identification of Rfam lncRNA families. A genome-wide ncRNA classification framework was applied to a wide variety of organisms, with an emphasis on those of economic, social, public health, environmental and agricultural significance, such as various bacterial genomes, the Arthrospira (Spirulina) genome, and rice and human genomic regions. Our framework was able to identify known ncRNAs with sensitivities of greater than 90% and 77.7% for prokaryotic and eukaryotic sequences, respectively. Our classifier is available at http://ncrna-pred.com/HLRF.htm.
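The idea of SCORE, a single feature produced by a logistic regression function over five inputs, can be sketched as follows. This is only an illustration of the general form: the feature keys, weights and bias below are hypothetical placeholders, since the abstract does not give the fitted coefficients or the scaling of the individual features.

```python
import math

# Hypothetical keys for the five features named in the abstract.
FEATURES = ("structure", "sequence", "modularity", "robustness", "coding_potential")


def score(features, weights, bias):
    """Logistic combination of the five features into one value in (0, 1).

    Assumption: a standard logistic function applied to a weighted sum;
    the weights and bias would normally be fitted on training data.
    """
    z = bias + sum(weights[name] * features[name] for name in FEATURES)
    return 1.0 / (1.0 + math.exp(-z))


# Placeholder coefficients and one placeholder candidate sequence.
example_weights = {
    "structure": 1.2,
    "sequence": 0.8,
    "modularity": 0.5,
    "robustness": 1.0,
    "coding_potential": -1.5,  # higher coding potential argues against ncRNA
}
example_features = {
    "structure": 0.7,
    "sequence": 0.4,
    "modularity": 0.6,
    "robustness": 0.9,
    "coding_potential": 0.1,
}

print(score(example_features, example_weights, bias=-1.0))
```

The resulting value would then be passed to the random forest as one more input alongside the other selected features.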