Motivation: Many studies have shown that RNA secondary structure plays a vital role in fundamental cellular processes, such as protein synthesis, mRNA processing, mRNA assembly, ribosome function and eukaryotic spliceosomes. Identifying RNA secondary structure is a key step toward understanding the common mechanisms underlying the translation process. Recently, a few experimental methods have been developed to measure genome-wide RNA secondary structure profiles through high-throughput sequencing techniques, and they have been successfully applied to genomes including yeast and human. However, these high-throughput methods usually have low precision and cannot cover all nucleotides on the RNA because of limited sequencing coverage.
Results: In this study, we developed a new method for the prediction of genome-wide RNA secondary structure profiles (TH-GRASP) from RNA sequence based on eXtreme Gradient Boosting (XGBoost). The method achieves areas under the receiver operating characteristic curve (AUC) greater than 0.9 on three different datasets, and an AUC of 0.892 in an independent test on the recently released Zika virus RNA dataset. These AUCs represent a consistent increase of >6% over the recently developed method CROSS, which is trained with a shallow neural network. A further analysis of the 1000 Genomes Project data showed that our predicted unpaired probabilities at mutation sites are highly correlated with the minor allele frequencies (MAF) of synonymous mutations, nonsynonymous mutations, and mutations in the 3' and 5' UTRs, with Pearson correlation coefficients (PCCs) all above 0.8. These PCCs are consistently higher than those generated by the RNAplfold method. Moreover, an investigation over all human mRNAs indicated a periodic distribution of the predicted unpaired probability on codons, and a decrease of paired probability at the boundaries with the 5' and 3' untranslated regions. These results highlight that, by fitting high-throughput genomic data for RNA secondary structure profiles, TH-GRASP effectively removes experimental noise and can make predictions for nucleotides with low or no coverage. They also suggest that building models on high-throughput experimental data might be a future direction for replacing analytical methods.
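
For readers unfamiliar with the two evaluation metrics reported above, the following is a minimal sketch of how an AUC and a Pearson correlation coefficient can be computed. It is not the TH-GRASP implementation: the function names `roc_auc` and `pearson` and all input values are hypothetical toy data chosen for illustration, and none of the numbers come from the study.

```python
from math import sqrt


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive example receives the higher score; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: 1 = unpaired nucleotide, 0 = paired nucleotide,
# with made-up predicted unpaired probabilities for each position.
toy_labels = [1, 0, 1, 1, 0, 0]
toy_scores = [0.9, 0.2, 0.7, 0.4, 0.5, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)

# Hypothetical toy data: made-up predicted probabilities versus made-up
# allele frequencies, constructed to be perfectly linearly related.
toy_pcc = pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

The pairwise AUC loop is quadratic in the number of positions, which is adequate for a small illustration; a rank-based formulation or an established library routine would be the usual choice for genome-scale data.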