Genome-scale characterization of RNA tertiary structures and their functional impact by RNA solvent accessibility prediction

Yang, Yuedong; Li, Xiaomei; Zhao, Huiying; Zhan, Jian; Wang, Jihua; Zhou, Yaoqi

doi:10.1261/rna.057364.116

Cited by 31 publications

(49 citation statements)

References 57 publications

“…This is probably because synonymous mutations that don't change expressed proteins affect biological functions mainly through the change of RNA secondary structure. The predictions by RNAplfold achieved a PCC of 0.749 that's greater than the PCC of 0.473 with the predicted ASA(accessible surface area) from the RNAsnap-seq, consistent with the previous study (Yang, et al, 2017). This ranking order is consistent with all other four types of mutations, non-synonymous mutations, stop-gain mutations, and mutations occurring in the 3'UTR (untranslated region), and 5'UTR regions (Figure 3 and Figure S1-S3 in supplemental file).…”

Section: Relation Of Predicted Secondary Structure With the Maf Of Ge (supporting)

confidence: 90%
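The PCC values quoted in the statement above are Pearson correlation coefficients between a predicted per-site quantity and the minor allele frequency. As a rough illustration of how such a coefficient is computed, here is a minimal sketch in plain Python. The function name `pearson_cc` and the sample numbers are assumptions made up for this example; they are not data or code from the cited studies.

```python
import math

def pearson_cc(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances are left unnormalized; the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Made-up illustrative values, NOT data from any of the cited studies.
predicted_unpaired = [0.10, 0.25, 0.40, 0.55, 0.70]
observed_maf = [0.02, 0.05, 0.07, 0.12, 0.13]
pcc = pearson_cc(predicted_unpaired, observed_maf)
print(f"PCC = {pcc:.3f}")
```

A value near 1 means the two quantities rise together almost linearly; the statement above compares two predictors by this measure.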

Accurate Prediction of Genome-wide RNA Secondary Structure Profile Based On Extreme Gradient Boosting

Rao

Zhao

et al. 2019

Preprint

Self Cite

Motivation: Many studies have shown that RNA secondary structure plays a vital role in fundamental cellular processes, such as protein synthesis, mRNA processing, mRNA assembly, ribosome function and eukaryotic spliceosomes. Identification of RNA secondary structure is a key step toward understanding the common mechanisms underlying the translation process. Recently, a few experimental methods have been developed to measure genome-wide RNA secondary structure profiles through high-throughput sequencing techniques, and they have been successfully applied to genomes including yeast and human. However, these high-throughput methods usually have low precision and cannot cover all nucleotides of an RNA because of limited sequencing coverage.

Results: In this study, we developed a new method for the prediction of genome-wide RNA secondary structure profile (TH-GRASP) from RNA sequence based on eXtreme Gradient Boosting (XGBoost). The method achieves areas under the receiver operating characteristic curve (AUC) greater than 0.9 on three different datasets, and an AUC of 0.892 in an independent test on the recently released Zika virus RNA dataset. These AUCs represent a consistent increase of >6% over the recently developed method CROSS, which is trained by a shallow neural network. A further analysis of the 1000 Genomes Project data showed that our predicted unpaired probabilities at mutation sites are highly correlated with the minor allele frequencies (MAF) of synonymous mutations, nonsynonymous mutations, and mutations in the 3' and 5' UTRs, with Pearson correlation coefficients (PCCs) all above 0.8. These PCCs are consistently higher than those generated by the RNAplfold method. Moreover, an investigation of all human mRNAs indicated a periodic distribution of the predicted unpaired probability on codons, and a decrease of paired probability at the boundaries with the 5' and 3' untranslated regions. These results highlight that TH-GRASP is effective at removing experimental noise and can make predictions for nucleotides with low or no coverage by fitting high-throughput genomic data for RNA secondary structure profiles. They also suggest that building models on high-throughput experimental data might be a future direction for replacing analytical methods.

“…This is because sequence conservations in regions with different flexibility have different patterns. In a previous study, we have obtained evolution-based sequence profiles by querying the RNA sequences against RNA sequence library using BLASTN with E-value < 0.001 and maximum of 50,000 homologous sequences. The $j$ base probability ($j$ = A, T/U, G, C) in multiple aligned homologous sequences at a given position $i$, $P_{i,j}$, was calculated as $P_{i,j} = -\log\left[ N_{i,j} / \sum_{j} N_{i,j} \right]$, where $N_{i,j}$ is the number of observed base type $j$ at position $i$.…”

Section: Methods (mentioning)

confidence: 99%

“…$s(b_i)$ was set to 0.3 for the other base type $b_i$ and 9.0 for the query base type. The obtained sequence profiles were normalized to a range of (–1, 1) before used for training and test …”

Section: Methods (mentioning)

confidence: 99%
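The two Methods statements above describe a per-position profile, P(i, j) = -log[N(i, j) / sum over j of N(i, j)], followed by normalization to a range of (-1, 1). A minimal sketch of that calculation is given below. Several choices are assumptions of this example, not details confirmed by the quoted text: the function names, the uniform `pseudocount` (added only to avoid log(0) and unrelated to the s(b_i) weights quoted above, whose exact role is truncated), the natural logarithm, skipping gaps, and min-max scaling as the normalization.

```python
import math

BASES = "AUGC"  # base order A, T/U, G, C as in the quoted statement

def base_profile(aligned_seqs, pseudocount=1.0):
    """Per-position scores P[i][j] = -log(N[i][j] / sum_j N[i][j]).

    `pseudocount` is a hypothetical smoothing term added by this sketch so
    that unobserved bases do not produce log(0).
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("aligned sequences must share one length")
    profile = []
    for i in range(length):
        counts = {b: pseudocount for b in BASES}
        for seq in aligned_seqs:
            base = seq[i].upper().replace("T", "U")
            if base in counts:  # gaps and ambiguity codes are skipped
                counts[base] += 1
        total = sum(counts.values())
        profile.append([-math.log(counts[b] / total) for b in BASES])
    return profile

def scale_to_unit_range(profile):
    """Min-max rescale all scores to [-1, 1] (an assumed normalization)."""
    flat = [v for row in profile for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in profile]
    return [[2 * (v - lo) / (hi - lo) - 1 for v in row] for row in profile]

# Toy alignment invented for illustration only.
alignment = ["ACGU", "ACGA", "AGGU", "A-GU"]
profile = base_profile(alignment)
scaled = scale_to_unit_range(profile)
```

Under this definition, frequently observed bases get scores near zero and rare bases get large scores, so lower values mark conserved positions.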

B‐factor profile prediction for RNA flexibility using support vector machines

et al. 2017

Self Cite

Determining the flexibility of structured biomolecules is important for understanding their biological functions. One quantitative measurement of flexibility is the atomic Debye-Waller factor or temperature B-factor. Most existing studies are limited to temperature B-factors of proteins and their prediction. Only one method attempted to predict temperature B-factors of ribosomal RNA. Here, we developed and compared machine-learning techniques in prediction of temperature B-factors of RNAs. The best model based on Support Vector Machines yields a Pearson's correlation coefficient of 0.51 for fivefold cross validation and 0.50 for the independent test. Analysis of the performance indicates that the model has the best performance on rRNAs, tRNAs, and protein-bound RNAs, for long chains in particular. The server is available at http://sparks-lab.org/server/RNAflex. © 2017 Wiley Periodicals, Inc.

“…Moreover, it is the lowest sequence identity cutoff allowed by the program CD-HIT [59]. This cutoff was also employed previously for establishing non-redundant RNA sequences [61,62]. In addition to the HTlncRNA set as the negative set, we also included mRNAs from GENCODE V19 as the negative set. These mRNAs were randomly selected with <80% sequence similarity between each other and from selected HTlncRNAs and EVlncRNAs.…”

Section: Training and Test Datasets For Human Lncrnas (mentioning)

confidence: 99%

“…Predicted solvent accessible surface area (ASA) of RNA. RNA ASA values were predicted by RNAsnap [61].…”

Section: Features Based On Sequences (mentioning)

confidence: 99%

Predicting functional long non-coding RNAs validated by low throughput experiments

Zhou

Yang

Zhan

et al. 2019

Preprint

Self Cite

High-throughput techniques have uncovered hundreds and thousands of long non-coding RNAs (lncRNAs). Among them, only a small fraction has experimentally validated functions (EVlncRNAs) by low-throughput methods. What fraction of lncRNAs from high-throughput experiments (HTlncRNAs) is truly functional is an active subject of debate. Here, we developed the first method to distinguish EVlncRNAs from HTlncRNAs and mRNAs by using Support Vector Machines and found that EVlncRNAs can be well separated from HTlncRNAs and mRNAs with 0.6 for Matthews correlation coefficient, 64% for sensitivity, and 81% for precision for the independent human test set. The most discriminative features are related to sequence conservations at RNA (for separating from HTlncRNAs) and protein (for separating from mRNA) levels. The method is found to be robust as the human-RNA-trained model is applicable to independent mouse RNAs with similar accuracy and to a lesser extent to plant RNAs. The method can recover newly discovered EVlncRNAs with high sensitivity. Its application to randomly selected 2000 human HTlncRNAs indicates that a large number of functional lncRNAs are waiting to be validated. The method is expected to speed up and reduce the cost of the discovery by prioritizing potentially functional lncRNAs prior to experimental validation. EVlncRNA-pred is available as a web server at http://biophy.dzu.edu.cn/lncrnapred/index.html. All datasets used in this study can be obtained from the same website.
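The abstract above reports Matthews correlation coefficient, sensitivity, and precision for a binary classifier. For readers unfamiliar with these measures, the sketch below shows their standard definitions in terms of confusion-matrix counts. The function name `binary_metrics` and the counts are hypothetical and chosen only for illustration; they are not the confusion matrix of the cited study.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, precision and MCC from confusion-matrix counts."""
    # Sensitivity (recall): fraction of true positives that were recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Precision: fraction of positive predictions that were correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # MCC uses all four cells; it is set to 0 here when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "precision": precision, "mcc": mcc}

# Hypothetical counts for illustration, NOT results from the cited study.
metrics = binary_metrics(tp=40, fp=5, tn=45, fn=10)
print(metrics)
```

MCC ranges from -1 to 1 and stays informative when the positive and negative sets differ in size, which is one reason it is often reported alongside sensitivity and precision.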
