Motivation: Determining RNA binding protein (RBP) binding specificity is crucial for understanding many cellular processes and genetic disorders. RBP binding is known to be affected by both the sequence and the structure of RNAs. Deep learning can be used to learn generalizable representations of raw data and has improved the state of the art in several fields such as image classification, speech recognition and even genomics. Previous work on RBP binding has used either shallow models that combine sequence and structure or deep models that use only the sequence. Here we combine both abilities by augmenting and refining the original DeepBind architecture to capture structural information and obtain significantly better performance. Results: We propose two deep architectures: a lightweight convolutional network for transcriptome-wide inference, and a Long Short-Term Memory (LSTM) network that is suitable for small batches of data. We incorporate computationally predicted secondary structure features as input to our models and show their effectiveness in boosting prediction performance. Our models achieved significantly higher correlations on held-out in vitro test data compared to previous approaches, and generalise well to in vivo CLIP-seq data, achieving higher median AUCs than other approaches. We analysed the output of our model for VTS1 and CPO and provide intuition into how it works. Our models confirmed known secondary structure preferences for some proteins and identified new proteins for which secondary structure might play a role. We also demonstrated the strengths of our model compared to other approaches, such as the ability to combine information from long distances along the input. Availability: Software and models are available at https://github.com/shreshthgandhi/cDeepbind
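The idea of feeding predicted secondary structure alongside the sequence into a convolutional model can be illustrated with a minimal sketch. It is not the cDeepbind implementation: the single "paired probability" structure channel, the function names `encode` and `conv_max`, and the hand-written kernel are all assumptions made for illustration.

```python
from typing import List

BASES = "ACGU"  # channel order of the one-hot encoding (an assumption)


def encode(seq: str, paired_prob: List[float]) -> List[List[float]]:
    """Return one row per position: 4 one-hot base channels + 1 structure channel."""
    if len(seq) != len(paired_prob):
        raise ValueError("sequence and structure profile must have equal length")
    rows = []
    for base, p in zip(seq.upper(), paired_prob):
        one_hot = [1.0 if base == b else 0.0 for b in BASES]
        rows.append(one_hot + [p])
    return rows


def conv_max(x: List[List[float]], kernel: List[List[float]], bias: float = 0.0) -> float:
    """Scan one filter over all positions, apply ReLU, then take the global max."""
    width = len(kernel)
    if len(x) < width:
        raise ValueError("input is shorter than the kernel")
    best = 0.0
    for start in range(len(x) - width + 1):
        act = bias
        for k in range(width):
            act += sum(w * v for w, v in zip(kernel[k], x[start + k]))
        best = max(best, max(act, 0.0))  # ReLU followed by max pooling
    return best


# Hypothetical filter: rewards "UGG" and penalises paired (structured) positions.
kernel = [
    [0.0, 0.0, 0.0, 1.0, -1.0],  # U, prefers unpaired
    [0.0, 0.0, 1.0, 0.0, -1.0],  # G, prefers unpaired
    [0.0, 0.0, 1.0, 0.0, -1.0],  # G, prefers unpaired
]
open_site = encode("ACUGGA", [0.9, 0.9, 0.1, 0.1, 0.1, 0.9])
closed_site = encode("ACUGGA", [0.9] * 6)
print(conv_max(open_site, kernel), conv_max(closed_site, kernel))
```

The same sequence scores higher when the motif lies in a predicted unpaired region, which is the kind of signal a sequence-only model cannot express.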
When estimating expression of a transcript or part of a transcript using RNA-seq data, it is commonly assumed that reads are generated uniformly from positions within the transcript. While this assumption is acceptable for long transcript sequences where reads from many positions are averaged, it frequently leads to large errors for short sequences, e.g., less than 100 bp. Analysis of short sequences, such as when studying splice junctions and microRNAs, is increasingly important and necessitates addressing errors in short-sequence expression estimation. Indeed, when we examined RNA-seq data from diverse studies, we found that large errors are introduced by variations in RNA-seq coverage due to sequence content, experimental conditions and sample preparation. We developed a technique that we call the positional bootstrap, which quantifies the level of uncertainty in expression induced by nonuniform coverage. Unlike methods that attempt to correct for biases in coverage, but do so by making strong assumptions about the form of those biases, the positional bootstrap can quantify the noise induced by all types of bias, including unknown ones. Results obtained using independently generated RNA-seq datasets show that the positional bootstrap increases the accuracy of estimates of alternative splicing levels, tissue-differential alternative splicing and tissue-differential expression by a factor of up to 10. A Python implementation of the algorithm to quantify splicing levels is freely available from github.com/PSI-Lab/BENTO-Seq.
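The abstract does not spell out the procedure, so the following is only a minimal sketch of one plausible reading: resample the positions of a short region with replacement and look at the spread of the resulting expression estimates. The function names, the use of the mean per-position read count as the expression estimate, and the percentile interval are assumptions, not the BENTO-Seq implementation.

```python
import random
from typing import List, Tuple


def positional_bootstrap(counts: List[int], n_boot: int = 1000, seed: int = 0) -> List[float]:
    """Resample positions with replacement; return sorted bootstrap mean coverages."""
    if not counts:
        raise ValueError("need at least one position")
    rng = random.Random(seed)
    n = len(counts)
    means = []
    for _ in range(n_boot):
        sample = rng.choices(counts, k=n)  # draw n positions with replacement
        means.append(sum(sample) / n)
    means.sort()
    return means


def interval(sorted_samples: List[float], level: float = 0.9) -> Tuple[float, float]:
    """Central percentile interval of already sorted bootstrap estimates."""
    last = len(sorted_samples) - 1
    lo_idx = int((1.0 - level) / 2.0 * last)
    hi_idx = int((1.0 + level) / 2.0 * last)
    return sorted_samples[lo_idx], sorted_samples[hi_idx]


# Two toy regions with the same total read count (100 reads over 20 positions).
uniform = [5] * 20
spiky = [0] * 18 + [50, 50]
print(interval(positional_bootstrap(uniform)))
print(interval(positional_bootstrap(spiky)))
```

Both toy regions have the same average coverage, but the interval for the spiky one is much wider, which is the uncertainty that a uniform-coverage assumption would hide.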
Background: Accurate prediction of epitopes presented by human leukocyte antigen (HLA) is crucial for personalized cancer immunotherapies targeting T cell epitopes. Mass spectrometry (MS) profiling of eluted HLA ligands, which provides unbiased, high-throughput measurements of HLA-associated peptides in vivo, could be used to faithfully model the presentation of epitopes on the cell surface. In addition, gene expression profiles measured by RNA-seq in a specific cell or tissue type can significantly improve the performance of epitope presentation prediction. However, although large amounts of high-quality MS data on HLA-bound peptides have been generated in recent years, few datasets provide matching RNA-seq data, which makes incorporating gene expression into epitope prediction difficult. Methods: We collected publicly available HLA peptidome and matching RNA-seq data for 34 cell lines derived from various sources. We built position-specific scoring matrices (PSSMs) for 21 HLA-I alleles based on these MS data, then used logistic regression (LR) to model the relationship among PSSM score, gene expression and peptide length to predict whether a peptide could be presented in each cell line. Comparing the feature weights and biases across different HLA-I alleles and cell lines, we observed a universal relationship among these three variables. To confirm this, we built a single LR model by pooling PSSM scores, gene expression levels and peptide length features across different HLA alleles and cell lines, and compared its performance with the allele- and cell line-specific LR models. Indeed, the predictive power of the two approaches did not differ significantly across cell lines and HLA alleles, and both substantially outperformed predictions based on PSSM scores alone.
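The combination of a PSSM score with expression and peptide length in a logistic model can be sketched as follows. This is a toy illustration, not the EPIC code: the uniform background, the pseudocount, the log10(TPM + 1) transform, the per-length offsets, the weight values and the three example peptides are all assumptions chosen only to make the example run.

```python
import math
from collections import Counter
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(ligands: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Log-odds matrix (vs. a uniform background) from equal-length ligands."""
    length = len(ligands[0])
    if any(len(p) != length for p in ligands):
        raise ValueError("all ligands must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(ligands) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for pos in range(length):
        counts = Counter(p[pos] for p in ligands)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def pssm_score(pssm: List[Dict[str, float]], peptide: str) -> float:
    """Sum of per-position log-odds for a peptide of matching length."""
    if len(peptide) != len(pssm):
        raise ValueError("peptide length does not match the PSSM")
    return sum(column[aa] for column, aa in zip(pssm, peptide))


def presentation_probability(score: float, tpm: float, length: int, w: dict) -> float:
    """Logistic model over PSSM score, log expression and a per-length offset."""
    z = (w["bias"] + w["pssm"] * score
         + w["expr"] * math.log10(tpm + 1.0)
         + w["length"].get(length, 0.0))
    return 1.0 / (1.0 + math.exp(-z))


# Toy ligands and hypothetical weights (not fitted to any data).
toy_ligands = ["ALDKFTGAV", "ALNEFTGSV", "SLDKYTGAV"]
weights = {"bias": -4.0, "pssm": 0.3, "expr": 1.5, "length": {9: 0.5, 10: 0.0}}
pssm = build_pssm(toy_ligands)
score = pssm_score(pssm, "ALDKFTGAV")
print(presentation_probability(score, tpm=50.0, length=9, w=weights))
```

In a real setting the weights would be fitted by logistic regression on eluted ligands and decoys; here they are fixed by hand so that the dependence on each feature is easy to see.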
Based on this finding, we further built a universal LR model, termed Epitope Presentation Integrated prediCtion (EPIC), based on more than 180,000 unique HLA ligands collected from public sources and ~3,000 HLA ligands that we generated ourselves, to predict epitope presentation for 66 common HLA-I alleles. Results: When evaluated on large, independent HLA eluted ligand datasets, EPIC performed substantially better than other popular methods, including MixMHCpred (v2.0), NetMHCpan (v4.0), and MHCflurry (v1.2.2), with an average 0.1% positive predictive value (PPV) of 51.59%, compared to 36.98%, 36.41%, 24.67% and 23.39% achieved by MixMHCpred, NetMHCpan-4.0 (EL), NetMHCpan-4.0 (BA) and MHCflurry, respectively. It is also comparable to EDGE, a recent deep learning-based model that is not yet publicly available, in predicting epitope presentation and selecting immunogenic cancer neoantigens. However, the simplicity and flexibility of EPIC make it much easier to apply in diverse situations, especially when users would like to take advantage of emerging eluted ligand data for new HLA alleles. We demonstrated this by generating MS data for the HCC4006 cell line and adding support for HLA-A*33:03, which has no previous MS or binding affinity data available, to EPIC. EPIC is...