A single gene can encode different protein versions through a process called alternative splicing. Since proteins play major roles in cellular functions, aberrant splicing profiles can result in a variety of diseases, including cancers. Alternative splicing is determined by the gene's primary sequence and other regulatory factors such as RNA-binding protein levels. With these as input, we formulate the prediction of RNA splicing as a regression task and build a new training dataset (CAPD) to benchmark learned models. We propose the discrete compositional energy network (DCEN), which leverages the hierarchical relationships between splice sites, junctions and transcripts to approach this task. In the case of alternative splicing prediction, DCEN models mRNA transcript probabilities through the energy values of their constituent splice junctions. These transcript probabilities are subsequently mapped to relative abundance values of key nucleotides and trained with ground-truth experimental measurements. Through our experiments on CAPD, we show that DCEN outperforms baselines and ablation variants.
CCS CONCEPTS: • Applied computing → Bioinformatics; Health informatics; • Computing methodologies → Neural networks.
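The compositional idea in the abstract above (junction energies, then transcript probabilities, then relative abundance of key nucleotides) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that a transcript's energy is the sum of its junctions' energies, that probabilities follow a softmax over negative energies (lower energy means more likely), and that a splice site's relative abundance is the total probability of the transcripts that use it. The function names, coordinates and energy values are hypothetical.

```python
import math

def transcript_probabilities(junction_energy, transcripts):
    """Softmax over negative transcript energies.

    Assumption: a transcript's energy is the sum of the energies of its
    constituent junctions, and lower energy means higher probability.
    """
    energies = [sum(junction_energy[j] for j in t) for t in transcripts]
    lowest = min(energies)  # shift by the minimum for numerical stability
    weights = [math.exp(-(e - lowest)) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]

def site_abundance(transcripts, probs):
    """Relative abundance of each splice site.

    Assumption: a site's abundance is the summed probability of all
    transcripts containing a junction that uses that site.
    """
    usage = {}
    for transcript, p in zip(transcripts, probs):
        sites = {site for junction in transcript for site in junction}
        for site in sites:
            usage[site] = usage.get(site, 0.0) + p
    return usage

# Hypothetical exon-skipping gene: junctions are (donor, acceptor) positions.
incl_up, incl_down, skip = (100, 200), (300, 400), (100, 400)
junction_energy = {incl_up: 0.5, incl_down: 0.5, skip: 1.0}  # made-up values
transcripts = [[incl_up, incl_down], [skip]]  # inclusion vs. skipping isoform

probs = transcript_probabilities(junction_energy, transcripts)
usage = site_abundance(transcripts, probs)
```

In this toy case both isoforms have the same total energy, so they are equally likely. The sites flanking the skipped exon (200 and 300) then have half the abundance of the constitutive sites (100 and 400). In a trained model the energies would come from a learned network rather than a fixed dictionary.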
The human DNA sequence determines cellular fate through transcription to RNA and translation to proteins. DNA and RNA undergo extensive processing in cells based on the sequence and cellular state, where alternative splicing in particular determines RNA isoform choice. In recent years, in addition to the sequence of a nucleic acid, its structure and epigenetic landscape have been shown to play important roles in cellular functions. For example, DNA and RNA G-quadruplex (G4) structures were found to affect oncogene expression and to be attractive therapeutic targets. This thesis mainly focuses on DNA and RNA G4 structure formation and RNA splicing in cellular contexts.
First, I analyse the formation of irregular G4-forming motifs using reported experimental data on DNA G4 in cells. I find possible correlations of DNA G4 formation with contextual epigenetic features using neural networks and propose a deep learning-based method for G4 prediction in cells. Motivated by the scarcity of RNA G4 probing methods, I additionally propose a method for detecting RNA G-quadruplexes in long RNA with direct nanopore sequencing. Contextual machine learning is further applied to predict alternative RNA splicing in cells using RNA-binding protein (RBP) levels. In summary, I have developed deep learning methods for prediction of G4 structure formation and RNA splicing in cells, which will help to advance our understanding of DNA/RNA structure and processing in different cellular contexts.
G-quadruplexes (G4s) are secondary structures abundant in DNA that may play regulatory roles in cells. Despite the ubiquity of putative G-quadruplex sequences (PQS) in the human genome, only a small fraction form secondary structures in cells. Folded G4s, histone methylation and chromatin accessibility are all parts of the complex cis-regulatory landscape. We propose an approach for G4 formation prediction in cells that incorporates epigenetic and chromatin accessibility data. The novel approach, termed epiG4NN, efficiently predicts cell-specific G4 formation in live cells based on a local epigenomic snapshot. Our architecture confirms the close relationship between H3K4me3 histone methylation, chromatin accessibility and G4 structure formation. Trained on A549 cell data, epiG4NN was then able to predict G4 formation in HEK293T and K562 cell lines. We observe that model performance with different epigenetic features depends on the underlying experimental conditions of G4 detection. We expect that this approach will contribute to the systematic understanding of correlations between the structural and epigenomic feature landscapes.