<p><b>Genome-wide association analyses (GWAS) studies based on frequentist statistics have often proven ineffective in deriving biological insights from sequencing data. These GWAS lack the machinery to safeguard against technical noise inherent to high throughput sequencing platforms and are not conceptually designed for processing large sets of high-dimensional genomic data. However, such shortcomings are not peculiar to GWAS and have been studied in other fields of science, such as signal processing and computer science, for a long time. In particular, machine learning techniques, especially deep learning models, have proven highly successful in dealing with noisy high-dimensional data. Recently it has been shown that these techniques can be effective for handling genomic data even when directly transferred from modern computer vision and natural language processing applications. </b></p>
<p>This thesis builds off the existing suites of such methodologies and presents a robust computational pipeline to functionally annotate whole-genome sequencing data. Moreover, it discusses and presents a data solution to efficiently process the large, heterogeneous datasets required for such analyses. The main objective of this thesis is to put forward a solution to identify variants that modify disease-causing mutations of complex heritable diseases. This is not a trivial problem given that the current gold standard approach, GWAS methodology, suffers not only from the drawbacks just described but is also underpowered by multiple testing (not useful for rare diseases) and fails to account for the epistatic nature of genetic interactions responsible for the onset and manifestation of complex diseases.</p>
<p>Here, a set of cell-specific Gene Regulatory Networks (GRNs) inferred from dynamic genomic data was constructed. Most attempts to construct GRNs delineating such complex interactions relied on combining non-standardized high-throughput static datasets that contained false positive interactions and missing data points without insights into cell developmental states. To illuminate these intricate dynamic regulatory interconnections of the genome, specific to a tissue or a cell type, the Non-Stiff Dynamic Invertible Model of CO-Regulatory Networks (NS-DIMCORN) that allows unrestricted neural network architectures (to accommodate arbitrary depth increase for larger sets of genes) and training without partitioning the data dimensions was developed. NS-DIMCORN was trained on not-homogenized bulk tissue-specific RNA-seq and single-cell RNA-seq as a surrogate for cells’ continuous developmental states and modeled these highly dynamic systems with a set of ordinary differential equations. NS-DIMCORN yielded a continuous-time invertible generative model with unbiased density estimation only from RNA-seq read-count data and allowed time-flexible sampling of each gene’s expression level for ab initioassembly of genes regulatory network of specific cells.</p>
<p>Secondly, Precise Graph-based Genome-Wide Annotation Sofware (PG-GWAS) was developed. For this purpose, embedding was used to map genomic variables to a vector of continuous numbers. Thus, each genomic variant was assigned a unique contextualized score that encoded the likelihood of effects on its respective gene products. These scores were pan-genomic by constructing a k-mer representation of all the haplotypes, independent of any “reference genome,” and were based only on each variant’s evolutionary constraints. Next, a graph representation of individuals’ genomes was constructed that integrated genomic variation scores, tissue-specific gene-gene interaction, and regulatory networks (assembled from GRNs) to allow the study of the genomic variants in aggregate and accounting for epistasis. Utilizing the Graph Attention mechanism identified these networks’ most critical interactions and allowed annotating the entire whole-genome graphs to determine the most prominent genomic features (i.e., groups of interacting genes) within each genome that could be responsible for different symptoms and onset in patients with the same disease-causing mutations. Eventually, to demonstrate the efficacy of this approach, PG-GWAS was tested on new sets of sequencing data, where the result improved in standard GWAS and provided insight into disease epistasis.</p>