Disease diagnosis and treatment are challenging in part because diagnostic categories are misaligned with the underlying biology of disease. The evaluation of large-scale genomic experimental datasets is a compelling approach to refining the classification of biological concepts, such as disease. Well-established approaches, some of which rely on information theory or network analysis, quantitatively assess relationships among biological entities using gene annotations, structured vocabularies, and curated data sources. However, the gene annotations used in these evaluations are often sparse, potentially biased due to uneven study and representation in the literature, and constrained to the single species from which they were derived. To overcome the deficiencies inherent in the structure and sparsity of these annotated datasets, we developed a novel Network Enhanced Similarity Search (NESS) tool, which takes advantage of multi-species networks of heterogeneous data to bridge sparsely populated datasets. NESS employs a random walk with restart algorithm across harmonized multi-species data, effectively compensating for sparsely populated and noisy genomic studies. We further demonstrate that it is highly resistant to spurious or sparse datasets and recapitulates ground truth biological pathways significantly better than other similarity metrics alone. Furthermore, since NESS has been deployed as an embedded tool in the GeneWeaver environment, it can rapidly take advantage of curated multi-species networks to provide informative assertions of relatedness for any pair of biological entities or concepts, e.g., gene-gene, gene-disease, or phenotype-disease associations. NESS ultimately enables multi-species analysis applications to leverage model organism data to overcome the challenge of data sparsity in the study of human disease. Availability and Implementation: Implementation available at https://geneweaver.org/ness.
Source code freely available at https://github.com/treynr/ness.
Author summary
Finding consensus among large-scale genomic datasets is an ongoing challenge in the biomedical sciences. Harmonizing and analyzing such data is important because it allows researchers to mitigate the idiosyncrasies of experimental systems, alleviate study biases, and augment sparse datasets. Additionally, it allows researchers to utilize animal model studies and cross-species experiments to better understand biological function in health and disease. Here we provide a tool for integrating and analyzing heterogeneous functional genomics data using a graph-based model. We show how this type of analysis can be used to identify similar relationships among biological entities such as genes, processes, and disease through shared genomic associations. Our results indicate this approach is effective at reducing biases caused by sparse and noisy datasets. We show how this type of analysis can be used to aid the classification of gene function and prioritization of genes involved in substa...