Predicting effects of gene regulatory elements (GREs) is a longstanding challenge in biology. Machine learning may address this, but requires large datasets linking GREs to their quantitative function. However, experimental methods to generate such datasets are either application-specific or technically complex and error-prone. Here, we introduce DNA-based phenotypic recording as a widely applicable, practicable approach to generate large-scale sequence-function datasets. We use a site-specific recombinase to directly record a GRE's effect in DNA, enabling readout of both sequence and quantitative function for extremely large GRE-sets via next-generation sequencing. We record translation kinetics of over 300,000 bacterial ribosome binding sites (RBSs) in >2.7 million sequence-function pairs in a single experiment. Further, we introduce a deep learning approach employing ensembling and uncertainty modelling that predicts RBS function with high accuracy, outperforming state-of-the-art methods. DNA-based phenotypic recording combined with deep learning represents a major advance in our ability to predict function from genetic sequence.
Predicting quantitative effects of gene regulatory elements (GREs) on gene expression is a longstanding challenge in biology. Machine learning models for gene expression prediction may be able to address this challenge, but they require experimental datasets that link large numbers of GREs to their quantitative effect. However, current methods to generate such datasets experimentally are either restricted to specific applications or limited by their technical complexity and error-proneness. Here we introduce DNA-based phenotypic recording as a widely applicable and practical approach to generate very large datasets linking GREs to quantitative functional readouts of high precision, temporal resolution, and dynamic range, solely relying on sequencing. This is enabled by a novel DNA architecture comprising a site-specific recombinase, a GRE that controls recombinase expression, and a DNA substrate modifiable by the recombinase. Both GRE sequence and substrate state can be determined in a single sequencing read, and the frequency of modified substrates amongst constructs harbouring the same GRE is a quantitative, internally normalized readout of this GRE's effect on recombinase expression. Using next-generation sequencing, the quantitative expression effect of extremely large GRE sets can be assessed in parallel. As a proof of principle, we apply this approach to record translation kinetics of more than 300,000 bacterial ribosome binding sites (RBSs), collecting over 2.7 million sequence-function pairs in a single experiment. Further, we generalize from these large-scale ...

Recent progress in DNA sequencing and synthesis has facilitated reading and (re-)writing of the genetic makeup of biological systems on a massive scale [1,2]. Despite this progress, the relationship between a genetic sequence and its functional properties is poorly understood, and thus the question "what to write" remains largely unanswered [3,4]. Since the number of possible sequences scales exponentially with their length, the theoretical sequence space cannot be exhaustively explored by experiments, even for small GREs [5-7]. Therefore, innovative high-throughput (HTP) approaches are required that allow the collection of a quantitative functional readout for large numbers of genetic sequences [7,8]. At the same time, novel methods are required that identify statistical patterns and dependencies in the resulting datasets to generate models that accurately predict the properties of untested sequences. Deep learning maximizes the benefit of data collection at large scale owing to its ability to capture complex, nonlinear dependencies and to its computational scalability [9], which led to several successful applications in computational biology, from genomics to proteomics [10-15]. These methods promise to model sequence-function dependencies with minimal prior assumptions, provided that large experimental training datasets that link sequence to quantitative measure ...
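The readout described in the abstract above, namely the frequency of modified substrates among reads that share the same GRE, can be illustrated with a minimal sketch. It assumes each sequencing read has already been reduced to a pair of GRE sequence and a boolean substrate state; the function name `flipped_fraction`, the input format, and the toy reads are hypothetical and are not taken from the authors' pipeline.

```python
from collections import defaultdict


def flipped_fraction(reads):
    """Return {gre: (fraction_modified, n_reads)} for (gre, modified) reads.

    Hypothetical helper: `reads` is an iterable of (gre_sequence, bool)
    pairs, where the bool says whether the substrate in that read was
    found in its modified state.
    """
    counts = defaultdict(lambda: [0, 0])  # gre -> [modified, total]
    for gre, modified in reads:
        counts[gre][0] += int(modified)
        counts[gre][1] += 1
    # The fraction is normalized by the read count of the same GRE.
    return {gre: (m / n, n) for gre, (m, n) in counts.items()}


# Toy data (invented for illustration only).
reads = [
    ("AGGAGG", True), ("AGGAGG", True), ("AGGAGG", False), ("AGGAGG", True),
    ("TTTCCC", False), ("TTTCCC", False), ("TTTCCC", True), ("TTTCCC", False),
]
result = flipped_fraction(reads)
for gre, (fraction, n) in sorted(result.items()):
    print(gre, fraction, n)
```

Returning the read count alongside the fraction keeps track of how many reads support each estimate, which matters because a fraction computed from few reads is less reliable than one computed from many.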
Motivation: Gaining a comprehensive understanding of the genetics underlying cancer development and progression is a central goal of biomedical research. Its accomplishment promises key mechanistic, diagnostic and therapeutic insights. One major step in this direction is the identification of genes that drive the emergence of tumors upon mutation. Recent advances in the field of computational biology have shown the potential of combining genetic summary statistics that represent the mutational burden in genes with biological networks, such as protein–protein interaction networks, to identify cancer driver genes. Those approaches superimpose the summary statistics on the nodes in the network, followed by an unsupervised propagation of the node scores through the network. However, this unsupervised setting does not leverage any knowledge on well-established cancer genes, a potentially valuable resource to improve the identification of novel cancer drivers.

Results: We develop a novel node embedding that enables classification of cancer driver genes in a supervised setting. The embedding combines a representation of the mutation score distribution in a node’s local neighborhood with network propagation. We leverage the knowledge of well-established cancer driver genes to define a positive class, resulting in a partially labeled dataset, and develop a cross-validation scheme to enable supervised prediction. The proposed node embedding followed by a supervised classification improves the predictive performance compared with baseline methods and yields a set of promising genes that constitute candidates for further biological validation.

Availability and implementation: Code available at https://github.com/BorgwardtLab/MoProEmbeddings.

Supplementary information: Supplementary data are available at Bioinformatics online.
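The unsupervised propagation of node scores mentioned under Motivation can be sketched with a generic random walk with restart. This is a standard technique chosen here for illustration, not necessarily the scheme used by the approaches the abstract refers to or by the MoProEmbeddings code. The function `propagate`, the `restart` parameter, and the four-gene toy network are all assumptions, and the sketch assumes an undirected graph in which every node has at least one neighbour.

```python
def propagate(adjacency, scores, restart=0.5, n_iter=50):
    """Random walk with restart on an undirected graph (illustrative only).

    adjacency: dict mapping node -> set of neighbouring nodes
    scores:    dict mapping node -> initial (e.g. mutation) score
    Assumes every node has at least one neighbour.
    """
    current = dict(scores)
    for _ in range(n_iter):
        updated = {}
        for node in adjacency:
            # Each neighbour spreads its current score evenly over its edges.
            incoming = sum(
                current[nb] / len(adjacency[nb]) for nb in adjacency[node]
            )
            updated[node] = restart * scores[node] + (1 - restart) * incoming
        current = updated
    return current


# Toy path network A - B - C - D with all initial score on gene A.
adjacency = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C"},
}
scores = {"A": 1.0, "B": 0.0, "C": 0.0, "D": 0.0}
propagated = propagate(adjacency, scores)
for node in sorted(propagated):
    print(node, round(propagated[node], 4))
```

In this toy example the propagated score decreases with distance from the scored gene while the total score is conserved, which is the smoothing behaviour that lets a gene inherit signal from mutated neighbours.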