Synthetic biology often involves engineering microbial strains to express high-value proteins. Thanks to progress in rapid DNA synthesis and sequencing, deep learning has emerged as a promising approach for building sequence-to-expression models for strain optimization. But such models require large and costly training datasets that create steep entry barriers for many laboratories. Here we study the relation between accuracy and data efficiency in an atlas of machine learning models trained on datasets of varied size and sequence diversity. We show that deep learning can achieve good prediction accuracy with much smaller datasets than previously thought. We demonstrate that controlled sequence diversity leads to substantial gains in data efficiency, and we employ Explainable AI to show that convolutional neural networks can finely discriminate between input DNA sequences. Our results provide guidelines for designing genotype-phenotype screens that balance the cost and quality of training data, thus helping promote the wider adoption of deep learning in the biotechnology sector.
Synthetic gene circuits perturb the physiology of their cellular host. The extra load on endogenous processes shifts the equilibrium of resource allocation in the host, leading to slow growth and reduced biosynthesis. Here we built integrated host-circuit models to quantify growth defects caused by synthetic gene circuits. Simulations reveal a complex relation between circuit output and cellular capacity for gene expression. For weak induction of heterologous genes, protein output can be increased at the expense of growth defects. Yet for stronger induction, cellular capacity reaches a tipping point, beyond which both gene expression and growth rate drop sharply. Extensive simulations across various growth conditions and large regions of the design space suggest that the critical capacity is a result of ribosomal scarcity. We studied the impact of growth defects on various gene circuits and transcriptional logic gates, which highlights the extent to which cellular burden can limit, shape and even break down circuit function. Our approach offers a comprehensive framework to assess the impact of host-circuit interactions in silico, with wide-ranging implications for the design and optimization of bacterial gene circuits.
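The tipping-point behavior described in this abstract can be illustrated with a toy proteome-partition calculation. This is not the authors' integrated host-circuit model: the function below, its parameter names (`phi_fixed`, `lam_max`), and the linear growth-rate relation are all simplifying assumptions, chosen only to show qualitatively how diverting a shared translational budget to a heterologous protein first raises output at the cost of growth, then collapses both.

```python
# Toy proteome-partition sketch of circuit burden (NOT the authors' model):
# a fixed translational budget is split between host growth and a
# heterologous protein, so stronger induction diverts capacity from growth.

def steady_state(induction, phi_fixed=0.45, lam_max=2.0):
    """Return (heterologous output, growth rate) for a given induction level.

    induction : fraction of the flexible proteome claimed by the circuit (0-1)
    phi_fixed : proteome fraction reserved for housekeeping proteins (assumed)
    lam_max   : growth rate with zero circuit load, in 1/h (assumed)
    """
    flexible = 1.0 - phi_fixed                    # allocatable proteome share
    phi_circuit = min(max(induction, 0.0), 1.0) * flexible
    # assumed growth law: growth falls linearly as the circuit claims capacity
    growth = lam_max * (flexible - phi_circuit) / flexible
    # steady-state protein level ~ synthesis share times growth-dependent
    # dilution term; beyond a critical induction both factors shrink together
    output = phi_circuit * growth
    return output, growth
```

Sweeping `induction` from 0 to 1 with this sketch reproduces the qualitative shape reported in the abstract: output rises at low induction despite a growth defect, peaks at a critical capacity, and then drops sharply together with growth rate.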
Recent progress in laboratory automation has enabled rapid and large-scale characterization of strains engineered to express heterologous proteins, paving the way for the use of machine learning to optimize production phenotypes. The ability to predict protein expression from DNA sequence promises to deliver large efficiency gains and reduced costs for strain design. Yet it remains unclear which models are best suited for this task, or what size of training dataset is required for accurate prediction. Here we trained and compared thousands of predictive models of protein expression from sequence, using a large screen of Escherichia coli strains with varying levels of GFP expression. We considered models of increasing complexity, from linear regressors to convolutional neural networks, trained on datasets of variable size and sequence diversity. Our results highlight trade-offs among prediction accuracy, data diversity, and DNA encoding methods. We provide robust evidence that deep neural networks can outperform classic models with the same amount of training data, achieving prediction accuracy above 80% when trained on approximately 2,000 sequences. Using techniques from Explainable AI, we show that deep learning models capture sequence elements that are known to correlate with expression, such as the stability of mRNA secondary structure. Our results lay the groundwork for the more widespread adoption of deep learning for strain engineering across the biotechnology sector.
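The core input pipeline behind the sequence-to-expression models above can be sketched in a few lines: one-hot encode a DNA sequence into a position-by-base matrix, then scan it with a convolutional filter, the elementary operation a CNN regressor stacks and trains. This is a minimal illustration, not the paper's architecture; the `TATA` motif filter and the example sequence are hypothetical, and real models learn their filter weights from data.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (len, 4) one-hot matrix."""
    idx = {b: i for i, b in enumerate(BASES)}
    mat = np.zeros((len(seq), 4))
    for pos, base in enumerate(seq):
        mat[pos, idx[base]] = 1.0
    return mat

def conv1d(x, kernel):
    """Valid 1-D convolution of a one-hot sequence with a (k, 4) motif filter."""
    k = kernel.shape[0]
    return np.array([np.sum(x[i:i + k] * kernel) for i in range(len(x) - k + 1)])

# Illustrative filter matching the motif "TATA"; the peak score marks where
# the motif occurs in the input sequence.
scores = conv1d(one_hot("GGTATACC"), one_hot("TATA"))
```

One-hot encoding is one of the "DNA encoding methods" the abstract refers to; alternatives such as k-mer counts trade positional resolution for lower dimensionality, which is one source of the accuracy/encoding trade-offs reported.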