Data and scripts for "Evaluation of 244,000 synthetic sequences reveals design principles to optimize translation in Escherichia coli"

Guillaume

doi:10.17605/osf.io/a56vu

2022

Cited by 1 publication

References 0 publications

Accuracy and data efficiency in deep learning models of protein expression

et al. 2022

Self Cite

Synthetic biology often involves engineering microbial strains to express high-value proteins. Thanks to progress in rapid DNA synthesis and sequencing, deep learning has emerged as a promising approach to build sequence-to-expression models for strain optimization. But such models need large and costly training data, which creates steep entry barriers for many laboratories. Here we study the relation between accuracy and data efficiency in an atlas of machine learning models trained on datasets of varied size and sequence diversity. We show that deep learning can achieve good prediction accuracy with much smaller datasets than previously thought. We demonstrate that controlled sequence diversity leads to substantial gains in data efficiency, and we employ explainable AI to show that convolutional neural networks can finely discriminate between input DNA sequences. Our results provide guidelines for designing genotype-phenotype screens that balance cost and quality of training data, thus helping promote the wider adoption of deep learning in the biotechnology sector.
