2020
DOI: 10.1101/2020.01.23.917682
Preprint

Low-N protein engineering with data-efficient deep learning

Abstract: Protein engineering has enormous academic and industrial potential. However, it is limited by the lack of experimental assays that are consistent with the design goal and sufficiently high-throughput to find rare, enhanced variants. Here we introduce a machine learning-guided paradigm that can use as few as 24 functionally assayed mutant sequences to build an accurate virtual fitness landscape and screen ten million sequences via in silico directed evolution. As demonstrated in two highly dissimilar proteins, …

Citations: cited by 97 publications (194 citation statements)
References: 79 publications
“…(2-3), maximizing the joint probability of sequence and structure in Eq. (1) is equivalent to maximizing the following objective: (4) P(structure) is a fixed distribution which depends only on the protein length and is generated only once at the beginning of simulations; f a PDB is fixed too; hence the optimization focuses on maximizing D_KL. The design procedure starts off with picking a random amino acid sequence of a given length L (L = 100 throughout the study), passing it through trRosetta and background networks and calculating the objective F according to Eq. (4).…”
Section: Methods (mentioning)
confidence: 99%
“…Deep learning methods have shown considerable promise in protein engineering. Networks with architectures borrowed from language models have been trained on amino acid sequences, and been used to generate new sequences without considering protein structure explicitly [4,5]. Other methods have been developed to generate protein backbones without consideration of sequence [6], and to identify amino acid sequences which either fit well onto specified backbone structures [7-9] or are conditioned on low-dimensional fold representation [10]; models tailored to generate sequences and/or structures for specific protein families have also been developed [11-14].…”
Section: Introduction (mentioning)
confidence: 99%
“…In addition to discovering highly functional variants, another benefit of this approach is the opportunity to learn from the numerous suboptimal variants. Machine learning algorithms trained to predict functional activity from protein sequence can assist in elucidating the biochemical determinants of function and predict additional sequences to test (Alley et al, 2019; Bedbrook et al, 2019; Biswas et al, 2020; Wu et al, 2019; Xu et al, 2020; Yang et al, 2018). To this end, the PyronicSF linker sequences were encoded as numerical vectors using the VHSE amino acid descriptor (8 principal components score vectors derived from hydrophobic, steric, and electronic properties) (Mei et al, 2005).…”
Section: Sort-seq Assay Of a Pyruvate Biosensor Linker Library (mentioning)
confidence: 99%
“…Successfully addressing this core problem promises to transform the field, leading to better proteins for industry and medicine at a fraction of the cost. A number of ML methods have been implemented to address this [1], including Gaussian process regression [2-5], unsupervised statistical analyses [6], deep neural networks and sequence models [7-11]. However, a uniformly used set of objectives and benchmarks against which each architecture can be evaluated is currently unavailable.…”
Section: Introduction (mentioning)
confidence: 99%