2018
DOI: 10.1101/337154
Preprint

Toward machine-guided design of proteins

Abstract: Proteins, the molecular machines that underpin all biological life, are of significant therapeutic and industrial value. Directed evolution is a high-throughput experimental approach for improving protein function, but it has difficulty escaping local maxima in the fitness landscape. Here, we investigate how supervised learning in a closed loop with DNA synthesis and high-throughput screening can be used to improve protein design. Using the green fluorescent protein (GFP) as an illustrative example, we demonstrate the …

Cited by 39 publications (56 citation statements)
References 17 publications
“…Unlike other approaches to representing proteins, namely as one-hot-encoded matrices as in Biswas et. al 2018 3 , RNNs produce fixed-length representations for arbitrary-length proteins by extracting the hidden state passed forward along a sequence. While padding to the maximum sequence length can in principle mitigate the problem of variable length sequences in a one hot encoding, it is ad-hoc, can add artifacts to training, wastes computation processing padding characters, and provides no additional information to a top model besides the naive sequence.…”
Section: Models and Training Details
mentioning
confidence: 99%
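To make the padding argument in the statement above concrete, here is a minimal sketch of one-hot encoding with zero padding to a maximum length. It is an illustration only, not the encoding used by any of the cited papers: the helper names `one_hot_padded` and `padding_fraction`, the ordering of the 20-letter alphabet, and the choice of all-zero rows for padding are all assumptions.

```python
# Hypothetical illustration: the alphabet ordering and all-zero padding rows
# are assumptions, not taken from the cited papers.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_padded(seq, max_len):
    """Encode seq as a max_len x 20 matrix (list of lists).

    Rows beyond len(seq) are padding and stay all-zero.
    """
    if len(seq) > max_len:
        raise ValueError("sequence is longer than max_len")
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, aa in enumerate(seq):
        matrix[pos][AA_INDEX[aa]] = 1
    return matrix


def padding_fraction(seqs, max_len):
    """Fraction of all encoded rows in a batch that are padding."""
    total_rows = max_len * len(seqs)
    used_rows = sum(len(s) for s in seqs)
    return (total_rows - used_rows) / total_rows
```

Every sequence ends up with the same matrix shape, which is what a fixed-size downstream model needs, but `padding_fraction` shows how much of a batch can be rows that carry no sequence information.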
“…Traditional approaches to protein engineering rely on random variation and screening/selection without modelling the relationship between sequence and function 1,2 . In contrast, rational engineering approaches seek to build quantitative models of protein properties, and use these models to more efficiently traverse the fitness landscape to overcome the challenges of directed evolution [3][4][5][6][7][8][9] . Such rational design requires a holistic and predictive understanding of structural stability and quantitative molecular function that has not been consolidated in a generalizable framework to date.…”
mentioning
confidence: 99%
“…Another strategy has been to assume that the observed phenotype is a simple non-linear function of some underlying nonepistatic trait [32,40], a pattern of epistasis known as univariate [8,24], non-specific [31] or global [40,41] epistasis, which appears to be well suited-primarily to sequence-function relationships that are essentially noised versions of single-peaked landscapes. Finally, a variety of machine-learning techniques [8,12,[42][43][44][45] have been employed that can fit more complex forms of epistasis than global epistasis or pairwise interaction models. However, these require substantial tuning and the resulting models exhibit behavior that is difficult to interpret.…”
Section: Introduction
mentioning
confidence: 99%
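The global epistasis idea described in the statement above, a phenotype that is a simple non-linear function of an underlying non-epistatic trait, can be sketched in a few lines. This is a toy illustration under stated assumptions: the sigmoid nonlinearity, the per-site effect values, and the function names `additive_trait` and `phenotype` are hypothetical and do not come from the cited work.

```python
import math


def additive_trait(seq, wild_type, effects):
    """Latent non-epistatic trait: sum of per-site effects of each mutation.

    effects maps (position, amino_acid) -> effect; missing entries count as 0.
    """
    return sum(
        effects.get((i, aa), 0.0)
        for i, (aa, wt) in enumerate(zip(seq, wild_type))
        if aa != wt
    )


def phenotype(phi, lo=0.0, hi=1.0):
    """Observed phenotype as a sigmoid of the latent trait (assumed form)."""
    return lo + (hi - lo) / (1.0 + math.exp(-phi))


# Made-up effects for two mutations on a three-residue toy sequence.
wild_type = "MKV"
effects = {(0, "A"): -1.0, (2, "G"): -2.0}

y_wt = phenotype(additive_trait("MKV", wild_type, effects))
y_a = phenotype(additive_trait("AKV", wild_type, effects))
y_g = phenotype(additive_trait("MKG", wild_type, effects))
y_ag = phenotype(additive_trait("AKG", wild_type, effects))

# What the double mutant would score if effects added on the phenotype scale.
y_naive = y_wt + (y_a - y_wt) + (y_g - y_wt)
```

The mutations combine additively on the latent scale, yet `y_ag` differs from `y_naive`: the apparent epistasis comes entirely from the shared nonlinearity, which is the situation the statement calls global epistasis.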
“…These models are now beginning to be combined with search heuristics and high-throughput assays to forward-engineer DNA and protein sequences (Rocklin et. al., 2017, Biswas et. al., 2018; Sample et.…”
mentioning
confidence: 99%