2021
DOI: 10.48550/arxiv.2109.07925
Preprint
PDBench: Evaluating Computational Methods for Protein Sequence Design

Abstract: Proteins perform critical processes in all living systems: converting solar energy into chemical energy, replicating DNA, forming the basis of highly performant materials, sensing, and much more. While an incredible range of functionality has been sampled in nature, it accounts for a tiny fraction of the possible protein universe. If we could tap into this pool of unexplored protein structures, we could search for novel proteins with useful properties that we could apply to tackle the environmental and medical chall…

Cited by 2 publications (4 citation statements); references 22 publications.
“…CAFA [143] and CAGI [144]; however, they focus on scoring how in-silico tools predict known function from sequence, rather than their ability to infer proteins (sequences or structures) that perform a desired, sometimes non-naturally occurring function. Conversely, model developers score their tools with metrics like Natural Sequence Recovery (NSR), which validate a model’s ability to link structure to sequence, but often not its ability to generate diverse sequences fitting a desired structure [56]. A recent push in benchmarks scoring models’ ability to engineer proteins [99], [145], [146], [147], [148] highlights three aspects that protein design tools should strive to address: 1) emulate laboratory conditions, i.e., extrapolate from very little available data; 2) set multiple function-generalization goals, i.e., measure different aspects of function, with the intent of finding an optimal solution rather than maximizing any one metric; 3) design beyond the observed data distribution, i.e., design proteins that achieve functions not observed in nature.…”
Section: Discussion (citation type: mentioning)
confidence: 99%
“…The performance of these methods is usually evaluated by native sequence recovery (NSR), i.e., the percentage of wild-type amino acids recovered for an input sequence by the design method. While this metric has some limitations, given that the identity percentage does not necessarily correlate with expression or functional levels [56], it is nevertheless a convenient measure of how well the method recapitulates wild-type sequences. Some of the first attempts came from SPIN [57] and SPIN2 [58, p.…”
Section: The Deep Learning Era of Protein Sequence and Structure Gene… (citation type: mentioning)
confidence: 99%
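The NSR metric quoted above reduces to a simple per-position identity fraction between a designed sequence and the wild type. A minimal sketch follows; the function name and example sequences are hypothetical illustrations, not taken from PDBench:

```python
def native_sequence_recovery(designed: str, wild_type: str) -> float:
    """Fraction of positions where the designed sequence recovers
    the wild-type residue (native sequence recovery, NSR)."""
    if len(designed) != len(wild_type):
        raise ValueError("sequences must be aligned to equal length")
    matches = sum(d == w for d, w in zip(designed, wild_type))
    return matches / len(wild_type)

# 6 of 8 residues match (positions 6 and 8 differ) -> NSR = 0.75
print(native_sequence_recovery("MKTAYIAK", "MKTAYVAR"))
```

Note that, as the quoted statement cautions, a high NSR does not guarantee expression or function; two sequences can differ at many positions yet fold to the same backbone.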
“…While this metric has some limitations, given that the identity percentage does not necessarily correlate with expression or functional levels [53] and that common benchmark sequences are lacking, it is nevertheless a convenient measure of how well the method recapitulates wild-type sequences. Some of the first attempts came from SPIN [54] and SPIN2 [55, p. 2], which leveraged three-layered fully-connected neural networks (FNNs) to learn from structural features embedded as a 1-dimensional (1D) tensor representing backbone torsion angles, local fragment-derived profiles, and global energy-based features.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
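The SPIN-style architecture described in the statement above can be sketched as a small fully-connected network mapping a per-residue structural feature vector to a 20-way amino-acid distribution. The feature size, hidden width, and randomly initialised weights below are hypothetical placeholders standing in for trained parameters, not SPIN's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical sizes: features could pack torsion angles,
# fragment-derived profiles, and energy-based terms per residue.
n_features, n_hidden, n_aa = 30, 64, 20

# Random weights stand in for trained parameters.
W1 = rng.normal(0, 0.1, (n_features, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.1, (n_hidden, n_hidden));   b2 = np.zeros(n_hidden)
W3 = rng.normal(0, 0.1, (n_hidden, n_aa));       b3 = np.zeros(n_aa)

def predict_residue_probs(features: np.ndarray) -> np.ndarray:
    """Three fully-connected layers: structural features in,
    probability over the 20 amino acids out, per residue."""
    h1 = np.tanh(features @ W1 + b1)
    h2 = np.tanh(h1 @ W2 + b2)
    return softmax(h2 @ W3 + b3)

probs = predict_residue_probs(rng.normal(size=(5, n_features)))  # 5 residues
print(probs.shape)  # (5, 20); each row sums to 1
```

Picking the argmax residue at every position and comparing against the wild type is exactly what the NSR metric discussed earlier would then score.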