2023
DOI: 10.1101/2023.12.07.570727
Preprint

ProteinGym: Large-Scale Benchmarks for Protein Design and Fitness Prediction

Pascal Notin,
Aaron W. Kollasch,
Daniel Ritter
et al.

Abstract: Predicting the effects of mutations in proteins is critical to many applications, from understanding genetic disease to designing novel proteins that can address our most pressing challenges in climate, agriculture and healthcare. Despite a surge in machine learning-based protein models to tackle these questions, an assessment of their respective benefits is challenging due to the use of distinct, often contrived, experimental datasets, and the variable performance of models across different protein families. …

citations
Cited by 81 publications
(32 citation statements)
references
References 223 publications
“…Using this data, we applied the GEMME model, which evaluates both individual residue and residue-pair conservation to calculate evolutionary distance scores (ΔE) 28 and which has previously been shown to produce state-of-the-art predictions of variant effects. 45 Within this framework, neutral (WT-like) amino acid substitutions that are compatible with the alignments and either do not or only minimally impact protein stability and function have ΔE scores near zero. In contrast, substitutions with large negative ΔE scores are incompatible with the sequence alignment and are likely to be unfavorable.…”
Section: ■ Results and Discussion (mentioning)
confidence: 99%
“…S4), though the appropriate type of non-linear model to use is an active area of study 54,77,78 . We hope that our fitness landscape dataset will contribute to the development of improved methods for protein fitness prediction 53,76 .…”
Section: Discussion (mentioning)
confidence: 99%
“…Our ML-guided design framework (Figure 1) is largely complementary to ongoing advancements in the modeling of protein function. For example, protein language models 69,70,77,[86][87][88][89][90] , could be used in Prosar+Screen to filter multi-mutant variants or in MBO-DNN as proposal distributions or fitness predictors 53,75 . Prior knowledge derived from physics 91 , protein structure 89,92 , or experimental observations from related campaigns 93,94 could also be used to further improve the ability of models to extrapolate to more distant sequences accurately.…”
Section: Discussion (mentioning)
confidence: 99%
“…We chose to compare our methods to the following five state-of-the-art methods on the ProteinGym test set. GEMME (Laine, Karami, and Carbone 2019) as it is (1) tied for the best performing method on ProteinGym (Notin et al 2023), (2) a purely MSA-based method not using machine learning, and (3) was used to annotate VespaG's training data. TranceptEVE L (Notin, Niekerk, et al 2022) as it is the best performing method on ProteinGym next to GEMME, and because it is a hybrid model, making use of both MSAs and pLM embeddings as input, combining the previously developed autoregressive Tranception (Notin, Dias, et al 2022) with the Bayesian variational autoencoder EVE (Frazer et al 2021).…”
Section: Comparison To State-of-the-art Methods (mentioning)
confidence: 99%
“…The log-odds ratios of the amino acid probabilities computed from the pretext reconstruction task already provide reasonably accurate estimates of variant effects (Meier et al 2021). Nevertheless, they are not competitive with state-of-the-art (SOTA) methods (Notin et al 2023).…”
Section: Related Work (mentioning)
confidence: 99%
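
The last citation statement above describes scoring a variant by the log-odds ratio of amino acid probabilities. The following is a minimal sketch of that idea, not the implementation of any cited method: the function name `log_odds_score` and the probability values are hypothetical, and a real protein language model would supply the per-position probabilities.

```python
import math


def log_odds_score(probs, wild_type, mutant):
    """Return log p(mutant) - log p(wild_type) at a single position.

    `probs` maps amino acid letters to predicted probabilities at that
    position. Scores near zero suggest a wild-type-like substitution;
    strongly negative scores suggest an unfavorable one.
    """
    return math.log(probs[mutant]) - math.log(probs[wild_type])


# Hypothetical probabilities for one position (illustrative values only).
position_probs = {"A": 0.60, "G": 0.30, "W": 0.10}

neutral_like = log_odds_score(position_probs, wild_type="A", mutant="G")
disruptive_like = log_odds_score(position_probs, wild_type="A", mutant="W")
```

Under these assumed probabilities, the substitution to the less likely residue receives the more negative score, which is the ordering the citation statement relies on.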