2022
DOI: 10.1101/2022.12.07.519495
Preprint
TranceptEVE: Combining Family-specific and Family-agnostic Models of Protein Sequences for Improved Fitness Prediction

Abstract: Modeling the fitness landscape of protein sequences has historically relied on training models on family-specific sets of homologous sequences called Multiple Sequence Alignments. Many proteins are, however, difficult to align or have shallow alignments, which limits the potential scope of alignment-based methods. Not subject to these limitations, large protein language models trained on non-aligned sequences across protein families have achieved increasingly high predictive performance – but have not yet fully b…
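The abstract describes combining a family-specific model (trained on an MSA) with a family-agnostic protein language model. A minimal sketch of this kind of score ensembling is below; the depth-dependent weighting rule, function name, and parameters are illustrative assumptions, not the paper's exact scheme.

```python
# Hedged sketch: ensembling a family-specific model's variant scores
# (e.g. from an MSA-trained model) with a family-agnostic protein
# language model's scores. The weighting-by-MSA-depth rule below is an
# illustrative assumption, not the published TranceptEVE formula.

def combine_fitness_scores(plm_log_likelihoods, msa_model_log_likelihoods,
                           msa_depth, depth_threshold=100):
    """Return per-variant fitness scores as a weighted average of the two
    models' log-likelihood ratios (mutant vs. wild type).

    Intuition from the abstract: when the family's MSA is shallow (few
    homologs), lean on the family-agnostic language model; when it is
    deep, weight the family-specific model more heavily.
    """
    # Illustrative weight: grows with MSA depth, capped at 0.8.
    w_msa = min(0.8, msa_depth / (msa_depth + depth_threshold))
    return [
        w_msa * msa_score + (1.0 - w_msa) * plm_score
        for plm_score, msa_score in zip(plm_log_likelihoods,
                                        msa_model_log_likelihoods)
    ]
```

For example, with a deep alignment (`msa_depth=300`) the family-specific scores dominate: `combine_fitness_scores([-1.2, 0.3], [-0.8, 0.5], msa_depth=300)` weights them at 0.75.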

Cited by 75 publications (208 citation statements)
References 31 publications
“…Such dramatic changes in protein structure create challenges for predicting the effects of insertions from native structures computationally except in specific cases [71]. Deep learning approaches offer promise for predicting fitness directly from sequence [72–74]. However, they require larger insertion datasets for refinement and validation and to reveal principles governing insertion tolerance [75].…”
Section: Discussion
confidence: 99%
“…Deep learning approaches offer promise for predicting fitness directly from sequence [72–74]. However, they require larger insertion datasets for refinement and validation and to reveal principles governing insertion tolerance [75]. This can be contrasted with deep mutational scanning experiments that study the effects of amino acid substitutions [76].…”
Section: Discussion
confidence: 99%
“…We have focused on state-of-the-art fitness estimation methods which are trained on data from individual protein families [44,50,19]. Recently, large-scale generative sequence models ("protein language models") trained on more diverse datasets (containing proteins from many different families) have shown fitness estimation performance comparable to, and in some settings surpassing, single-family models [36,40,41]. Although applying our diagnostic test to these datasets requires further work, there is no reason to expect that the same limitations of density estimation do not hold for such models.…”
Section: Discussion
confidence: 99%
“…In this section we discuss the relevance and relationship of our results to large-scale "protein language models" such as ESM-1v [36], MSA Transformer [43], UniRep [1], Tranception [41], ProGen [33], ProGen2 [40] and others. Note the term "protein language model" is something of a misnomer; these methods are far from unique in applying and extending ideas from natural language processing (NLP) to build generative protein sequence models (Wavenet [50] and BEAR [2] being just two other examples).…”
Section: C.3 Proof of Theorem 4.1
confidence: 99%
“…For example, training a generative model on sequences from a target protein family has been used to generate functional variants (Costello and Martin 2019, Hawkins-Hooker et al 2021, Shin et al 2021). Unconditional sampling from protein language models trained on unaligned (Madani et al 2020, Ferruz et al 2022, Hesslow et al 2022, Lin et al 2022, Nijkamp et al 2022, Yang et al 2022) and aligned (Rao et al 2021, Notin et al 2022) sequences has recently been explored. While this is a promising new direction for protein engineering, our focus is on directed evolution from known proteins.…”
Section: Related Work
confidence: 99%