2021
DOI: 10.1038/s41467-021-26529-9

The generative capacity of probabilistic protein sequence models

Abstract: Potts models and variational autoencoders (VAEs) have recently gained popularity as generative protein sequence models (GPSMs) to explore fitness landscapes and predict mutation effects. Despite encouraging results, current model evaluation metrics leave unclear whether GPSMs faithfully reproduce the complex multi-residue mutational patterns observed in natural sequences due to epistasis. Here, we develop a set of sequence statistics to assess the “generative capacity” of three current GPSMs: the pairwise Pott…
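As context for the sequence statistics the abstract alludes to, below is a minimal Python sketch of one widely used check of generative capacity: comparing the pairwise covariances of model-generated sequences against those of the natural MSA. The function name and the specific comparison are illustrative assumptions, not the paper's code.

```python
import numpy as np

def pairwise_covariances(msa, q=21):
    """C_ij(a,b) = f_ij(a,b) - f_i(a) f_j(b) from an integer-encoded MSA
    of shape (N, L), entries in 0..q-1. Memory scales as L^2 q^2, so this
    plain-numpy version suits modest alignment lengths."""
    N, L = msa.shape
    onehot = np.eye(q)[msa]                                # (N, L, q)
    f1 = onehot.mean(axis=0)                               # f_i(a): (L, q)
    f2 = np.einsum('nia,njb->ijab', onehot, onehot) / N    # f_ij(a,b)
    C = f2 - np.einsum('ia,jb->ijab', f1, f1)
    C[np.arange(L), np.arange(L)] = 0.0                    # drop self-pairs i == j
    return C

# A simple generative-capacity readout: Pearson correlation between the
# covariances of natural and model-generated alignments (both integer-encoded).
# r = np.corrcoef(pairwise_covariances(natural_msa).ravel(),
#                 pairwise_covariances(generated_msa).ravel())[0, 1]
```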

Cited by 30 publications (42 citation statements) · References 71 publications
“…We note that, without language models, analyzing the correlations in MSAs can reveal evolutionary relatedness and sub-families [15], as well as collective modes of correlation, some of which are phylogenetic and some functional [18]. Furthermore, Potts models capture the clustered organization of protein families in sequence space [26], and the latent space of variational autoencoder models trained on sequences [38][39][40] qualitatively captures phylogeny [39]. Here, we demonstrated the stronger result that detailed pairwise phylogenetic relationships between sequences are quantitatively learnt by MSA Transformer.…”
Section: Discussion (mentioning)
confidence: 99%
“…We have employed a GPU implementation of the MCMC methodology, which makes it computationally tractable without resorting to more approximate inverse inference methods. The MCMC algorithm implemented on GPUs has been previously used to infer accurate Potts models in [17, 23, 35, 38, 54, 67].…”
Section: Methods (mentioning)
confidence: 99%
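For orientation, here is a minimal CPU sketch of the single-site Metropolis MCMC at the heart of such Potts samplers; the cited GPU implementations run many chains of this kind in parallel. The sign convention (P(S) ∝ exp(-E)) and the symmetric coupling-tensor layout are assumptions chosen for illustration, not the cited code.

```python
import numpy as np

def metropolis_potts(h, J, n_steps=100_000, q=21, seed=0):
    """Sample one sequence from P(S) ∝ exp(-E(S)) with
    E(S) = -sum_i h_i(s_i) - sum_{i<j} J_ij(s_i, s_j), via single-site
    Metropolis updates. h: (L, q); J: (L, L, q, q), symmetric
    (J[i, j, a, b] == J[j, i, b, a]) with zero diagonal (J[i, i] == 0)."""
    rng = np.random.default_rng(seed)
    L = h.shape[0]
    seq = rng.integers(q, size=L)
    cols = np.arange(L)
    for _ in range(n_steps):
        i = int(rng.integers(L))
        a_old, a_new = seq[i], int(rng.integers(q))
        if a_new == a_old:
            continue
        # Energy change from mutating site i alone: field term plus the
        # couplings of site i to every other site (an O(L) local update).
        dE = -(h[i, a_new] - h[i, a_old])
        dE -= (J[i, cols, a_new, seq] - J[i, cols, a_old, seq]).sum()
        if dE <= 0 or rng.random() < np.exp(-dE):
            seq[i] = a_new
    return seq
```

In practice one runs many such chains in parallel (hence the GPU implementations cited), discards a burn-in period, and thins the samples to reduce autocorrelation.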
“…Potts model and threaded-energy calculation. Our Potts Hamiltonian was constructed from an MSA of protein kinase catalytic domains as previously described 67. The Potts Hamiltonian H(S) takes the form…”
Section: Contact Frequency Differences (mentioning)
confidence: 99%
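The “threaded-energy calculation” mentioned here is, in essence, evaluating the Potts Hamiltonian on given aligned sequences. A minimal vectorized sketch, assuming integer-encoded sequences and the same (L, L, q, q) symmetric coupling layout as above (the layout and names are illustrative, not the cited paper's code):

```python
import numpy as np

def thread_energies(msa, h, J):
    """Threaded Potts energies H(S) = -sum_i h_i(s_i) - sum_{i<j} J_ij(s_i, s_j)
    for each row of an integer-encoded alignment msa of shape (N, L)."""
    N, L = msa.shape
    e = -h[np.arange(L), msa].sum(axis=1)     # field terms, one per sequence
    iu, ju = np.triu_indices(L, k=1)          # each pair i < j counted once
    e -= J[iu, ju, msa[:, iu], msa[:, ju]].sum(axis=1)
    return e
```

Differences in threaded energy, e.g. between a wild-type and a mutant sequence, or between sequences threaded onto models of different sub-families, are what such analyses typically compare.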
“…2. An additional 20 type-I and type-I½ ABFEs were calculated as part of our benchmarking procedure described in the Supplementary Information. Multiple Sequence Alignment (MSA) and Classification of Serine/Threonine vs Tyrosine Kinases. An MSA of 236,572 protein kinase catalytic domains with 259 columns was constructed as previously described 67. STKs and TKs were classified based on patterns of sequence conservation previously identified by Taylor and co-workers 62; characteristic sequence features of TKs and STKs which form their respective phosphoacceptor binding pockets are found at the HRD+2 (Ala or Arg in TKs, Lys in STKs) and HRD+4 (Arg in TKs, variable in STKs) positions in the catalytic loop, as well as the APE-2 residue (Trp in TKs, variable in STKs) in the activation loop.…”
(mentioning)
confidence: 99%
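As an illustration of the rule-based classification described in this excerpt, here is a hedged Python sketch. The alignment columns of the HRD and APE motifs in the 259-column MSA are not given in the excerpt, so `hrd_d_col` and `ape_a_col` are placeholder parameters, and the majority-vote logic is an assumption rather than the authors' exact procedure.

```python
def classify_kinase(row, hrd_d_col, ape_a_col):
    """Classify one aligned kinase catalytic domain as 'TK' or 'STK' from the
    conservation patterns quoted above. `row` is one MSA row as a string of
    one-letter codes; `hrd_d_col` / `ape_a_col` are the 0-based alignment
    columns of the HRD aspartate and the APE alanine (placeholders here).
    Offsets below assume HRD+n / APE-n count residues from those anchors."""
    hrd2 = row[hrd_d_col + 2]    # Ala or Arg in TKs, Lys in STKs
    hrd4 = row[hrd_d_col + 4]    # Arg in TKs, variable in STKs
    ape2 = row[ape_a_col - 2]    # Trp in TKs, variable in STKs
    if hrd2 == "K":
        return "STK"
    tk_votes = (hrd2 in ("A", "R")) + (hrd4 == "R") + (ape2 == "W")
    return "TK" if tk_votes >= 2 else "ambiguous"
```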