2022 | Preprint
DOI: 10.1101/2022.03.29.486219

Protein language models trained on multiple sequence alignments learn phylogenetic relationships

Abstract: Self-supervised neural language models with attention have recently been applied to biological sequence data, advancing structure, function and mutational effect prediction. Some protein language models, including MSA Transformer and AlphaFold's EvoFormer, take multiple sequence alignments (MSAs) of evolutionarily related proteins as inputs. Simple combinations of MSA Transformer's row attentions have led to state-of-the-art unsupervised structural contact prediction. We demonstrate that similarly simple, and …
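The "simple combinations" mentioned in the abstract refer to symmetrized, background-corrected averages of the model's row attention maps. Below is a minimal sketch of that idea, assuming the fair-esm package; the model name, the `row_attentions` output key and its (batch, layers, heads, columns, columns) shape are assumptions about that library, and a plain layer/head average stands in for the trained logistic regression over attention maps used in the published pipeline.

```python
# Minimal sketch: unsupervised contact prediction from MSA Transformer row
# attentions (fair-esm assumed; a toy 3-sequence MSA stands in for a real one).
import torch
import esm

model, alphabet = esm.pretrained.esm_msa1b_t12_100M_UR50S()
model.eval()
batch_converter = alphabet.get_batch_converter()

msa = [  # (name, aligned sequence) rows of one MSA; real inputs have hundreds of rows
    ("seq1", "MKTAYIAKQR"),
    ("seq2", "MKTAYLAKQR"),
    ("seq3", "MRTAYIAKHR"),
]
_, _, tokens = batch_converter([msa])  # (1, num_rows, L + 1), one start token per row

with torch.no_grad():
    out = model(tokens, need_head_weights=True)

# Assumed shape (batch, layers, heads, L + 1, L + 1); drop the start-token row/column.
attn = out["row_attentions"][0, :, :, 1:, 1:]
attn = attn.mean(dim=(0, 1))      # simple combination: average over layers and heads
attn = 0.5 * (attn + attn.T)      # symmetrize

# Average product correction (APC), a standard background correction for contact maps.
apc = attn.mean(dim=0, keepdim=True) * attn.mean(dim=1, keepdim=True) / attn.mean()
contact_scores = attn - apc

# Highest-scoring residue pairs are the predicted contacts.
top_pairs = torch.topk(contact_scores.flatten(), k=10).indices
```

fair-esm also ships a trained contact head (`model.predict_contacts`), which is closer to the published combination than the plain average used in this sketch.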

Cited by 6 publications (16 citation statements) | References 62 publications

Citation statements (ordered by relevance):
“…We showed in another work that they are even more resilient to phylogenetic noise than Potts model [43]. Taken together, the present results and those of [43] demonstrate the crucial importance of disentangling phylogenetic correlations from functional ones for contact prediction performance. While phylogenetic correlations are an issue for structure prediction, they are nevertheless an important and helpful signal for the inference of interaction partners among the paralogs of two protein families [32, 44, 45].…”
Section: Discussion (supporting)
confidence: 80%
“…The coevolution-driven approach was recently experimentally validated in the case of bmDCA Potts models, which capture pairwise coevolution patterns in MSAs [13], and for variational autoencoders [14,15]. Protein language models trained on multiple sequence alignments provide state-of-the-art unsupervised contact prediction and are able to capture coevolutionary patterns in their tied row attentions [25], and capture phylogenetic relationships in column attentions [53]. This makes them ideal candidates to generate new protein sequences from given families.…”
Section: Discussion (mentioning)
confidence: 99%
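The column attentions mentioned in this excerpt are the signal that the cited work [53] links to phylogeny. As a hedged continuation of the row-attention sketch above (it reuses `out` and `tokens` from there), the snippet below averages column attentions into a sequence-by-sequence matrix and correlates it with pairwise Hamming distances; the `col_attentions` key and its (batch, layers, heads, columns, rows, rows) shape are assumptions about fair-esm's output, and the Spearman comparison is just one simple way to probe the relationship.

```python
# Sketch: do averaged column attentions track pairwise Hamming distances?
# Reuses `out` and `tokens` from the row-attention sketch above; with the toy
# 3-row MSA the statistics are meaningless, so use a real alignment in practice.
import torch
from scipy.stats import spearmanr

col_attn = out["col_attentions"][0]         # assumed (layers, heads, cols, rows, rows)
attn_matrix = col_attn.mean(dim=(0, 1, 2))  # average -> (rows, rows), one entry per sequence pair

seqs = tokens[0, :, 1:]                     # token IDs per row, start token dropped
hamming = (seqs[:, None, :] != seqs[None, :, :]).float().mean(dim=-1)

# Compare off-diagonal entries only (each unordered pair of sequences once).
n = seqs.shape[0]
iu = torch.triu_indices(n, n, offset=1)
rho, _ = spearmanr(attn_matrix[iu[0], iu[1]].numpy(), hamming[iu[0], iu[1]].numpy())
print(f"Spearman rho (column attention vs. Hamming distance): {rho:.2f}")
```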
“…To sample independent equilibrium sequences from Potts models, we used the strategy described in Ref. [53]. Specifically, we fitted Potts models on each of our natural MSAs using bmDCA [12] (https://github.com/ranganathanlab/bmDCA) with its default hyperparameters.…”
Section: Sampling Sequences From Potts Models (mentioning)
confidence: 99%
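The excerpt above only names the tool (bmDCA with default hyperparameters). As a generic illustration of what "sampling independent equilibrium sequences from a Potts model" means, here is a minimal Metropolis sampler over the Potts energy H(s) = -Σ_i h_i(s_i) - Σ_{i<j} J_ij(s_i, s_j); the array layout of h and J, the chain length, and the sampler itself are illustrative assumptions, not the bmDCA implementation.

```python
# Generic Metropolis sampler for a Potts model with fields h (L x q) and
# couplings J (L x L x q x q); one independent chain per output sequence.
import numpy as np

def potts_energy(seq, h, J):
    """H(s) = -sum_i h_i(s_i) - sum_{i<j} J_ij(s_i, s_j)."""
    L = len(seq)
    e = -h[np.arange(L), seq].sum()
    for i in range(L):
        for j in range(i + 1, L):
            e -= J[i, j, seq[i], seq[j]]
    return e

def sample_sequences(h, J, n_seqs=10, n_steps=10_000, q=21, rng=None):
    """Draw n_seqs sequences, each from its own chain started at random."""
    if rng is None:
        rng = np.random.default_rng()
    L = h.shape[0]
    samples = []
    for _ in range(n_seqs):
        seq = rng.integers(q, size=L)
        e = potts_energy(seq, h, J)
        for _ in range(n_steps):
            i, a = rng.integers(L), rng.integers(q)  # propose a single-site mutation
            new = seq.copy()
            new[i] = a
            # A real sampler would compute the energy change of the mutated
            # site only, instead of re-evaluating the full Hamiltonian.
            e_new = potts_energy(new, h, J)
            if e_new <= e or rng.random() < np.exp(e - e_new):  # Metropolis rule
                seq, e = new, e_new
        samples.append(seq)
    return np.array(samples)
```

Running each chain independently and keeping only its final state is what makes the drawn sequences independent of one another, at the cost of long chains to reach equilibrium.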
“…These alignments are the same as in Ref. [53], except that we discarded PF13354 (Beta-lactamase2) because of its smaller depth.…”
Section: Datasets (mentioning)
confidence: 99%