Preprint (2022)
DOI: 10.1101/2022.04.14.488405

Generative power of a protein language model trained on multiple sequence alignments

Abstract: Computational models starting from large ensembles of evolutionarily related protein sequences capture a representation of protein families and learn constraints associated with protein structure and function. They thus open the possibility of generating novel sequences belonging to protein families. Protein language models trained on multiple sequence alignments, such as MSA Transformer, are highly attractive candidates to this end. We propose and test an iterative method that directly uses the masked language…
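The truncated sentence refers to using MSA Transformer's masked language modeling objective iteratively for sequence generation. The sketch below is a rough, hypothetical illustration of how such an iterative masked-sampling loop over an MSA could be organized, not the authors' published procedure; the function name, masking fraction, temperature, and the generic logits_fn interface are all assumptions made for illustration.

import torch

def iterative_masked_sampling(tokens, logits_fn, mask_idx,
                              n_iters=20, mask_frac=0.1, temperature=1.0):
    """tokens    : LongTensor (num_seqs, seq_len) of residue token indices
       logits_fn : callable mapping tokens -> logits (num_seqs, seq_len, vocab_size)
       mask_idx  : integer id of the <mask> token in the model's vocabulary"""
    tokens = tokens.clone()
    num_seqs, seq_len = tokens.shape
    n_mask = max(1, int(mask_frac * seq_len))
    for _ in range(n_iters):
        # 1) mask a random subset of positions in every aligned sequence
        pos = torch.stack([torch.randperm(seq_len)[:n_mask] for _ in range(num_seqs)])
        masked = tokens.clone()
        masked.scatter_(1, pos, mask_idx)
        # 2) score the partially masked MSA with the masked language model
        probs = torch.softmax(logits_fn(masked) / temperature, dim=-1)
        # 3) resample only the masked positions from the predicted distributions
        sampled = torch.distributions.Categorical(probs=probs).sample()
        tokens.scatter_(1, pos, sampled.gather(1, pos))
    return tokens

# Toy usage with a random stand-in for a real masked language model
vocab_size, mask_idx = 26, 25
toy_msa = torch.randint(0, 20, (8, 50))                  # 8 aligned sequences of length 50
random_lm = lambda t: torch.randn(*t.shape, vocab_size)  # placeholder for MSA Transformer logits
generated_msa = iterative_masked_sampling(toy_msa, random_lm, mask_idx, n_iters=5)

In a real setting, logits_fn would wrap a pretrained MSA-aware language model, and the loop would be repeated until the statistics of the generated alignment stabilize.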

Cited by 4 publications (3 citation statements). References: 102 publications.
“…More generally, our results suggest that the performance of protein language models trained on MSAs could be assessed by evaluating not only how well they capture structural contacts, but also how well they capture phylogenetic relationships. In addition, the ability of protein language models to learn phylogeny could make them particularly well suited to generating synthetic MSAs capturing the data distribution of natural ones [50]. It also raises the question of their possible usefulness to infer phylogenies and evolutionary histories.…”
Section: Discussion (mentioning)
confidence: 99%
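One concrete, hypothetical way to probe the phylogenetic signal mentioned above is to compare the pairwise Hamming-distance distributions of a natural MSA and a model-generated one; the snippet below is only an illustrative proxy, and all names in it are assumptions.

import itertools
import numpy as np

def pairwise_hamming(msa):
    # msa: list of equal-length aligned sequences; returns normalized
    # Hamming distances over all sequence pairs
    length = len(msa[0])
    return np.array([sum(a != b for a, b in zip(s, t)) / length
                     for s, t in itertools.combinations(msa, 2)])

natural_msa   = ["MKTAYIA", "MKTAYLA", "MRTAYIA"]   # toy natural alignment
synthetic_msa = ["MKTAYIA", "MKSAYIA", "MRTCYIA"]   # toy model-generated alignment

# Similar histograms of pairwise distances suggest the generated MSA reproduces
# the spread of sequence divergences, one coarse facet of phylogenetic structure.
print(np.histogram(pairwise_hamming(natural_msa), bins=5, range=(0, 1))[0])
print(np.histogram(pairwise_hamming(synthetic_msa), bins=5, range=(0, 1))[0])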
“…et al [28] for an overview) may be suitable as well. On the other hand, it is unclear how well language model-based sequence generation protocols (see e.g., [66,67]) would perform, as they are usually trained and evaluated on full-length sequences rather than on fragments; here, they may lack key contextual information (e.g., that the fragments belong to disordered regions).…”
Section: Discussion (mentioning)
confidence: 99%
“…In pure sequence generation, protein language models (PLMs) can be conditioned on a known enzyme family to generate novel sequences with that function, without direct consideration of structure (Figure D). Models with transformer architectures have generated enzymes such as lysozymes, malate dehydrogenases, and chorismate mutases: for the best models, up to 80% of wet-lab validated sequences expressed and functioned. Some of these generated sequences have low sequence identity (<40%) to known proteins and may be quite different from those explored by evolution, thus potentially unlocking combinations of properties not found in nature. Variational autoencoders (VAEs) have been used to generate phenylalanine hydroxylases and luciferases, with wet-lab validation achieving 30–80% success rates. Generative adversarial networks (GANs) were also applied to the generation of malate dehydrogenases, with a 24% success rate.…”
Section: Discovery Of Functional Enzymes With Machine Learning (mentioning)
confidence: 99%
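For context on the “<40% sequence identity” figure quoted above, novelty of a generated sequence is often summarized as its percent identity to the closest known homolog, computed over an alignment. The snippet below is a simplified, hypothetical illustration that assumes pre-aligned, equal-length sequences and a particular gap convention.

def percent_identity(a, b):
    # fraction of aligned (non-gap) columns where the two sequences agree
    matches = sum(x == y and x != "-" for x, y in zip(a, b))
    aligned = sum(x != "-" and y != "-" for x, y in zip(a, b))
    return 100.0 * matches / max(aligned, 1)

known_seqs = ["MKT-AYIAKQR", "MKTGAYLAKQR"]   # toy set of known (aligned) sequences
candidate  = "MRSGAYIAQQR"                    # toy generated sequence
closest = max(percent_identity(candidate, k) for k in known_seqs)
print(f"identity to closest known homolog: {closest:.1f}%")  # <40% would count as low identity here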