Background: Despite the immense importance of transmembrane proteins (TMPs) for molecular biology and medicine, experimental 3D structures for TMPs remain about 4-5 times underrepresented compared to non-TMPs. Today's top methods can accurately predict structures for many TMPs, but the annotation of transmembrane regions remains a limiting step for proteome-wide predictions.
Results: Here, we present a novel method, dubbed TMbed. Inputting embeddings from protein Language Models (in particular ProtT5), TMbed completes predictions of alpha-helical and beta-barrel TMPs for entire proteomes within hours on a single consumer-grade desktop machine, at performance levels similar to or better than those of methods that use evolutionary information (extracted from family alignments). On the per-protein level, TMbed correctly identified 61 of the 65 beta-barrel TMPs (94±7%) and 579 of the 593 alpha-helical TMPs (98±1%) in a non-redundant data set, at false positive rates well below 1% (erring on 31 of 5859 non-membrane proteins). On the per-segment level, TMbed correctly placed, on average, 9 of 10 transmembrane segments within five residues of the experimental observation. Although limited by GPU memory, our method can handle sequences of up to 4200 residues on standard graphics cards used in common desktop PCs (e.g., NVIDIA GeForce RTX 3060).
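As context for the embedding input described above: ProtT5 models expect sequences with space-separated residues and the rare amino acids U, Z, O, and B mapped to the unknown token X before tokenization. The sketch below illustrates only this standard preprocessing convention; the helper name `preprocess_sequence` is our own illustration, not part of TMbed's API.

```python
import re

def preprocess_sequence(seq: str) -> str:
    """Uppercase, map rare amino acids (U, Z, O, B) to X,
    and space-separate residues for the ProtT5 tokenizer."""
    seq = seq.upper()
    seq = re.sub(r"[UZOB]", "X", seq)  # rare residues -> unknown token X
    return " ".join(seq)

# Example: a short fragment containing a selenocysteine (U)
print(preprocess_sequence("MKTUAQ"))  # -> "M K T X A Q"
```

The resulting string is what a ProtT5 tokenizer would consume to produce the per-residue embeddings that serve as TMbed's input.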
Conclusions: TMbed accurately predicts alpha-helical and beta-barrel TMPs. Utilizing protein Language Models and GPU acceleration, it can predict the entire human proteome in less than an hour.
Availability: Our code, method, and data sets are freely available in the GitHub repository: https://github.com/BernhoferM/TMbed.