Learned Embeddings from Deep Learning to Visualize and Predict Protein Sets

Dallago, Christian; Schütze, Konstantin; Heinzinger, Michael; Olenyi, Tobias; Littmann, Maria; Lu, Amy X.; Yang, Kevin; Min, Shungeng; Yoon, Sungroh; Morton, James T.; Rost, Burkhard

doi:10.1002/cpz1.113

Cited by 88 publications

(81 citation statements)

References 68 publications


“…The data set including predictions for the human proteome, the source code, and the trained model are available via GitHub (https://github.com/Rostlab/bindPredict). Embeddings can be generated using the bio_embeddings pipeline 38 . In addition, bindEmbed21DL is publicly available as a standalone method as part of bio_embeddings.…”

Section: Results (mentioning)

confidence: 99%


Protein embeddings and deep learning predict binding residues for various ligand classes

Littmann, Heinzinger, Dallago, et al. 2021. Preprint (self-citation).

One important aspect of protein function is the binding of proteins to ligands, including small molecules, metal ions, and macromolecules such as DNA or RNA. Despite decades of experimental progress many binding sites remain obscure. Here, we proposed bindEmbed21, a method predicting whether a protein residue binds to metal ions, nucleic acids, or small molecules. The Artificial Intelligence (AI)-based method exclusively uses embeddings from the Transformer-based protein Language Model ProtT5 as input. Using only single sequences without creating multiple sequence alignments (MSAs), bindEmbed21DL outperformed existing MSA-based methods. Combination with homology-based inference increased performance to F1=29±6%, F1=24±7%, and F1=41±% for metal ions, nucleic acids, and small molecules, respectively; it reached F1=45±2% when merging all three ligand classes into one. Focusing on very reliably predicted residues could complement experimental evidence: for the 25% most strongly predicted binding residues, at least 73% were correctly predicted even when counting missing annotations as incorrect. The new method bindEmbed21 is fast, simple, and broadly applicable, using neither structure nor MSAs. Thereby, it found binding residues in over 42% of all human proteins not otherwise implied in binding.


“…All data, the source code, and the trained model are available via GitHub (https://github.com/Rostlab/bindPredict). Embeddings can be generated using the bio_embeddings pipeline 40 . In addition, bindEmbed21 and its components bindEmbed21DL and bindEmbed21HBI are publicly available through bio_embeddings.…”

Section: Littmann Et Al and B Rost (mentioning)

confidence: 99%

Protein embeddings and deep learning predict binding residues for various ligand classes

Littmann, Heinzinger, Dallago, et al. 2021. Preprint (self-citation).

“…Machine learning models have been used to solve biological problems such as predicting solubility of proteins, targeting subcellular localizations, folding and more [58][59][60][61][62][63]. In this paper, we develop a machine learning model that utilizes protein sequence information, which can classify residues in mechanically stable and unstable substructures.…”

Section: Discussion (mentioning)

confidence: 99%

Classifying Residues in Mechanically Stable and Unstable Substructures Based on a Protein Sequence: The Case Study of the DnaK Hsp70 Chaperone

Gala 2021. Nanomaterials.

Artificial proteins can be constructed from stable substructures, whose stability is encoded in their protein sequence. Identifying stable protein substructures experimentally is the only available option at the moment because no suitable method exists to extract this information from a protein sequence. In previous research, we examined the mechanics of E. coli Hsp70 and found four mechanically stable (S class) and three unstable substructures (U class). Of the total 603 residues in the folded domains of Hsp70, 234 residues belong to one of four mechanically stable substructures, and 369 residues belong to one of three unstable substructures. Here our goal is to develop a machine learning model to categorize Hsp70 residues using sequence information. We applied three supervised methods: logistic regression (LR), random forest, and support vector machine. The LR method showed the highest accuracy, 0.925, to predict the correct class of a particular residue only when context-dependent physico-chemical features were included. The cross-validation of the LR model yielded a prediction accuracy of 0.879 and revealed that most of the misclassified residues lie at the borders between substructures. We foresee machine learning models being used to identify stable substructures as candidates for building blocks to engineer new proteins.


“…All newly developed prediction methods exclusively used embeddings from pretrained protein LMs, namely from ProtBert (Elnaggar et al 2021) based on the NLP (Natural Language Processing) algorithm BERT (Devlin et al 2019) trained on the BFD database with over 2.3 million protein sequences (Steinegger and Söding 2018), and ProtT5-XL-U50 (Elnaggar et al 2021) (for simplicity referred to as ProtT5) based on the NLP method T5 (Raffel et al 2020) trained on BFD and fine-tuned on Uniref50 (The UniProt Consortium 2021). All embeddings were obtained from the bio_embeddings pipeline (Dallago et al 2021). The per-residue embeddings were extracted from the last hidden layer of the models with size 1024×L, where L is the length of the protein sequence and 1024 is the size of the embedding space of ProtBert and ProtT5.…”

Section: Methods (mentioning)

confidence: 99%
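
The shape described in the statement above (one 1024-dimensional vector per residue, for a sequence of length L) can be illustrated with a minimal sketch. It is not the bio_embeddings API and not the citing authors' code: the random numbers stand in for real model output, the sequence is hypothetical, the matrix is stored as L rows of 1024 values (the transpose of the 1024×L notation), and the mean-pooling step is an assumption added here to show one common way of reducing per-residue vectors to a single per-protein vector.

```python
import random

EMBEDDING_DIM = 1024  # embedding size given in the citation statement


def mean_pool(per_residue):
    """Average L per-residue vectors into one per-protein vector.

    Mean pooling is an illustrative assumption, not a step taken
    from the citation statement.
    """
    if not per_residue:
        raise ValueError("need at least one residue")
    if any(len(row) != EMBEDDING_DIM for row in per_residue):
        raise ValueError("every residue vector must have 1024 values")
    length = len(per_residue)
    # zip(*rows) iterates over the 1024 embedding dimensions (columns).
    return [sum(column) / length for column in zip(*per_residue)]


random.seed(0)
sequence = "MKTAYIAKQR"  # hypothetical 10-residue sequence

# Placeholder values: a real pipeline would return model activations here.
per_residue = [
    [random.gauss(0.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for _ in sequence
]
per_protein = mean_pool(per_residue)

print(len(per_residue), len(per_residue[0]), len(per_protein))
# prints: 10 1024 1024
```

The per-residue matrix grows with sequence length, while the pooled vector has a fixed size, which is why the two forms suit residue-level and protein-level prediction tasks respectively.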

“…Embeddings from pLMs. For conservation prediction, we used embeddings from the following pLMs: (Dallago et al 2021). As described in ProtTrans, only the encoder-side of ProtT5 was used and embeddings were extracted in half-precision.…”

Section: Input Features (mentioning)

confidence: 99%

Embeddings from protein language models predict conservation and variant effects

Marquet, Heinzinger, Olenyi, et al. 2021. Preprint (self-citation).

The emergence of SARS-CoV-2 variants stressed the demand for tools that allow interpreting the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of single proteins, the results continue to challenge analyses. Protein Language Models (pLMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we used pLM representations (embeddings) to predict sequence conservation and SAV effects without multiple sequence alignments (MSAs). Embeddings alone predicted residue conservation almost as accurately from single sequences as ConSeq using MSAs (two-state Matthews Correlation Coefficient, MCC, for ProtT5 embeddings of 0.596 ± 0.006 vs. 0.608 ± 0.006 for ConSeq). Inputting the conservation prediction along with BLOSUM62 substitution scores and pLM mask reconstruction probabilities into a simplistic logistic regression (LR) ensemble for Variant Effect Scoring without alignments (VESPA) predicted SAV effect magnitude without any optimization on DMS data. Comparing predictions for a standard set of 39 DMS experiments to other methods (incl. ESM-1v, DeepSequence, and GEMME) revealed our approach as competitive with the state-of-the-art (SOTA) methods using MSA input. No method outperformed all others, neither consistently nor in a statistically significant manner, independently of the performance measure applied (incl. two-state accuracy: Q2, MCC, Spearman and Pearson correlation). Lastly, we investigated binary effect predictions on DMS experiments for four human proteins. Overall, embedding-based methods have become competitive with methods relying on MSAs for SAV effect prediction at a fraction of the costs in computing/energy.
Our method predicted SAV effects for the entire human proteome (~ 20k proteins) within 40 minutes on one Nvidia Quadro RTX 8000. All methods and data sets are freely available for local and online execution through bioembeddings.com, https://github.com/Rostlab/VESPA, and PredictProtein.
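
The two-state MCC quoted in the abstract above is a standard measure computed from the four confusion-matrix counts. The sketch below shows that textbook formula only; the function name and the toy labels are hypothetical and are not taken from VESPA or its data sets.

```python
import math


def two_state_mcc(observed, predicted):
    """Matthews correlation coefficient for binary (0/1) labels."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == 1 and p == 1)
    tn = sum(1 for o, p in pairs if o == 0 and p == 0)
    fp = sum(1 for o, p in pairs if o == 0 and p == 1)
    fn = sum(1 for o, p in pairs if o == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Toy labels (hypothetical): 1 = conserved residue, 0 = variable residue.
observed = [1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 0, 1, 1, 1, 0]

mcc = two_state_mcc(observed, predicted)
print(mcc)  # 3 TP, 3 TN, 1 FP, 1 FN -> (9 - 1) / 16 = 0.5
```

MCC ranges from -1 to 1, with 0 corresponding to a random prediction, so the reported values near 0.6 indicate a clear but imperfect agreement with the observed conservation classes.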

