Pred‐MutHTP: Prediction of disease‐causing and neutral mutations in human transmembrane proteins

Kulandaisamy, A.; Zaucha, Jan; Sakthivel, R.; Frishman, Dmitrij; Gromiha, M. Michael

doi:10.1002/humu.23961

Cited by 25 publications

(28 citation statements)

References 67 publications

(104 reference statements)

“…Top prediction methods in computational biology [15], [16], [17], [18], [19], [20] combine machine learning (ML) and evolutionary information (EI), first established as the winning strategy to predict protein secondary structure [21], [22] in two steps. First, search for a family of related proteins summarized as multiple sequence alignment (MSA) and extract the evolutionary information contained in this alignment.…”

Section: Introduction (mentioning)

confidence: 99%

ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning

Elnaggar, Heinzinger, Dallago, et al. 2022

IEEE Trans. Pattern Anal. Mach. Intell.

Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models taken from NLP. These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The LMs were trained on the Summit supercomputer using 5616 GPUs and TPU Pod up-to 1024 cores. Dimensionality reduction revealed that the raw protein LM-embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks. The first was a per-residue prediction of protein secondary structure (3-state accuracy Q3=81%-87%); the second were per-protein predictions of protein sub-cellular localization (ten-state accuracy: Q10=81%) and membrane vs. water-soluble (2-state accuracy Q2=91%). For the per-residue predictions the transfer of the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without using evolutionary information thereby bypassing expensive database searches. Taken together, the results implied that protein LMs learned some of the grammar of the language of life. To facilitate future work, we released our models at https://github.com/agemagician/ProtTrans.

Section: Introduction (mentioning)

confidence: 99%

ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Learning

Elnaggar, Heinzinger, Dallago, et al. 2020

Preprint

Motivation: Natural Language Processing (NLP) continues improving substantially through auto-regressive (AR) and auto-encoding (AE) Language Models (LMs). These LMs require expensive computing resources for self-supervised or un-supervised learning from huge unlabelled text corpora. The information learned is transferred through so-called embeddings to downstream prediction tasks. Computational biology and bioinformatics provide vast gold-mines of structured and sequentially ordered text data leading to extraordinarily successful protein sequence LMs that promise new frontiers for generative and predictive tasks at low inference cost. As recent NLP advances link corpus size to model size and accuracy, we addressed two questions: (1) To which extent can High-Performance Computing (HPC) up-scale protein LMs to larger databases and larger models? (2) To which extent can LMs extract features from single proteins to get closer to the performance of methods using evolutionary information? Methodology: Here, we trained two auto-regressive language models (Transformer-XL and XLNet) and two auto-encoder models (BERT and Albert) on 80 billion amino acids from 200 million protein sequences (UniRef100) and one language model (Transformer-XL) on 393 billion amino acids from 2.1 billion protein sequences taken from the Big Fat Database (BFD), today's largest set of protein sequences (corresponding to 22- and 112-times, respectively of the entire English Wikipedia). The LMs were trained on the Summit supercomputer, using 936 nodes with 6 GPUs each (in total 5616 GPUs) and one TPU Pod, using V3-512 cores. Results: We validated the feasibility of training big LMs on proteins and the advantage of up-scaling LMs to larger models supported by more data. The latter was assessed by predicting secondary structure in three- and eight-states (Q3=75-83, Q8=63-72), localization for 10 cellular compartments (Q10=74) and whether a protein is membrane-bound or water-soluble (Q2=89). 
Dimensionality reduction revealed that the LM-embeddings from unlabelled data (only protein sequences) captured important biophysical properties of the protein alphabet, namely the amino acids, and their well orchestrated interplay in governing the shape of proteins. In the analogy of NLP, this implied having learned some of the grammar of the language of life realized in protein sequences. The successful up-scaling of protein LMs through HPC slightly reduced the gap between models trained on evolutionary information and LMs. Additionally, our results highlighted the importance of bi-directionality when processing proteins as the uni-directional TransformerXL was outperformed by its bi-directional counterparts. Availability: ProtTrans, https://github.com/agemagician/ProtTrans

“…A handful of popular methods are available to predict the effect of variations [45,46] or to highlight vulnerable regions in proteins [47,48], yet most of these are based on purely statistical approaches. Methods incorporating structural information are largely limited to general features of PDB structures, or prediction of transmembrane domains or disordered segments, although no currently available methods incorporate features of coiled-coils.…”

Section: Results (mentioning)

confidence: 99%

Distribution of disease-causing germline mutations in coiled-coils implies an important role of their N-terminal region

Kálmán, Mészáros, Gáspári, et al. 2020

Sci Rep

Next-generation sequencing resulted in the identification of a huge number of naturally occurring variations in human proteins. The correct interpretation of the functional effects of these variations necessitates the understanding of how they modulate protein structure. Coiled-coils are α-helical structures responsible for a diverse range of functions, but most importantly, they facilitate the structural organization of macromolecular scaffolds via oligomerization. In this study, we analyzed a comprehensive set of disease-associated germline mutations in coiled-coil structures. Our results suggest an important role of residues near the N-terminal part of coiled-coil regions, possibly critical for superhelix assembly and folding in some cases. We also show that coiled-coils of different oligomerization states exhibit characteristically distinct patterns of disease-causing mutations. Our study provides structural and functional explanations on how disease emerges through the mutation of these structural motifs.
