Improving protein succinylation sites prediction using embeddings from protein language model

Pokharel, Suresh; Pratyush, Pawel; Heinzinger, Michael; Newman, Robert H.; Kc, Dukka B.

doi:10.1038/s41598-022-21366-2

Cited by 42 publications (38 citation statements)

References 43 publications (53 reference statements)

“…In that regard, amino acids (words) can be represented by dense vectors using word embeddings where a vector represents the projection of the amino acid into a continuous vector space. We used Keras’s embedding layer [ 30 ], as in LMSuccSite [ 27 ], to implement supervised word embedding where the embedding is learned as a part of training a deep learning model. The process of parameter learning in this approach is supervised; the parameters are updated with subsequent layers during the learning process under the supervision of a label.…”

Section: Methods (mentioning)

confidence: 99%
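
The supervised word embedding described in the statement above can be pictured as a lookup table from integer-encoded residues to dense vectors. The following is a minimal standard-library sketch of that idea only; it is not the Keras layer the citing authors used, and the names `AA_TO_INDEX`, `encode`, `init_embedding_table`, and `embed`, the reserved index 0, and the embedding size of 4 are assumptions made for illustration. In a real model the table entries would be updated by backpropagation under the site labels.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption of this sketch: index 0 is reserved for unknown or padding residues.
AA_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode(window):
    """Map each residue in a sequence window to an integer index."""
    return [AA_TO_INDEX.get(aa, 0) for aa in window]


def init_embedding_table(vocab_size, dim, seed=0):
    """Create a small random table; training would adjust these values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(vocab_size)]


def embed(indices, table):
    """Look up one dense vector per residue index."""
    return [table[i] for i in indices]


table = init_embedding_table(vocab_size=len(AMINO_ACIDS) + 1, dim=4)
window = "MKCAX"  # hypothetical window; 'X' is treated as unknown
indices = encode(window)
vectors = embed(indices, table)
```

Each residue in the window becomes one row of `vectors`, which is the kind of input a downstream layer would consume.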

“…Recently, these embeddings have been shown to be beneficial in various structural bioinformatics tasks including but not limited to secondary structure prediction and subcellular location, among others. In that regard, in this work, we use pLM ProtT5 [ 21 , 27 ] as a static feature encoders to extract per residue embeddings for protein sequences for which we are predicting S-nitrosylation sites. It is relevant to note that the input to ProtT5 is the overall protein sequence.…”

Section: Methods (mentioning)

confidence: 99%

“…Using ProtT5, the per-residue embeddings were extracted from the last hidden layer of the encoder model with the size of L x1024, where L is the size of the protein using the overall protein sequence as the input. As suggested by ProtTrans [ 26 ], LMSuccSite [ 27 ], the encoder side of ProtT5 was used, and embeddings were extracted in half-precision. For our purpose, as the per-residue embeddings are a contextualized representation, we only used the 1024 length embeddings for the site of interrogation (aka cystine ‘C’).…”

Section: Methods (mentioning)

confidence: 99%
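
The selection step in the statement above (keeping only the 1024-length vector at each candidate site out of an L x 1024 matrix) can be sketched as follows. This is an illustrative sketch under stated assumptions: `fake_per_residue_embeddings` is a hypothetical stand-in for the ProtT5 encoder output, not a real ProtT5 call, and `site_embeddings` is a name introduced here.

```python
EMBED_DIM = 1024  # per-residue embedding size stated in the citation statement


def fake_per_residue_embeddings(sequence, dim=EMBED_DIM):
    """Hypothetical stand-in for encoder output: one vector per residue (L x dim)."""
    return [[float(pos)] * dim for pos in range(len(sequence))]


def site_embeddings(sequence, embeddings, residue="C"):
    """Keep only the embedding rows at positions holding the target residue."""
    if len(embeddings) != len(sequence):
        raise ValueError("expected one embedding row per residue")
    return {pos: embeddings[pos]
            for pos, aa in enumerate(sequence) if aa == residue}


sequence = "MACKCLC"  # hypothetical sequence with 'C' at 0-based positions 2, 4, 6
embeddings = fake_per_residue_embeddings(sequence)
sites = site_embeddings(sequence, embeddings)
```

Because the per-residue vectors are contextualized, each selected row already reflects the whole input sequence, which is why no surrounding window is extracted in this sketch.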

“…CNNs are less computationally intensive models than sequence-oriented models and facilitate the training of deeper networks as significantly fewer parameters are needed to be learned. The usage of CNNs is prevalent in several PTM prediction tasks [ 13 , 15 , 27 ]. In our case, we use CNN to process the feature representation of the protein sequence obtained from the word embedding layer as described in the previous section.…”

Section: Methods (mentioning)

confidence: 99%
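
To make the CNN step above concrete, here is a minimal sketch of a single 1D convolution filter sliding over a sequence of embedded residues, followed by ReLU and global max pooling. It is a generic illustration, not the citing paper's architecture; the function names, the toy 2-dimensional embeddings, and the kernel values are all assumptions.

```python
def conv1d_valid(inputs, kernel, bias=0.0):
    """Slide one filter over the sequence axis without padding.

    inputs: list of T vectors (one per residue), each of dimension D.
    kernel: list of K weight vectors of dimension D.
    Returns T - K + 1 ReLU-activated values.
    """
    width = len(kernel)
    feature_map = []
    for start in range(len(inputs) - width + 1):
        total = bias
        for offset in range(width):
            total += sum(x * w for x, w in
                         zip(inputs[start + offset], kernel[offset]))
        feature_map.append(max(0.0, total))  # ReLU
    return feature_map


def global_max_pool(feature_map):
    """Reduce a feature map to its single strongest response."""
    return max(feature_map)


# Toy input: 4 residues with 2-dimensional embeddings (hypothetical values).
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernel = [[1.0, -1.0], [0.5, 0.5]]  # one filter spanning 2 residues
feature_map = conv1d_valid(inputs, kernel)
pooled = global_max_pool(feature_map)
```

The same few kernel weights are reused at every position, which is the source of the parameter savings the statement mentions.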

“…These advances are now being explored in proteins through the development of various protein language models (pLMs) [ 21 – 24 ]. The representations (embeddings) extracted from these transformer-based language models have been successful for various downstream bioinformatics prediction tasks [ 25 – 27 ], suggesting that the huge amount of information learned by these pLMs can be transferred to other tasks by extracting embeddings from these pLMs and using these embeddings as an input to predict other properties of protein.…”

Section: Introduction (mentioning)

confidence: 99%

pLMSNOSite: an ensemble-based approach for predicting protein S-nitrosylation sites by integrating supervised word embedding and embedding from pre-trained protein language model

et al. 2023

Self Cite

Background: Protein S-nitrosylation (SNO) plays a key role in transferring nitric oxide-mediated signals in both animals and plants and has emerged as an important mechanism for regulating protein functions and cell signaling of all main classes of protein. It is involved in several biological processes including immune response, protein stability, transcription regulation, post translational regulation, DNA damage repair, redox regulation, and is an emerging paradigm of redox signaling for protection against oxidative stress. The development of robust computational tools to predict protein SNO sites would contribute to further interpretation of the pathological and physiological mechanisms of SNO.

Results: Using an intermediate fusion-based stacked generalization approach, we integrated embeddings from supervised embedding layer and contextualized protein language model (ProtT5) and developed a tool called pLMSNOSite (protein language model-based SNO site predictor). On an independent test set of experimentally identified SNO sites, pLMSNOSite achieved values of 0.340, 0.735 and 0.773 for MCC, sensitivity and specificity respectively. These results show that pLMSNOSite performs better than the compared approaches for the prediction of S-nitrosylation sites.

Conclusion: Together, the experimental results suggest that pLMSNOSite achieves significant improvement in the prediction performance of S-nitrosylation sites and represents a robust computational approach for predicting protein S-nitrosylation sites. pLMSNOSite could be a useful resource for further elucidation of SNO and is publicly available at https://github.com/KCLabMTU/pLMSNOSite.
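
The three metrics reported in the abstract follow their standard definitions over a binary confusion matrix, and the sketch below shows how they are computed. The counts in the example are hypothetical and are not the paper's test-set counts; the function names are introduced here for illustration.

```python
import math


def sensitivity(tp, fn):
    """Fraction of true sites that are predicted as sites."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    """Fraction of non-sites that are predicted as non-sites."""
    return tn / (tn + fp) if (tn + fp) else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical confusion counts, chosen only to exercise the formulas.
tp, fn, tn, fp = 70, 30, 80, 20
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
score = mcc(tp, tn, fp, fn)
```

MCC uses all four confusion-matrix cells, so on test sets where non-sites greatly outnumber sites it can stay modest even when sensitivity and specificity both look high.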
