HLAB: learning the BiLSTM features from the ProtBert-encoded proteins for the class I HLA-peptide binding prediction

Zhang, Yaqi; Zhu, Gancheng; Li, Kewei; Li, Fēi; Huang, Lan; Duan, Meiyu; Zhou, Fengfeng

doi:10.1093/bib/bbac173

Cited by 26 publications

(16 citation statements)

References 46 publications

“…The receiver operating characteristic curve (ROC) is a curve drawn according to a series of different classification methods (boundary value or decision threshold), with the true positive rate (sensitivity) as the ordinate and false positive rate (specificity) as the abscissa. ROC displays the relationship between true positives and false positives at different confidence levels [ 12 , 35 , 49 ]. Nevertheless, the ROC curve cannot clearly indicate which classifier is more superior.…”

Section: Methods (mentioning)

confidence: 99%
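The threshold sweep described in the citation statement above can be illustrated with a minimal sketch. It is not code from the cited paper: the function names `roc_points` and `auc` and the toy labels and scores are hypothetical, chosen only for illustration. The sketch plots nothing; it computes the (false positive rate, true positive rate) pairs and the area under the curve, a single-number summary often used to compare classifiers. Note that the false positive rate equals 1 − specificity.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs obtained by lowering the decision threshold.

    labels: iterable of 0/1 ground-truth classes (1 = positive).
    scores: iterable of classifier scores, higher = more likely positive.
    """
    labels = list(labels)
    scores = list(scores)
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")

    # Rank samples from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples tied at the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    # Toy data, made up for illustration only.
    toy_labels = [1, 1, 0, 1, 0, 0]
    toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
    curve = roc_points(toy_labels, toy_scores)
    print(curve)
    print(round(auc(curve), 4))
```

Handling tied scores as a single step keeps the curve independent of the order in which tied samples happen to appear.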

“…With a global receptive field, BERT can effectively capture more global context information than the convolutional neural network-based models. Recently, BERT has achieved gratifying results in the prediction of various functional peptides, such as bitter peptides [ 33 ], antimicrobial peptides [ 34 ], and human leukocyte antigen peptides [ 35 ]. Soft symmetric alignment (SSA) has defined a brand-new method to compare arbitrary-length sequences within vectors [ 36 ].…”

Section: Introduction (mentioning)

confidence: 99%

iUP-BERT: Identification of Umami Peptides Based on BERT Features

Jiang, Wang, et al. 2022, Foods

Umami is an important widely-used taste component of food seasoning. Umami peptides are specific structural peptides endowing foods with a favorable umami taste. Laboratory approaches used to identify umami peptides are time-consuming and labor-intensive, which are not feasible for rapid screening. Here, we developed a novel peptide sequence-based umami peptide predictor, namely iUP-BERT, which was based on the deep learning pretrained neural network feature extraction method. After optimization, a single deep representation learning feature encoding method (BERT: bidirectional encoder representations from transformer) in conjugation with the synthetic minority over-sampling technique (SMOTE) and support vector machine (SVM) methods was adopted for model creation to generate predicted probabilistic scores of potential umami peptides. Further extensive empirical experiments on cross-validation and an independent test showed that iUP-BERT outperformed the existing methods with improvements, highlighting its effectiveness and robustness. Finally, an open-access iUP-BERT web server was built. To our knowledge, this is the first efficient sequence-based umami predictor created based on a single deep-learning pretrained neural network feature extraction method. By predicting umami peptides, iUP-BERT can help in further research to improve the palatability of dietary supplements in the future.

“…It is well known that the position of amino acids in a peptide is essential information, which affects and even determines the spatial structure and function of a peptide. Models designed to process natural language generally have the ability to extract contextual information and have been applied to peptide processing [28][29][30]. Therefore, we explore whether CNN can be combined with a natural language processing model to bring better performance.…”

Section: Introduction (mentioning)

confidence: 99%

CcBHLA: pan-specific peptide–HLA class I binding prediction via Convolutional and BiLSTM features

Cao et al. 2023, Preprint

Human major histocompatibility complex (MHC) proteins are encoded by the human leukocyte antigen (HLA) gene complex. When exogenous peptide fragments form peptide-HLA (pHLA) complexes with HLA molecules on the outer surface of cells, they can be recognized by T cells and trigger an immune response. Therefore, determining whether an HLA molecule can bind to a given peptide can improve the efficiency of vaccine design and facilitate the development of immunotherapy. This paper regards peptide fragments as natural language, we combine textCNN and BiLSTM to build a deep neural network model to encode the sequence features of HLA and peptides. Results on independent and external test datasets demonstrate that our CcBHLA model outperforms the state-of-the-art known methods in detecting HLA class I binding peptides. And the method is not limited by the HLA class I allele and the length of the peptide fragment. Users can download the model for binding peptide screening or retrain the model with private data on github (https://github.com/hongliangduan/CcBHLA-pan-specific-peptide-HLA-class-I-binding-prediction-via-Convolutional-and-BiLSTM-features.git).

“…This nonsequential method of training could be relevant to collagen, where short-range (sequential) and long-range (nonsequential) interactions play a role in the structure. The transformer framework has increasingly become the model of choice for NLP-type of problems in language and science applications and has most recently been used in AlphaFold 2 to predict protein structures. While transformer models are powerful, since they can be generalized to a variety of applications and modalities (sequence regression problems, sequence to sequence translation, such as secondary structure prediction, and other needs including field predictions), they can also be difficult to train and often require large amounts of data. This has been exemplified in recent developments of very large language models based on these architectures, sometimes reaching hundreds of billions of parameters. Further, to our best knowledge, while a few very recent examples exist of the application of these transformer models to predict the structure or binding properties of some other protein systems, they have thus far not been used to directly predict biophysical properties of proteins.…”

Section: Introduction (mentioning)

confidence: 99%

CollagenTransformer: End-to-End Transformer Model to Predict Thermal Stability of Collagen Triple Helices Using an NLP Approach

Khare, González‐Obeso, Kaplan, et al. 2022, ACS Biomater. Sci. Eng.

Collagen is one of the most important structural proteins in biology, and its structural hierarchy plays a crucial role in many mechanically important biomaterials. Here, we demonstrate how transformer models can be used to predict, directly from the primary amino acid sequence, the thermal stability of collagen triple helices, measured via the melting temperature Tm. We report two distinct transformer architectures to compare performance. First, we train a small transformer model from scratch, using our collagen data set featuring only 633 sequence-to-Tm pairings. Second, we use a large pretrained transformer model, ProtBERT, and fine-tune it for a particular downstream task by utilizing sequence-to-Tm pairings, using a deep convolutional network to translate natural language processing BERT embeddings into required features. Both the small transformer model and the fine-tuned ProtBERT model have similar R² values of test data (R² = 0.84 vs 0.79, respectively), but the ProtBERT is a much larger pretrained model that may not always be applicable for other biological or biomaterials questions. Specifically, we show that the small transformer model requires only 0.026% of the number of parameters compared to the much larger model but reaches almost the same accuracy for the test set. We compare the performance of both models against 71 newly published sequences for which Tm has been obtained as a validation set and find reasonable agreement, with ProtBERT outperforming the small transformer model. The results presented here are, to our best knowledge, the first demonstration of the use of transformer models for relatively small data sets and for the prediction of specific biophysical properties of interest. We anticipate that the work presented here serves as a starting point for transformer models to be applied to other biophysical problems.
