Prediction of Protein pKa with Representation Learning

Gökcan, Hatice; Isayev, Olexandr

doi:10.26434/chemrxiv-2021-tcn0f

Cited by 1 publication

(2 citation statements)

References 106 publications

(113 reference statements)

“…The RMSE values are summarized in Table 4, which also shows RMSE values for five other pKa predictors: DelPhiPKa, a popular continuum electrostatic pKa prediction method [122]; PypKa, a python module calculating pKa values by continuum electrostatic method [57]; DeepKa, a deep-learning-based pKa predictor trained on pKa values derived from continuous constant-pH simulations [27]; pKAI, a deep learning model trained on pKa values calculated by PypKa [104]; and a pKa predictor based on deep representation learning and trained on experimental pKa values, which we will refer to as DRL [28]. Because DelPhiPKa and DeepKa only predict the pKa values of Asp, Glu, His and Lys (DEHK) residues, and PypKa and DRL only predict for Asp, Glu, His, Lys, and Tyr (DEHK + Y) residues, we also show DEHK and "DEHK + Y" RMSE values in Table 4. The DEHK RMSE of the XGB-WMa model is 0.63.…”

Section: Discussion (mentioning)

confidence: 99%

“…[104] Another protein pKa prediction paper from Gokcan and Isayev introduced a new empirical scheme based on deep representation learning that was trained on experimental pKa data [28]. We chose to use the prevalent tree-based ML models in this work because of their robustness and well-known good performance on various tasks. We noticed that support vector machine and cascade deep forest could perform well on small datasets.…”

Section: Introduction (mentioning)

confidence: 99%

Protein pK_a Prediction by Tree-Based Machine Learning

Chen, Lee, Damjanović, et al. 2022. J. Chem. Theory Comput.

Protonation states of ionizable protein residues modulate many essential biological processes. For correct modeling and understanding of these processes, it is crucial to accurately determine their pKa values. Here, we present four tree-based machine learning models for protein pKa prediction. The four models, Random Forest, Extra Trees, eXtreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM), were trained on three experimental PDB and pKa datasets, two of which included a notable portion of internal residues. We observed similar performance among the four machine learning algorithms. The best model trained on the largest dataset performs 37% better than the widely used empirical pKa prediction tool PROPKA and 15% better than the published result from the pKa prediction method DelPhiPKa. The overall root-mean-square error (RMSE) for this model is 0.69, with surface and buried RMSE values being 0.56 and 0.78, respectively, considering six residue types (Asp, Glu, His, Lys, Cys, and Tyr), and 0.63 when considering Asp, Glu, His, and Lys only. We provide pKa predictions for proteins in human proteome from the AlphaFold Protein Structure Database and observed that 1% of Asp/Glu/Lys residues have highly shifted pKa values close to the physiological pH.
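
The abstract reports RMSE both over all six residue types and over the Asp/Glu/His/Lys (DEHK) subset. As a minimal sketch of how such subset RMSE values can be computed, the Python below uses made-up records; the `records` data and the helper names `rmse` and `rmse_for` are assumptions for illustration only and do not come from the paper or its code.

```python
import math

def rmse(predicted, experimental):
    """Root-mean-square error between two equal-length sequences of pKa values."""
    if len(predicted) != len(experimental) or len(predicted) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(p - e) ** 2 for p, e in zip(predicted, experimental)]
    return math.sqrt(sum(squared) / len(squared))

def rmse_for(records, residue_types=None):
    """RMSE over records, optionally restricted to a set of residue types."""
    selected = [(p, e) for res, p, e in records
                if residue_types is None or res in residue_types]
    if not selected:
        raise ValueError("no records match the requested residue types")
    preds, exps = zip(*selected)
    return rmse(preds, exps)

# Hypothetical (residue type, predicted pKa, experimental pKa) records.
# These numbers are invented for illustration; they are not from the paper.
records = [
    ("ASP", 3.5, 3.9),
    ("GLU", 4.6, 4.2),
    ("HIS", 6.9, 6.5),
    ("LYS", 10.0, 10.4),
    ("CYS", 8.0, 9.0),
    ("TYR", 10.5, 9.5),
]

DEHK = {"ASP", "GLU", "HIS", "LYS"}

overall = rmse_for(records)        # all six residue types
dehk = rmse_for(records, DEHK)     # Asp, Glu, His, Lys only

print(f"overall RMSE: {overall:.2f}")
print(f"DEHK RMSE:    {dehk:.2f}")
```

Restricting the records before computing the error, rather than averaging per-residue RMSE values, keeps each subset figure a true RMSE over the residues that a given predictor actually covers.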