2022
DOI: 10.1021/acs.jctc.1c01257

Protein pKa Prediction by Tree-Based Machine Learning

Abstract: Protonation states of ionizable protein residues modulate many essential biological processes. For correct modeling and understanding of these processes, it is crucial to accurately determine their pKa values. Here, we present four tree-based machine learning models for protein pKa prediction. The four models, Random Forest, Extra Trees, eXtreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM), were trained on three experimental PDB and pKa datasets, two of which included a notabl…

Cited by 20 publications (45 citation statements)
References: 131 publications
“…It also points to research projects that should be reconsidered. The richness of high quality data that are being compiled in databases (e.g., refs and − ) is already strengthening studies that require protein structures, such as mapping binding sites and interactions in signaling pathways, and identification of hot spots, including latent and rare cancer driver mutations. The most profound impact will likely be in accelerating and improving production of new medications (e.g., ref ), and in generating data that can be used toward this vital aim (e.g., refs , , , and ). AI developments and applications may further help foretell whether the signal propagating downstream will be strong enough to reach its genomic target to activate (suppress) gene expression, and predict pathways. Altogether, these powerful approaches and the databases that they create revamp and transform traditional and ongoing research involving the use of structures.…”
Section: Introduction (mentioning)
confidence: 99%
“…Both these issues contribute to the risk of model overfitting and poor generalizability. Chen et al trained tree-based machine learning models, such as XGBoost or LightGBM, on experimental data, and their best model exhibited an RMSE of 0.69. To compare pKAI with these models and illustrate the data leakage problem at hand, we have refined our pKAI model by training it on same data split reported in ref .…”
Section: Discussion (mentioning)
confidence: 99%
“…As a comparison, in PROPKA3, only 85 experimental values of aspartate and glutamate residues were used to fit 6 parameters. Recently, traditional ML models have been trained on ∼1500 experimental pKa values. However, testing the real-world performances of such methods is difficult, as there is a high degree of similarity among available experimental data. Our larger data set translates into more diversity in terms of protein and residue types and, more importantly, a wider variety of residue environments.…”
Section: Methods (mentioning)
confidence: 99%
“…Fortunately, a rather successful empirical method for estimating pKas is available called PropKa (Li et al, 2005); it uses a protein 3D structure and takes into account factors like burial, which tends to favor the neutral state and the proximity of other charged groups to calculate approximate pKa values. Thanks to AlphaFold 2, the availability of accurate 3D structures has now enabled the complete calculation of all titratable residues in the whole human proteome (Chen et al, 2022).…”
Section: New Horizons (mentioning)
confidence: 99%