The Statistical Assessment of Modeling of Proteins and Ligands (SAMPL) challenges focuses the computational modeling community on areas in need of improvement for rational drug design. The SAMPL7 physical property challenge dealt with prediction of octanol-water partition coefficients and pKa for 22 compounds. The dataset was composed of a series of N-acylsulfonamides and related bioisosteres. 17 research groups participated in the log P challenge, submitting 33 blind submissions total. For the pKa challenge, 7 different groups participated, submitting 9 blind submissions in total. Overall, the accuracy of octanol-water log P predictions in the SAMPL7 challenge was lower than octanol-water log P predictions in SAMPL6, likely due to a more diverse dataset. Compared to the SAMPL6 pKa challenge, accuracy remains unchanged in SAMPL7. Interestingly, here, though macroscopic pKa values were often predicted with reasonable accuracy, there was dramatically more disagreement among participants as to which microscopic transitions produced these values (with methods often disagreeing even as to the sign of the free energy change associated with certain transitions), indicating far more work needs to be done on pKa prediction methods.
The "embedded cluster reference interaction site model" (EC-RISM) integral equation theory is applied to the problem of predicting aqueous pK values for drug-like molecules based on an ensemble of tautomers. EC-RISM is based on self-consistent calculations of a solute's electronic structure and the distribution function of surrounding water. Following-up on the workflow developed after the SAMPL5 challenge on cyclohexane-water distribution coefficients we extended and improved the methodology by taking into account exact electrostatic solute-solvent interactions taken from the wave function in solution. As before, the model is calibrated against Gibbs energies of hydration from the "Minnesota Solvation Database" and a public dataset of acidity constants of organic acids and bases by adjusting in total 4 parameters, among which only 3 are relevant for predicting pK values. While the best-performing training model yields a root-mean-square error (RMSE) of 1 pK unit, the corresponding test set prediction on the full SAMPL6 dataset of macroscopic pK values using the same level of theory exhibits slightly larger error (1.7 pK units) than the best test set model submitted (1.7 pK units for corresponding training set vs. test set performance of 1.6). Post-submission analysis revealed a number of physical optimization options regarding the numerical treatment of electrostatic interactions and conformational sampling. While the experimental test set data revealed after submission was not used for reparametrizing the methodology, the best physically optimized models consequentially result in RMSEs of 1.5 if only improved electrostatic interactions are considered and of 1.1 if, in addition, conformational sampling accounts for quantum-chemically derived rankings. We conclude that these numbers are probably near the ultimate accuracy achievable with the simple 3-parameter model using a single or the two best-ranking conformations per tautomer or microstate. Finally, relations of the present macrostate approach to microstate pK results are discussed and some illustrative results for microstate populations are presented.
We predict cyclohexane-water distribution coefficients (log D ) for drug-like molecules taken from the SAMPL5 blind prediction challenge by the "embedded cluster reference interaction site model" (EC-RISM) integral equation theory. This task involves the coupled problem of predicting both partition coefficients (log P) of neutral species between the solvents and aqueous acidity constants (pK) in order to account for a change of protonation states. The first issue is addressed by calibrating an EC-RISM-based model for solvation free energies derived from the "Minnesota Solvation Database" (MNSOL) for both water and cyclohexane utilizing a correction based on the partial molar volume, yielding a root mean square error (RMSE) of 2.4 kcal mol for water and 0.8-0.9 kcal mol for cyclohexane depending on the parametrization. The second one is treated by employing on one hand an empirical pK model (MoKa) and, on the other hand, an EC-RISM-derived regression of published acidity constants (RMSE of 1.5 for a single model covering acids and bases). In total, at most 8 adjustable parameters are necessary (2-3 for each solvent and two for the pK) for training solvation and acidity models. Applying the final models to the log D dataset corresponds to evaluating an independent test set comprising other, composite observables, yielding, for different cyclohexane parametrizations, 2.0-2.1 for the RMSE with the first and 2.2-2.8 with the combined first and second SAMPL5 data set batches. Notably, a pure log P model (assuming neutral species only) performs statistically similarly for these particular compounds. The nature of the approximations and possible perspectives for future developments are discussed.
Results are reported for octanol-water partition coefficients (log P) of the neutral states of drug-like molecules provided during the SAMPL6 (Statistical Assessment of Modeling of Proteins and Ligands) blind prediction challenge from applying the "embedded cluster reference interaction site model" (EC-RISM) as a solvation model for quantum-chemical calculations. Following the strategy outlined during earlier SAMPL challenges we first train 1-and 2-parameter water-free ("dry") and water-saturated ("wet") models for n-octanol solvation Gibbs energies with respect to experimental values from the "Minnesota Solvation Database" (MNSOL), yielding a root mean square error (RMSE) of 1.5 kcal mol −1 for the best-performing 2-parameter wet model, while the optimal water model developed for the pK a part of the SAMPL6 challenge is kept unchanged (RMSE 1.6 kcal mol −1 for neutral compounds from a model trained on both neutral and ionic species). Applying these models to the blind prediction set yields a log P RMSE of less than 0.5 for our best model (2-parameters, wet). Further analysis of our results reveals that a single compound is responsible for most of the error, SM15, without which the RMSE drops to 0.2. Since this is the only compound in the challenge dataset with a hydroxyl group we investigate other alcohols for which Gibbs energy of solvation data for both water and n-octanol are available in the MNSOL database to demonstrate a systematic cause of error and to discuss strategies for improvement.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.