The acid dissociation constant, commonly expressed as pKa, is an important molecular property, and it can be successfully predicted by Quantitative Structure-Property Relationship (QSPR) models, even for molecules designed in silico. We analyzed how the methodology of in silico 3D structure preparation influences the quality of QSPR models. Specifically, we evaluated and compared QSPR models based on six different 3D structure sources (DTP NCI, PubChem, Balloon, Frog2, OpenBabel and RDKit) combined with four different types of optimization. These analyses were performed for three classes of molecules (phenols, carboxylic acids and anilines), with quantum mechanical (QM) and empirical partial atomic charges as the QSPR model descriptors. In total, we developed 516 QSPR models and then systematically analyzed the influence of the 3D structure source and other factors on their quality.
Our results confirmed that QSPR models based on partial atomic charges are able to predict pKa with high accuracy. We also confirmed that ab initio and semiempirical QM charges provide very accurate QSPR models, and that empirical charges based on electronegativity equalization are also acceptable and even advantageous, since their calculation is very fast. On the other hand, Gasteiger-Marsili empirical charges are not applicable for pKa prediction. In addition, we found that QSPR models for some classes of molecules (carboxylic acids) are less accurate. In this context, we compared the influence of different 3D structure sources. We found that an appropriate selection of 3D structure source and optimization method is essential for the successful QSPR modeling of pKa. Specifically, the 3D structures from the DTP NCI and PubChem databases performed the best: they provided very accurate QSPR models for all the tested molecular classes and charge calculation approaches, and they do not require optimization. Frog2 also performed very well. Other 3D structure sources can also be used, but they are less robust, and an unfortunate combination of molecular class and charge calculation approach can produce weak QSPR models. Additionally, these 3D structures generally need optimization in order to produce good-quality QSPR models.
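
As a rough illustration of what a charge-based QSPR model can look like, the sketch below fits pKa as a linear function of a few partial atomic charges by ordinary least squares and reports the coefficient of determination. The linear form, the three descriptor names (`q_H`, `q_O`, `q_C1`), the helper functions, and all numbers are assumptions made for this example; the charges and pKa values are synthetic and generated from made-up coefficients, so they say nothing about real molecules or about the models evaluated in this work.

```python
from typing import List, Sequence


def solve_linear_system(a: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [a | b] as floats.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("descriptor matrix is singular or collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def fit_qspr(descriptors: Sequence[Sequence[float]], pka: Sequence[float]) -> List[float]:
    """Fit pKa = c0 + sum(c_i * q_i) via the normal equations (X^T X) c = X^T y."""
    x = [[1.0] + list(row) for row in descriptors]  # prepend intercept column
    p = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(x, pka)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict(coeffs: Sequence[float], row: Sequence[float]) -> float:
    """Evaluate the fitted linear model for one molecule's descriptors."""
    return coeffs[0] + sum(c * q for c, q in zip(coeffs[1:], row))


def r_squared(coeffs: Sequence[float], descriptors: Sequence[Sequence[float]],
              pka: Sequence[float]) -> float:
    """Coefficient of determination of the model on the given data."""
    mean = sum(pka) / len(pka)
    ss_res = sum((y - predict(coeffs, r)) ** 2 for r, y in zip(descriptors, pka))
    ss_tot = sum((y - mean) ** 2 for y in pka)
    return 1.0 - ss_res / ss_tot


# SYNTHETIC descriptors (q_H, q_O, q_C1) for six hypothetical molecules.
CHARGES = [
    (0.30, -0.70, 0.10),
    (0.45, -0.50, 0.20),
    (0.35, -0.65, 0.40),
    (0.50, -0.40, 0.05),
    (0.25, -0.55, 0.30),
    (0.40, -0.75, 0.25),
]
# Made-up "true" coefficients (intercept, then one per charge), used only
# to generate noise-free synthetic pKa values for the demonstration.
TRUE_COEFFS = [2.0, 10.0, -5.0, 3.0]
PKA = [predict(TRUE_COEFFS, row) for row in CHARGES]

coeffs = fit_qspr(CHARGES, PKA)
r2 = r_squared(coeffs, CHARGES, PKA)
print("fitted coefficients:", [round(c, 4) for c in coeffs])
print("R^2:", round(r2, 6))
```

Because the synthetic pKa values are noise-free, the fit simply recovers the made-up coefficients; with real charges computed from different 3D structure sources, the same quality measure would be the natural way to compare the resulting models.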