Many challenges persist in developing accurate computational models
for predicting solvation free energy (ΔGsol). Despite recent developments in Machine Learning (ML)
methodologies that have outperformed traditional quantum mechanical models,
several issues remain concerning explanatory insights for broad chemical
predictions with an acceptable speed–accuracy trade-off. To
overcome this, we present a novel supervised ML model to predict
ΔGsol for an array of solvent–solute pairs. Using two different
ensemble regressor algorithms, we made fast and accurate property
predictions from open-source chemical features that encode complex
electronic, structural, and surface area descriptors for every
solvent and solute. By integrating molecular
properties and chemical interaction features, we have analyzed individual
descriptor importance and optimized our model through explanatory information
from feature groups. On aqueous and organic solvent databases, ML
models revealed the predictive relevance of solutes with increasing
polar surface area and decreasing polarizability, yielding better
results than state-of-the-art benchmark Neural Network methods (without
complex quantum mechanical or molecular dynamics simulations). Both
algorithms successfully outperformed previous ΔGsol prediction methods,
with a maximum absolute error of 0.22 ± 0.02 kcal mol⁻¹, further validated
in an external benchmark database and with solvent hold-out tests.
These explanatory and statistical insights allow a thoughtful
application of this method for predicting other thermodynamic properties,
stressing the relevance of ML modeling for further complex computational
chemistry problems.