Determining the aqueous
solubility of molecules is a vital step
in many pharmaceutical, environmental, and energy storage applications.
Despite decades of effort, developing a solubility prediction model
with satisfactory accuracy remains challenging
for many of these applications. The goals of this study are to assess
current deep learning methods for solubility prediction, to develop a
general model capable of predicting the solubility of a broad range
of organic molecules, and to understand the impact of data properties,
molecular representation, and modeling architecture on predictive
performance. Using the largest currently available solubility data
set, we implement deep learning-based models that predict solubility
from molecular structure. We explore several molecular
representations, including molecular descriptors, simplified molecular-input
line-entry system (SMILES) strings, molecular graphs, and three-dimensional
atomic coordinates, with four neural network architectures: fully
connected neural networks, recurrent neural networks, graph neural
networks (GNNs), and SchNet. We find that models using molecular descriptors
achieve the best performance, with GNN models also performing well.
We carry out an extensive error analysis to identify the
molecular properties that influence model performance, a feature
analysis to determine which information about the molecular structure
is most valuable for prediction, and a transfer learning and
data size study to assess the impact of data availability on model
performance.
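
As an illustration of what a string-based molecular representation can look like as model input, the following is a minimal sketch of turning simplified molecular-input line-entry system (SMILES) strings into fixed-length integer sequences of the kind a recurrent neural network could consume. It is not the preprocessing used in this study: the token pattern, the helper names (tokenize, build_vocab, encode), the padding index, and the three example molecules are all assumptions chosen for illustration.

```python
import re

# Assumed token pattern: bracket atoms, the two-letter halogens Br and Cl,
# then any single character (atoms, bonds, ring-closure digits, branches).
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return SMILES_TOKEN.findall(smiles)


def build_vocab(smiles_list):
    """Map each distinct token to an integer index; 0 is reserved for padding."""
    tokens = sorted({tok for s in smiles_list for tok in tokenize(s)})
    return {tok: i + 1 for i, tok in enumerate(tokens)}


def encode(smiles, vocab, max_len):
    """Encode a SMILES string as a list of token indices, truncated or
    zero-padded to exactly max_len entries."""
    ids = [vocab[tok] for tok in tokenize(smiles)][:max_len]
    return ids + [0] * (max_len - len(ids))


# Hypothetical example molecules: ethanol, chlorobenzene, acetate anion.
examples = ["CCO", "Clc1ccccc1", "CC(=O)[O-]"]
vocab = build_vocab(examples)
encoded = [encode(s, vocab, max_len=12) for s in examples]
```

Treating "Cl" and bracket atoms such as "[O-]" as single tokens keeps chemically meaningful units together instead of splitting them into unrelated characters; a descriptor-based or graph-based model would instead require a cheminformatics toolkit to derive its inputs from the same strings.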