Lipophilicity, as measured by the partition coefficient between
octanol and water (log P), is a key parameter in early drug discovery
research. However, measuring log P experimentally is difficult for
specific compounds and log P ranges. The resulting lack of reliable
experimental data impedes development of accurate in silico models
for such compounds.
In certain discovery projects at Novartis focused on such compounds,
a quantum mechanics (QM)-based tool for log P estimation
has emerged as a valuable supplement to experimental measurements
and as a preferred alternative to existing empirical models. However,
this QM-based approach incurs a substantial computational cost, limiting
its applicability to small series and prohibiting quick, interactive
ideation. This work explores a set of machine learning models (Random
Forest, Lasso, XGBoost, Chemprop, and Chemprop3D) trained on calculated
log P values from both a public data set and an in-house data set,
with the aim of obtaining a computationally affordable, QM-based
estimate of drug lipophilicity. The message-passing neural network
model Chemprop emerged as the best-performing model, with mean absolute errors of
0.44 and 0.34 log units for scaffold split test sets of the public
and in-house data sets, respectively. Analysis of learning curves
suggests that a further decrease in the test set error can be achieved
by increasing the training set size. While models directly trained
on experimental data perform better at approximating experimentally
determined log P values than models trained on calculated values,
we discuss the potential advantages of calculated log P values that
extend beyond the limits of experimental quantitation.
We analyze the impact of the data set splitting strategy and gain
insights into model failure modes. Potential use cases for the presented
models include pre-screening of large compound collections and prioritization
of compounds for full QM calculations.