Active learning (AL)
has become a powerful tool in computational
drug discovery, enabling the identification of top binders from vast
molecular libraries. To design a robust AL protocol, it is important
to understand how AL parameters, as well as features of the data sets,
influence the outcomes. We use four affinity data sets for
different targets (TYK2, USP7, D2R, Mpro) to systematically evaluate
the performance of machine learning models [Gaussian process (GP)
model and Chemprop model], sample selection protocols, and the batch
size based on metrics describing the overall predictive power of the
model (R², Spearman rank correlation, root-mean-square error) as well as the accurate
identification of top 2%/5% binders (Recall, F1 score). Both models
have a comparable Recall of top binders on large data sets, but the
GP model surpasses the Chemprop model when training data are sparse.
A larger initial batch size, especially on diverse data sets, improved
the Recall of both models as well as the overall correlation metrics.
However, for subsequent cycles, smaller batch sizes of 20 or 30 compounds
proved to be desirable. Furthermore, adding artificial Gaussian noise
to the data up to a certain threshold still allowed the model to identify
clusters with top-scoring compounds. However, excessive noise (>1σ)
did impact the model’s predictive and exploitative capabilities.
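The evaluation metrics above can be computed directly from predicted versus experimental affinities. The sketch below is not the authors' code; it assumes that lower affinity values indicate stronger binders and uses standard scikit-learn/SciPy routines for R², Spearman rank correlation, RMSE, and the Recall/F1 of the top 2% of compounds.

```python
# Minimal sketch of the per-cycle evaluation metrics (illustrative, not the paper's code).
# Assumption: lower affinity values = stronger binders.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import r2_score, mean_squared_error, recall_score, f1_score


def evaluate(y_true, y_pred, top_fraction=0.02):
    """Return overall correlation metrics plus Recall/F1 for the top binders."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)

    # Overall predictive power of the surrogate model.
    r2 = r2_score(y_true, y_pred)
    rho, _ = spearmanr(y_true, y_pred)
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))

    # Binary labels: is a compound in the top 2% (strongest binders)
    # according to experiment / according to the model?
    cutoff = max(1, int(round(top_fraction * len(y_true))))
    true_top = np.zeros(len(y_true), dtype=int)
    pred_top = np.zeros(len(y_true), dtype=int)
    true_top[np.argsort(y_true)[:cutoff]] = 1
    pred_top[np.argsort(y_pred)[:cutoff]] = 1

    return {
        "R2": r2,
        "Spearman": rho,
        "RMSE": rmse,
        "Recall": recall_score(true_top, pred_top),
        "F1": f1_score(true_top, pred_top),
    }
```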
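For context, a single greedy AL cycle with a GP surrogate might look as follows. This is a sketch under stated assumptions, not the benchmarked protocol: the RBF kernel, the precomputed molecular features, and the batch size of 20 are illustrative choices.

```python
# Minimal sketch of one greedy AL acquisition step with a GP surrogate
# (illustrative assumptions: RBF kernel, fingerprint features X, batch size 20).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel


def greedy_al_cycle(X_labelled, y_labelled, X_pool, batch_size=20):
    """Return pool indices of the next batch to label (lowest predicted affinity = best)."""
    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
    gp.fit(X_labelled, y_labelled)
    mean = gp.predict(X_pool)
    # Greedy exploitation: acquire the compounds with the most favourable predictions.
    return np.argsort(mean)[:batch_size]
```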
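The noise experiment can likewise be sketched by perturbing the training affinities with zero-mean Gaussian noise scaled by the label standard deviation (σ); the specific noise levels and the helper name below are assumptions for illustration.

```python
# Minimal sketch of the artificial-noise experiment (illustrative, not the paper's code).
import numpy as np


def add_label_noise(y_train, noise_level, rng=None):
    """Perturb affinities with zero-mean Gaussian noise of width noise_level * sigma."""
    rng = np.random.default_rng(rng)
    sigma = np.std(y_train)
    return y_train + rng.normal(0.0, noise_level * sigma, size=len(y_train))


# Example: screen noise levels around the 1-sigma threshold discussed above,
# retraining the surrogate on the noisy labels before each AL cycle.
# for level in (0.25, 0.5, 1.0, 2.0):
#     y_noisy = add_label_noise(y_train, level, rng=0)
#     ...
```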