In the context of bioactivity prediction, the question of how to
calibrate a score produced by a machine learning method into a probability
of binding to a protein target has not yet been satisfactorily addressed.
In this study, we compared the performance of three calibration methods,
namely, Platt scaling (PS), isotonic regression (IR), and Venn–ABERS
predictors (VA), in calibrating prediction scores obtained from ligand–target
prediction models built with the Naïve Bayes, support vector machine,
and random forest (RF) algorithms. Calibration quality was assessed
on bioactivity data available at AstraZeneca, comprising 40 million data points
(compound–target pairs) across 2112 targets, and performance
was evaluated using stratified shuffle split (SSS) and leave-20%-of-scaffolds-out
(L20SO) validation. VA achieved the best calibration
performance across all machine learning algorithms and cross-validation
methods tested, as well as the lowest (best) Brier score loss (the mean squared
difference between the probability estimate assigned to
a compound and the actual outcome). In comparison, the PS and IR methods
can actually degrade the assigned probability estimates, particularly
for RF with SSS and during L20SO. Sphere exclusion, a method to
sample additional (putative) inactive compounds, was shown to inflate
the overall Brier score loss performance through the artificial requirement
that inactive molecules be dissimilar to active compounds, but resulted
in overconfident estimators. VA successfully calibrated the probability
estimates even for small calibration sets. The multiprobability values
(lower and upper probability boundary intervals) were shown to produce
large discordance for test set molecules that are neither very similar
nor very dissimilar to the active training set, which were hence difficult
to predict, suggesting that multiprobability discordance can be used
as an estimate of target prediction uncertainty. Overall, we
show in this work that VA scaling of target prediction models
improves probability estimates in all testing instances;
the method is currently being applied in in-house approaches.
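
To make the quantities named above concrete, the following is a minimal Python sketch of the Brier score loss and of an inductive Venn–ABERS calibrator that returns the lower and upper probabilities (the multiprobability pair) for a single test score. It is an illustration under stated assumptions, not the implementation used in this study: the function names (brier_score, isotonic_fit, venn_abers, merge_probability), the toy calibration data, and the choice to refit isotonic regression from scratch for each test score are all hypothetical and chosen for readability rather than efficiency. The merged probability uses the commonly cited formula p = p1 / (1 - p0 + p1).

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def isotonic_fit(scores, labels):
    """Isotonic regression via pool-adjacent-violators (PAVA).

    Returns a dict mapping each unique score to its non-decreasing fitted value.
    Tied scores are pooled before fitting.
    """
    totals = {}
    for x, y in zip(scores, labels):
        s, n = totals.get(x, (0.0, 0))
        totals[x] = (s + y, n + 1)
    xs = sorted(totals)
    blocks = []  # each block: [label sum, sample count, number of unique scores]
    for x in xs:
        s, n = totals[x]
        blocks.append([s, n, 1])
        # Merge backwards while the previous block mean exceeds the current one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s2, n2, k2 = blocks.pop()
            blocks[-1][0] += s2
            blocks[-1][1] += n2
            blocks[-1][2] += k2
    fitted = {}
    i = 0
    for s, n, k in blocks:
        for x in xs[i:i + k]:
            fitted[x] = s / n
        i += k
    return fitted


def venn_abers(cal_scores, cal_labels, test_score):
    """Return (p0, p1): the calibrated probability of the test score when it is
    added to the calibration set with assumed label 0 and with assumed label 1."""
    xs = list(cal_scores) + [test_score]
    p0 = isotonic_fit(xs, list(cal_labels) + [0])[test_score]
    p1 = isotonic_fit(xs, list(cal_labels) + [1])[test_score]
    return p0, p1


def merge_probability(p0, p1):
    """Collapse the multiprobability pair into a single probability estimate."""
    return p1 / (1.0 - p0 + p1)


# Toy calibration set (hypothetical scores and activity labels).
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
cal_labels = [0, 0, 1, 0, 1, 0, 1, 1]

p0, p1 = venn_abers(cal_scores, cal_labels, 0.5)
discordance = p1 - p0          # width of the multiprobability interval
merged = merge_probability(p0, p1)
print(p0, p1, discordance, merged)
print(brier_score([merged], [1]))
```

In this sketch, a wide gap between p0 and p1 plays the role of the multiprobability discordance discussed above: the calibration set gives little guidance at that score, so the prediction should be treated as uncertain.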