Characterizing uncertainty
in machine learning models has recently
gained interest in the context of model reliability, robustness,
safety, and active learning. Here, we separate the total uncertainty
into contributions from noise in the data (aleatoric) and shortcomings
of the model (epistemic), further dividing epistemic uncertainty into
model bias and model variance contributions.
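For a scalar target y = f(x) + ε with zero-mean noise of variance σ_ε², this separation follows the standard bias-variance decomposition of the expected squared error; the identity below is textbook material, stated in our own notation rather than taken verbatim from this work:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\sigma_\varepsilon^2}_{\text{noise (aleatoric)}}
+ \underbrace{\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{variance}},
$$

where the expectation over the trained model \(\hat{f}\) runs over training sets and initializations; the bias² and variance terms together form the epistemic contribution.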
We systematically address the influence of noise, model bias, and model variance in the context
of chemical property predictions, where the diverse nature of target
properties and the vast chemical space give rise to many
distinct sources of prediction error. We demonstrate that
different sources of error can each be significant in different contexts
and must be individually addressed during model development. Through
controlled experiments on data sets of molecular properties, we show
important trends in model performance associated with the level of
noise in the data set, size of the data set, model architecture, molecule
representation, ensemble size, and data set splitting. In particular,
we show that 1) noise in the test set can limit a model’s observed
performance when the actual performance is much better, 2) using size-extensive
model aggregation structures is crucial for extensive property prediction,
and 3) ensembling is a reliable tool for uncertainty quantification
and improvement, specifically for the contribution of model variance; brief sketches of these three points follow.
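Point 1) reflects the noise floor implied by the decomposition above: even a perfect model cannot achieve an observed test error below σ_ε when the test labels themselves carry noise of that magnitude, so observed performance can understate actual performance.

For point 2), the readout that pools learned per-atom features into a molecule-level prediction should match the physics of the target: extensive properties (e.g., enthalpy of formation) scale with molecule size and call for sum pooling, while intensive properties (e.g., density) do not. A minimal sketch, assuming a hypothetical (n_atoms, n_features) array of learned atom embeddings; this is an illustration, not the paper's architecture:

```python
import numpy as np

def molecule_readout(atom_features, extensive=True):
    """Pool per-atom features into a molecule-level readout.

    Sum pooling is size-extensive: duplicating every atom doubles the
    readout, as an extensive property requires. Mean pooling is invariant
    to molecule size and suits intensive properties. `atom_features` is a
    hypothetical (n_atoms, n_features) array of learned atom embeddings.
    """
    atom_features = np.asarray(atom_features)
    if extensive:
        return atom_features.sum(axis=0)   # size-extensive readout
    return atom_features.mean(axis=0)      # size-invariant readout
```

For point 3), the disagreement among independently trained ensemble members estimates the model-variance contribution, and averaging their predictions reduces it. A minimal sketch, assuming regressors with a scikit-learn-style `.predict` method (an interface assumption, not the paper's implementation):

```python
import numpy as np

def ensemble_predict(models, X):
    """Aggregate an ensemble of independently trained regressors.

    Returns the ensemble mean as the prediction and the across-member
    variance as an estimate of the model-variance (epistemic) term.
    """
    preds = np.stack([m.predict(X) for m in models])  # (n_models, n_samples)
    return preds.mean(axis=0), preds.var(axis=0)
```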
We develop general guidelines for improving an underperforming
model depending on which of these uncertainty contexts it falls into.