Are Traditional Neural Networks Well-Calibrated?

Johansson, Ulf; Gabrielsson, Patrick

doi:10.1109/ijcnn.2019.8851962

Cited by 4 publications

(3 citation statements)

References 6 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Some applications scale the output 𝑓 (𝑥) to produce a discrete probability distribution, thus interpreting the values as "confidence" scores. However, these are difficult to calibrate [Guo et al 2017;Johansson and Gabrielsson 2019], and may be unreliable in practice. Safe ordering properties thus capture the relevant range of behaviors needed for non-relational safety, and we aim to construct neural networks that are provably safe according to such properties.…”

Section: Introductionmentioning

confidence: 99%

Self-Repairing Neural Networks: Provable Safety for Deep Networks via Dynamic Repair

Leino¹,

Fromherz²,

Mangal³

et al. 2021

Preprint

View full text Add to dashboard Cite

University and NASA AmesNeural networks are increasingly being deployed in contexts where safety is a critical concern. In this work, we propose a way to construct neural network classifiers that dynamically repair violations of non-relational safety constraints called safe ordering properties. Safe ordering properties relate requirements on the ordering of a network's output indices to conditions on their input, and are sufficient to express most useful notions of non-relational safety for classifiers. Our approach is based on a novel self-repairing layer, which provably yields safe outputs regardless of the characteristics of its input. We compose this layer with an existing network to construct a self-repairing network (SR-Net), and show that in addition to providing safe outputs, the SR-Net is guaranteed to preserve the accuracy of the original network. Notably, our approach is independent of the size and architecture of the network being repaired, depending only on the specified property and the dimension of the network's output; thus it is scalable to large state-of-the-art networks. We show that our approach can be implemented using vectorized computations that execute efficiently on a GPU, introducing run-time overhead of less than one millisecond on current hardware-even on large, widely-used networks containing hundreds of thousands of neurons and millions of parameters.

show abstract

Section: Introductionmentioning

confidence: 99%

Self-Repairing Neural Networks: Provable Safety for Deep Networks via Dynamic Repair

Leino¹,

Fromherz²,

Mangal³

et al. 2021

Preprint

View full text Add to dashboard Cite

show abstract

“…40,41 Deep neural networks are also more often poorly calibrated compared to a decade ago because of overfitting from increased application of depth, width, weight decay, and application of batch normalization techniques, which manifest in probabilistic error rather than classification error. 42,43 The posterior probabilities from the previous methods are hence often poor estimates of the actual likelihood of a positive bioactivity prediction if used directly in this way. 44,45 Despite this, the assessment of calibration performance receives little attention.…”

Section: ■ Introductionmentioning

confidence: 99%

“…Machine learning (algorithmic) behavior is a further factor that influences both the output probability range and distribution of the “raw” probability estimates generated. For example, support vector machines (SVMs) provide no direct support for probability estimates associated with every output prediction, and consequently require additional work to convert the decision function into interpretable probability estimates. , Naïve Bayes (NB) generates posterior probabilities populating extreme regions of the probability scale (very high or low values) because of repeated multiplications over conditional feature probabilities. , Conversely, random forests (RFs) bias predictions toward the midpoint when the predicted fraction of classes across the underlying trees is employed as probability estimates, and extreme values can only be achieved when an exceptionally high proportion of trees predicts either label. , Deep neural networks are also more often poorly calibrated compared to a decade ago because of overfitting from increased application of depth, width, weight decay, and application of batch normalization techniques, which manifest in probabilistic error rather than classification error. , …”

Section: Introductionmentioning

confidence: 99%

Comparison of Scaling Methods to Obtain Calibrated Probabilities of Activity for Protein–Ligand Predictions

Mervin

Afzal

Engkvist

et al. 2020

J. Chem. Inf. Model.

View full text Add to dashboard Cite

In the context of bioactivity prediction, the question of how to calibrate a score produced by a machine learning method into a probability of binding to a protein target is not yet satisfactorily addressed. In this study, we compared the performance of three such methods, namely, Platt scaling (PS), isotonic regression (IR), and Venn–ABERS predictors (VA), in calibrating prediction scores obtained from ligand–target prediction comprising the Naïve Bayes, support vector machines, and random forest (RF) algorithms. Calibration quality was assessed on bioactivity data available at AstraZeneca for 40 million data points (compound–target pairs) across 2112 targets and performance was assessed using stratified shuffle split (SSS) and leave 20% of scaffolds out (L20SO) validation. VA achieved the best calibration performances across all machine learning algorithms and cross validation methods tested and also the lowest (best) Brier score loss (mean squared difference between the outputted probability estimates assigned to a compound and the actual outcome). In comparison, the PS and IR methods can actually degrade the assigned probability estimates, particularly for the RF for SSS and during L20SO. Sphere exclusion, a method to sample additional (putative) inactive compounds, was shown to inflate the overall Brier score loss performance, through the artificial requirement for inactive molecules to be dissimilar to active compounds, but was shown to result in overconfident estimators. VA was able to successfully calibrate the probability estimates for even small calibration sets. The multiprobability values (lower and upper probability boundary intervals) were shown to produce large discordance for test set molecules that are neither very similar nor very dissimilar to the active training set, which were hence difficult to predict, suggesting that multiprobability discordance can be used as an estimate for target prediction uncertainty. Overall, we were able to show in this work that VA scaling of target prediction models is able to improve probability estimates in all testing instances and is currently being applied for in-house approaches.

show abstract

Conformal Prediction for Accuracy Guarantees in Classification with Reject Option

Johansson

Löfström

Sönströd

et al. 2023

Lecture Notes in Computer Science

View full text Add to dashboard Cite

Are Traditional Neural Networks Well-Calibrated?

Cited by 4 publications

References 6 publications

Self-Repairing Neural Networks: Provable Safety for Deep Networks via Dynamic Repair

Self-Repairing Neural Networks: Provable Safety for Deep Networks via Dynamic Repair

Comparison of Scaling Methods to Obtain Calibrated Probabilities of Activity for Protein–Ligand Predictions

Conformal Prediction for Accuracy Guarantees in Classification with Reject Option

Contact Info

Product

Resources

About