Calibration, along with discrimination, is an important measure of accuracy when validating predictive logistic regression models. Most predictive models in intensive care, such as the Simplified Acute Physiology Score (SAPS) II [1] and SAPS 3 [2,3], consider the binary outcome of whether a patient will be alive or dead at hospital discharge. Discrimination measures how well the model can distinguish between patients who die and those who survive. It is usually assessed by the area under the receiver operating characteristic curve (AU-ROC) [4]. This statistic evaluates each pair of observations with different outcomes and calculates the proportion of pairs in which the patient who died had a higher predicted mortality than the survivor. The AU-ROC ranges from 0.50 (no discrimination: purely random classification, equivalent to flipping a coin) to 1.00 (perfect discrimination) [4].

Calibration measures the model's ability to generate predictions that are, on average, close to the average observed outcome. Calibration has traditionally been approached in two steps. The first is to investigate the overall ability to correctly relate the actual occurrence of the event to its estimated probability using statistical methods. The most widely used method is the Hosmer-Lemeshow (H-L) test [5], which examines how well the percentage of observed deaths matches the percentage of predicted deaths over deciles of predicted risk. A p value greater than 0.05 is needed to conclude that there are no significant differences between observed and expected outcomes and therefore that the model has good overall calibration. The second is to localize possible deviations across risk strata by means of calibration plots of observed outcomes versus expected probabilities of mortality. The calibration plot, also called the calibration curve, is intended to provide complementary information over subsets.
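The two measures described above can be sketched in a few lines of code. The following is a minimal illustration, assuming a simulated cohort (the data, sample size, and function names are hypothetical and not taken from the study discussed): the AU-ROC is computed directly as the proportion of death/survivor pairs in which the patient who died received the higher predicted mortality, and the H-L test compares observed with expected deaths over deciles of predicted risk.

```python
import numpy as np
from scipy.stats import chi2

# Illustrative simulated cohort: predicted hospital-mortality
# probabilities and outcomes drawn from those same probabilities,
# i.e. a perfectly calibrated model by construction.
rng = np.random.default_rng(0)
n = 2000
p_pred = rng.uniform(0.01, 0.99, n)
died = rng.binomial(1, p_pred)

def auroc_pairwise(p, y):
    """AU-ROC as the proportion of death/survivor pairs in which the
    patient who died had the higher predicted mortality (ties count
    as half a concordance)."""
    p_d = p[y == 1]          # predictions for patients who died
    p_s = p[y == 0]          # predictions for survivors
    greater = (p_d[:, None] > p_s[None, :]).sum()
    ties = (p_d[:, None] == p_s[None, :]).sum()
    return (greater + 0.5 * ties) / (len(p_d) * len(p_s))

def hosmer_lemeshow(p, y, groups=10):
    """H-L statistic over deciles of predicted risk: sums
    (observed - expected)^2 / variance per decile; a p value above
    0.05 suggests no significant lack of fit."""
    order = np.argsort(p)
    stat = 0.0
    for idx in np.array_split(order, groups):
        obs = y[idx].sum()          # observed deaths in this decile
        exp = p[idx].sum()          # expected deaths in this decile
        n_g = len(idx)
        # n_g * pbar * (1 - pbar), with pbar = exp / n_g
        stat += (obs - exp) ** 2 / (exp * (1 - exp / n_g))
    p_value = chi2.sf(stat, groups - 2)
    return stat, p_value
```

With a well-calibrated simulated model such as this one, the AU-ROC reflects only the spread of the predicted risks (here roughly 0.83), while the H-L p value will typically not indicate lack of fit; its known sensitivity to sample size becomes apparent when `n` is made very large.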
If the model calibrates well, there will be no substantial deviation from the 45° line of perfect fit (the bisector). If, on the contrary, the model is miscalibrated, the deviation will be a function of the expected probability.

The H-L test is easy to compute and its interpretation is intuitive, but it has acknowledged limitations, such as being very sensitive to sample size [6-8]. The traditional plot or calibration curve also has some disadvantages: first, rather than a curve, it is a jagged line connecting the points in the plot; second, it is not accompanied by any information on the statistical significance of deviations from the bisector [9].

Methods to measure the calibration of predictive models have remained unchanged for a long time, but new insights are now available. In this issue of Intensive Care Medicine, Poole et al. [10] present a multicenter study aimed at comparing the performance of SAPS II and SAPS 3 in predicting hospital mortality. Discrimination was measured as usual using ROC curves, and both models showed fair discrimination. Interestingly, the calibration of the systems was measured using a new approach named the calibration belt [9]. Both syste...