More Than Accuracy: An Empirical Study of Consistency Between Performance and Interpretability

Du, Yun; Liang, Dong; Ma, Rong Quan; Du, Songlin; Yan, Yaping

doi:10.1007/978-3-031-20868-3_43

Cited by 3 publications

(1 citation statement)

References 17 publications


“…Despite the impact of this quality dimension on model performance and actual reliability, still, relatively low attention has been paid to its comprehensive assessment, which is usually accomplished by means of a number of alternative metrics (chiefly among them, the Brier score [6] and the Expected Calibration Error (ECE) [16]), that, despite their popularity, present several shortcomings [19,21]. These mainly concern their interpretability [11,23] (in terms of nonlinear scales or measurand factors, as for the Brier score), consistency [19,21] (undermining comparisons and benchmarking) and comprehensiveness [3] (when they do not account for local calibration, that is for levels of calibration in the surroundings of relevant portions of the probability space or bins).…”

Section: Introduction (mentioning)

confidence: 99%

Towards a Rigorous Calibration Assessment Framework: Advancements in Metrics, Methods, and Use

Famiglini, Campagner, Cabitza

2023

Frontiers in Artificial Intelligence and Applications


Calibration is paramount in developing and validating Machine Learning models, particularly in sensitive domains such as medicine. Despite its significance, existing metrics to assess calibration have been found to have shortcomings in regard to their interpretation and theoretical properties. This article introduces a novel and comprehensive framework to assess the calibration of Machine and Deep Learning models that addresses the above limitations. The proposed framework is based on a modification of the Expected Calibration Error (ECE), called the Estimated Calibration Index (ECI), which grounds on and extends prior research. ECI was initially formulated for binary settings, and we adapted it to fit multiclass settings. ECI offers a more nuanced, both locally and globally, and informative measure of a model’s tendency towards over/underconfidence. The paper first outlines the issues related to the prevalent definitions of ECE, including potential biases that may arise in the evaluation of their measures. Then, we present the results of a series of experiments conducted to demonstrate the effectiveness of the proposed framework in supporting a more accurate understanding of a model’s calibration level. Additionally, we discuss how to address and potentially mitigate some biases in calibration assessment.
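For readers unfamiliar with the two metrics named in the citation statement, the following is a minimal sketch of the Brier score and of one common equal-width-binned variant of the Expected Calibration Error for binary predictions. It is not the ECI framework described in the abstract; the function names, the choice of 10 bins, and the binning on the positive-class probability are assumptions made for illustration only.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Binned ECE for binary predictions (one common variant, assumed here).

    Predictions are grouped into n_bins equal-width bins over [0, 1] by their
    predicted positive-class probability. Each bin contributes the absolute gap
    between its mean predicted probability and its observed positive rate,
    weighted by the share of samples falling in that bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Clamp so that p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_prob = sum(p for p, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_prob - positive_rate)
    return ece


if __name__ == "__main__":
    # Hypothetical toy data, not taken from either paper.
    example_probs = [0.9, 0.9, 0.1, 0.1]
    example_labels = [1, 0, 0, 0]
    print("Brier score:", brier_score(example_probs, example_labels))
    print("ECE:", expected_calibration_error(example_probs, example_labels))
```

Because the result depends on the number of bins and on how bin boundaries are drawn, different binning choices can give different ECE values for the same model, which is one reason the consistency of such metrics is questioned.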
