2020
DOI: 10.1007/978-3-030-44584-3_36

Master Your Metrics with Calibration

Abstract: Machine learning models deployed in real-world applications are often evaluated with precision-based metrics such as F1-score or AUC-PR (Area Under the Curve of Precision Recall). Heavily dependent on the class prior, such metrics make it difficult to interpret the variation of a model's performance over different subpopulations/subperiods in a dataset. In this paper, we propose a way to calibrate the metrics so that they can be made invariant to the prior. We conduct a large number of experiments on balanced …
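The calibration idea in the abstract can be illustrated as follows. Precision can be rewritten in terms of the true-positive rate (TPR), false-positive rate (FPR) and the positive-class prior π, as precision = π·TPR / (π·TPR + (1−π)·FPR); substituting a fixed reference prior π₀ for π gives a precision that no longer depends on a subset's prior. The sketch below is illustrative only: the function name `calibrated_precision`, the reference prior of 0.5 and the synthetic subsets are assumptions for demonstration, not taken from the paper.

```python
# Illustrative sketch of a prior-calibrated precision (not necessarily the
# paper's exact formulation): precision = pi*TPR / (pi*TPR + (1-pi)*FPR),
# evaluated at a fixed reference prior pi0 instead of the subset's prior.
import numpy as np


def calibrated_precision(y_true, y_pred, pi0):
    """Precision re-expressed at a reference positive-class prior pi0."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)

    tp = np.sum(y_pred & y_true)      # true positives
    fp = np.sum(y_pred & ~y_true)     # false positives
    pos = y_true.sum()
    neg = (~y_true).sum()

    tpr = tp / pos if pos else 0.0    # recall on this subset
    fpr = fp / neg if neg else 0.0

    denom = pi0 * tpr + (1.0 - pi0) * fpr
    return pi0 * tpr / denom if denom else 0.0


def make_subset(n_pos, n_neg, tpr=0.8, fpr=0.1):
    """Synthetic subset with fixed classifier behaviour (TPR, FPR)."""
    y_true = np.array([True] * n_pos + [False] * n_neg)
    y_pred = np.concatenate([
        np.arange(n_pos) < tpr * n_pos,   # hits among positives
        np.arange(n_neg) < fpr * n_neg,   # false alarms among negatives
    ])
    return y_true, y_pred


# Same classifier behaviour, different priors (10% vs. ~1% positives):
# raw precision differs, calibrated precision stays the same.
for n_pos, n_neg in [(100, 900), (100, 9900)]:
    y_true, y_pred = make_subset(n_pos, n_neg)
    raw = np.sum(y_pred & y_true) / np.sum(y_pred)
    print(f"raw = {raw:.3f}, calibrated = "
          f"{calibrated_precision(y_true, y_pred, pi0=0.5):.3f}")
```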

Cited by 28 publications (18 citation statements)
References 13 publications
“…The PR curve is a plot of Recall (x-axis) vs. Precision (y-axis), and PR_AUC was calculated as reported previously [54]. This study used N = 3 to reduce the bias, and the values are represented as averages.…”
Section: Methods (mentioning)
confidence: 99%
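The recipe quoted above (Recall on the x-axis, Precision on the y-axis, area under the resulting curve) corresponds to the usual AUC-PR computation. Below is a minimal sketch with scikit-learn; the labels and scores are purely illustrative, and the cited study's exact procedure (its reference [54]) is not reproduced here.

```python
# Minimal sketch of computing AUC-PR from continuous scores with scikit-learn.
from sklearn.metrics import auc, precision_recall_curve

y_true = [0, 0, 1, 0, 1, 1, 0, 1]                            # illustrative labels
scores = [0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.20, 0.90]    # model scores

precision, recall, _ = precision_recall_curve(y_true, scores)
pr_auc = auc(recall, precision)   # Recall on x, Precision on y
print(f"AUC-PR = {pr_auc:.3f}")
```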
“…The class priors varied among the folds' test splits and differed from those of the full dataset (all 10 subsets considered together). The effects of these variations on perceived performance were suppressed by calibrating [56] the assessments to correspond to the class prior of the full dataset. Calibrated results corresponding to the 10 folds for each model type were aggregated to facilitate meaningful comparisons between the performance assessments of the different types of models.…”
Section: Methods (mentioning)
confidence: 99%
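One possible reading of the procedure quoted above, sketched under the assumption that "calibrating the assessments" means re-expressing each fold's precision at the full dataset's class prior, using the hypothetical `calibrated_precision` helper from the earlier sketch:

```python
# Sketch: evaluate each fold's test split at the full-dataset prior and
# average across folds. `folds` is an iterable of (y_true, y_pred) pairs;
# `calibrated_precision` is the illustrative helper defined earlier.
import numpy as np

def mean_calibrated_precision(folds, full_dataset_prior):
    scores = [calibrated_precision(y_true, y_pred, pi0=full_dataset_prior)
              for y_true, y_pred in folds]
    return float(np.mean(scores))
```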
“…Calibration refers to the comparison between predicted and observed results, whilst discrimination represents the degree of distinguishing those at higher risk of having an event from those at lower risk (Alba et al. 2017). Calibration usually comprises accuracy, precision, R² or F1 score (Siblini et al. 2020). Discrimination risk could be assessed by the ROC and the AUC (Moons et al. 2014).…”
Section: Records Excluded By Exclusion Criteria: N = 286 (mentioning)
confidence: 99%