Net benefit, calibration, threshold selection, and training objectives for algorithmic fairness in healthcare

Pfohl, Stephen R.; Xu, Yizhe; Foryciarz, Agata; Ignatiadis, Nikolaos; Genkins, Julian Z.; Shah, Nigam H.

doi:10.48550/arxiv.2202.01906

Cited by 2 publications

(2 citation statements)

References 47 publications


“…For example, model explainability remains a highly controversial topic among clinical and AI experts, with no universally accepted method for providing robust explanations for individual-level predictions [33]. Similarly, there is no clear consensus on the best strategy to incorporate algorithmic fairness considerations [34, 35]; therefore, APPRAISE-AI does not assign scores to any particular approach. Instead, the emphasis is placed on conducting bias assessments (item 17) so that researchers can examine the efficacy of their fairness strategies, regardless of the approach used.…”

Section: Discussion (mentioning)

confidence: 99%

APPRAISE-AI Tool for Quantitative Evaluation of AI Studies for Clinical Decision Support

Kwong, Khondker, Lajkosz et al. 2023

JAMA Netw Open


Importance: Artificial intelligence (AI) has gained considerable attention in health care, yet concerns have been raised around appropriate methods and fairness. Current AI reporting guidelines do not provide a means of quantifying overall quality of AI research, limiting their ability to compare models addressing the same clinical question.

Objective: To develop a tool (APPRAISE-AI) to evaluate the methodological and reporting quality of AI prediction models for clinical decision support.

Design, Setting, and Participants: This quality improvement study evaluated AI studies in the model development, silent, and clinical trial phases using the APPRAISE-AI tool, a quantitative method for evaluating quality of AI studies across 6 domains: clinical relevance, data quality, methodological conduct, robustness of results, reporting quality, and reproducibility. These domains included 24 items with a maximum overall score of 100 points. Points were assigned to each item, with higher points indicating stronger methodological or reporting quality. The tool was applied to a systematic review on machine learning to estimate sepsis that included articles published until September 13, 2019. Data analysis was performed from September to December 2022.

Main Outcomes and Measures: The primary outcomes were interrater and intrarater reliability and the correlation between APPRAISE-AI scores and expert scores, 3-year citation rate, number of Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) low risk-of-bias domains, and overall adherence to the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) statement.

Results: A total of 28 studies were included. Overall APPRAISE-AI scores ranged from 33 (low quality) to 67 (high quality). Most studies were moderate quality. The 5 lowest scoring items included source of data, sample size calculation, bias assessment, error analysis, and transparency. Overall APPRAISE-AI scores were associated with expert scores (Spearman ρ, 0.82; 95% CI, 0.64-0.91; P < .001), 3-year citation rate (Spearman ρ, 0.69; 95% CI, 0.43-0.85; P < .001), number of QUADAS-2 low risk-of-bias domains (Spearman ρ, 0.56; 95% CI, 0.24-0.77; P = .002), and adherence to the TRIPOD statement (Spearman ρ, 0.87; 95% CI, 0.73-0.94; P < .001). Intraclass correlation coefficient ranges for interrater and intrarater reliability were 0.74 to 1.00 for individual items, 0.81 to 0.99 for individual domains, and 0.91 to 0.98 for overall scores.

Conclusions and Relevance: In this quality improvement study, APPRAISE-AI demonstrated strong interrater and intrarater reliability and correlated well with several study quality measures. This tool may provide a quantitative approach for investigators, reviewers, editors, and funding organizations to compare the research quality across AI studies for clinical decision support.



“…The effect of discrimination on the Net Benefit was explored in [Van Calster et al, 2013, Vickers and Elkin, 2006]. Recently, [Pfohl et al, 2022b, Pfohl et al, 2022a] consider the impact of different fairness interventions on clinical utility.…”

Section: Further Related Work (mentioning)

confidence: 99%

Decision-Making under Miscalibration

Rothblum, Yona 2022

Preprint


ML-based predictions are used to inform consequential decisions about individuals. How should we use predictions (e.g., risk of heart attack) to inform downstream binary classification decisions (e.g., undergoing a medical procedure)? When the risk estimates are perfectly calibrated, the answer is well understood: a classification problem's cost structure induces an optimal treatment threshold j*. In practice, however, some amount of miscalibration is unavoidable, raising a fundamental question: how should one use potentially miscalibrated predictions to inform binary decisions?

We formalize a natural (distribution-free) solution concept: given anticipated miscalibration of α, we propose using the threshold j that minimizes the worst-case regret over all α-miscalibrated predictors, where the regret is the difference in clinical utility between using the threshold in question and using the optimal threshold in hindsight. We provide closed form expressions for j when miscalibration is measured using both expected and maximum calibration error, which reveal that it indeed differs from j* (the optimal threshold under perfect calibration). We validate our theoretical findings on real data, demonstrating that there are natural cases in which making decisions using j improves the clinical utility.
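The two quantities this abstract relies on, the threshold induced by a cost structure and the regret of a chosen threshold, can be illustrated with a minimal Python sketch. It is not the authors' method and does not reproduce their closed-form worst-case threshold. The function names, the four-utility parameterization (`u_tp`, `u_fp`, `u_fn`, `u_tn`), and the grid search used for "optimal in hindsight" are all assumptions made for illustration. The first function uses the standard expected-utility argument: with perfectly calibrated risk p, treating is preferable once p reaches harm / (harm + benefit).

```python
import numpy as np


def optimal_threshold(u_tp, u_fp, u_fn, u_tn):
    """Threshold implied by a cost structure, assuming perfect calibration.

    Treat when p*u_tp + (1-p)*u_fp >= p*u_fn + (1-p)*u_tn, which rearranges
    to p >= harm / (harm + benefit).
    """
    harm = u_tn - u_fp      # utility lost by treating a non-case
    benefit = u_tp - u_fn   # utility gained by treating a true case
    return harm / (harm + benefit)


def mean_utility(risk, outcome, threshold, utilities):
    """Average realized utility when treating everyone with risk >= threshold."""
    u_tp, u_fp, u_fn, u_tn = utilities
    risk = np.asarray(risk, dtype=float)
    outcome = np.asarray(outcome)
    treat = risk >= threshold
    per_person = np.where(
        treat,
        np.where(outcome == 1, u_tp, u_fp),
        np.where(outcome == 1, u_fn, u_tn),
    )
    return float(per_person.mean())


def regret(risk, outcome, threshold, utilities, grid=None):
    """Utility gap between `threshold` and the best threshold in hindsight.

    The hindsight optimum is approximated by a grid search (an assumption of
    this sketch); the evaluated threshold is always a candidate, so the
    result is never negative.
    """
    grid = np.linspace(0.0, 1.0, 101) if grid is None else np.asarray(grid)
    candidates = np.append(grid, threshold)
    best = max(mean_utility(risk, outcome, t, utilities) for t in candidates)
    return best - mean_utility(risk, outcome, threshold, utilities)


# Hypothetical utilities: treating a true case is worth 9, treating a
# non-case costs 1, and not treating is the zero baseline.
utilities = (9.0, -1.0, 0.0, 0.0)
t_star = optimal_threshold(*utilities)

# Toy, made-up risk scores and outcomes, for illustration only.
risk = [0.05, 0.2, 0.6, 0.9]
outcome = [0, 0, 1, 1]
gap = regret(risk, outcome, t_star, utilities)
```

On this toy data the calibrated-optimal threshold still treats one non-case, so its regret is positive. That gap between the threshold chosen in advance and the best threshold in hindsight is the kind of quantity the abstract's worst-case analysis is concerned with.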

