Importance: Modern predictive models require large amounts of data for training and evaluation which can result in building models that are specific to certain locations, populations in them and clinical practices. Yet, best practices and guidelines for clinical risk prediction models have not yet considered such challenges to generalizability.
Objectives: To investigate changes in measures of predictive discrimination, calibration, and algorithmic fairness when transferring models for predicting in-hospital mortality across ICUs in different populations. Also, to study the reasons for the lack of generalizability in these measures.
Design, Setting, and Participants: In this multi-center cross-sectional study, electronic health records from 179 hospitals across the US with 70,126 hospitalizations were analyzed. Time of data collection ranged from 2014 to 2015.
Main Outcomes and Measures: The main outcome is in-hospital mortality. Generalization gap, defined as difference between model performance metrics across hospitals, is computed for discrimination and calibration metrics, namely area under the receiver operating characteristic curve (AUC) and calibration slope. To assess model performance by race variable, we report differences in false negative rates across groups. Data were also analyzed using a causal discovery algorithm "Fast Causal Inference" (FCI) that infers paths of causal influence while identifying potential influences associated with unmeasured variables.
Results: In-hospital mortality rates differed in the range of 3.9%-9.3% (1st-3rd quartile) across hospitals. When transferring models across hospitals, AUC at the test hospital ranged from 0.777 to 0.832 (1st to 3rd quartile; median 0.801); calibration slope from 0.725 to 0.983 (1st to 3rd quartile; median 0.853); and disparity in false negative rates from 0.046 to 0.168 (1st to 3rd quartile; median 0.092). When transferring models across geographies, AUC ranged from 0.795 to 0.813 (1st to 3rd quartile; median 0.804); calibration slope from 0.904 to 1.018 (1st to 3rd quartile; median 0.968); and disparity in false negative rates from 0.018 to 0.074 (1st to 3rd quartile; median 0.040). Distribution of all variable types (demography, vitals, and labs) differed significantly across hospitals and regions. Shifts in the race variable distribution and some clinical (vitals, labs and surgery) variables by hospital or region. Race variable also mediates differences in the relationship between clinical variables and mortality, by hospital/region.
Conclusions and Relevance: Group-specific metrics should be assessed during generalizability checks to identify potential harms to the groups. In order to develop methods to improve and guarantee performance of prediction models in new environments for groups and individuals, better understanding and provenance of health processes as well as data generating processes by sub-group are needed to identify and mitigate sources of variation.