Background Machine learning algorithms are currently used in a wide array of clinical domains to produce models that can predict clinical risk events. Most models are developed and evaluated with retrospective data, very few are evaluated in a clinical workflow, and even fewer report performance across different hospitals. In this study, we provide detailed evaluations of clinical risk prediction models in live clinical workflows for three different use cases in three different hospitals.

Objective The main objective of this study was to evaluate clinical risk prediction models in live clinical workflows and compare their performance in these settings with their performance on retrospective data. We also aimed to generalize the results by applying our investigation to three different use cases in three different hospitals.

Methods We trained clinical risk prediction models for three use cases (ie, delirium, sepsis, and acute kidney injury) in three different hospitals with retrospective data. We used machine learning and, specifically, deep learning to train models based on the Transformer architecture. The models shared a common design and were trained with a calibration tool common to all hospitals and use cases, but each model was calibrated with its hospital's specific data. The models were deployed in these three hospitals and used in daily clinical practice. The predictions made by these models were logged and correlated with the diagnoses at discharge. We compared their performance with evaluations on retrospective data and conducted cross-hospital evaluations.

Results The performance of the prediction models on data from live clinical workflows was similar to their performance on retrospective data: the average area under the receiver operating characteristic curve (AUROC) decreased slightly, by 0.6 percentage points (from 94.8% to 94.2% at discharge). The cross-hospital evaluations, by contrast, showed severely reduced performance: the average AUROC decreased by 8 percentage points (from 94.2% to 86.3% at discharge), indicating the importance of calibrating the model with data from the deployment hospital.

Conclusions Calibrating the prediction model with data from each deployment hospital led to good performance in live settings. The performance degradation in the cross-hospital evaluation reveals the limitations of developing a single generic model for different hospitals. Instead, designing a generic model-development process that generates a specialized prediction model for each hospital ensures model performance across hospitals.
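Both the live and the cross-hospital comparisons above rest on the AUROC computed from logged predictions against discharge diagnoses. As a minimal, dependency-free sketch of how such a figure is obtained (the function name and toy data are illustrative, not from the study), the AUROC can be computed via its Mann-Whitney formulation:

```python
def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic: the probability that a
    randomly chosen positive case receives a higher risk score than a
    randomly chosen negative case, with ties counting as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: two positives, two negatives, one mis-ranked pair.
print(auroc([1, 0, 1, 0], [0.9, 0.1, 0.3, 0.35]))  # 0.75
```

On this scale, the reported 94.2% at discharge corresponds to an AUROC of 0.942, i.e. a 94.2% chance that a patient who develops the condition is scored above one who does not.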
BACKGROUND Postoperative delirium is a highly relevant complication of cardiac surgery. It is associated with worse outcomes and considerably increased costs of care. A novel approach of monitoring patients with machine learning-enabled prediction software could trigger pre-emptive implementation of mitigation strategies as well as timely intervention.

OBJECTIVE This study evaluates the predictive accuracy of an artificial intelligence (AI) model for anticipating postoperative delirium, comparing it with established standards and measures of risk and vulnerability.

DESIGN Retrospective predictive accuracy study.

SETTING Records were gathered from a database for anaesthesia quality assurance at a specialised heart surgery centre in Germany.

PATIENTS Between January and July 2021, 131 patients were enrolled in the database and had data available for AI prediction modelling. After exclusion of incomplete follow-ups, a subset of 114 patients was included in the statistical analysis.

MAIN OUTCOME MEASURES Delirium was diagnosed with the Confusion Assessment Method for the ICU (CAM-ICU) over three days postoperatively with specific follow-up visits. AI predictions were also compared with risk assessment through a frailty screening, the Shulman Clock Drawing Test, and a checklist of predisposing factors including comorbidity, reduced mobility, and substance abuse.

RESULTS Postoperative delirium was diagnosed in 23.7% of patients. Postoperative AI screening exhibited reasonable performance, with an area under the receiver operating characteristic curve (AUROC) of 0.79 (95% confidence interval (CI), 0.69 to 0.87). However, preoperative prediction was weak for all methods (AUROC range, 0.55 to 0.66). Open heart surgery (vs endovascular valve replacement; 33.3% vs 10.4%, P < 0.01), longer postinterventional hospitalisation (12.8 vs 8.6 days, P < 0.01), and longer ICU stay (1.7 vs 0.3 days, P < 0.01) were all significantly associated with postoperative delirium.

CONCLUSION AI is a promising approach with considerable potential: it delivered results noninferior to the usual approach of structured evaluation of risk factors and questionnaires. Since these established methods do not provide the desired level of confidence, improved AI may soon deliver better performance.

TRIAL REGISTRATION None.
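The abstract reports the postoperative AUROC with a 95% CI (0.69 to 0.87). With only 114 cases, such an interval is typically obtained by resampling; the study does not state which method was used, so the following percentile-bootstrap sketch (function names and toy data are illustrative assumptions) shows one common way to compute it:

```python
import random

def auroc(labels, scores):
    # Small Mann-Whitney AUROC helper, inlined so the sketch is self-contained.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auroc_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUROC: resample patients with
    replacement, recompute the AUROC on each resample, and take the
    empirical alpha/2 and 1 - alpha/2 quantiles of those statistics."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # the resample must contain both classes
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]
```

Fixing the seed makes the interval reproducible; with 114 cases and a 23.7% event rate, bootstrap intervals around an AUROC of 0.79 are naturally wide, consistent with the reported 0.69 to 0.87.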