Aims
This study aimed to review the performance of machine learning (ML) methods compared with conventional statistical models (CSMs) for predicting readmission and mortality in patients with heart failure (HF) and to present an approach to formally evaluate the quality of studies using ML algorithms for prediction modelling.
Methods and results
Following Preferred Reporting Items for Systematic Reviews and Meta‐Analyses guidelines, we performed a systematic literature search using MEDLINE, EPUB, Cochrane CENTRAL, EMBASE, INSPEC, ACM Library, and Web of Science. Eligible studies included primary research articles published between January 2000 and July 2020 comparing ML and CSMs in mortality and readmission prognosis of initially hospitalized HF patients. Data were extracted and analysed by two independent reviewers. A modified CHARMS checklist was developed in consultation with ML and biostatistics experts for quality assessment and was utilized to evaluate studies for risk of bias. Of 4322 articles identified and screened by two independent reviewers, 172 were deemed eligible for a full‐text review. The final set comprised 20 articles and 686 842 patients. ML methods included random forests (n = 11), decision trees (n = 5), regression trees (n = 3), support vector machines (n = 9), neural networks (n = 12), and Bayesian techniques (n = 3). CSMs included logistic regression (n = 16), Cox regression (n = 3), and Poisson regression (n = 3). In 15 studies, readmission was examined at multiple time points ranging from 30 to 180 days, with the majority of studies (n = 12) presenting prediction models for 30 day readmission. ML‐derived c‐indices were higher than CSM‐derived c‐indices in 16 of 21 time‐point comparisons. In seven studies, mortality was examined at nine time points ranging from in‐hospital mortality to 1 year survival; at seven of these nine time points, c‐indices were higher with ML. Two of these seven studies performed survival analyses using random survival forests as their ML prediction models, and both reported higher c‐indices with ML than with CSMs. A limitation of studies using ML techniques was that the majority were not externally validated, and calibration was rarely assessed. In the only study that was externally validated in a separate dataset, ML was superior to CSMs (c‐indices 0.913 vs. 0.835).
Conclusions
ML algorithms had better discrimination than CSMs in most studies aiming to predict risk of readmission and mortality in HF patients. Our review shows that ML‐based prediction models need external validation. We suggest that ML‐based studies should also be evaluated using clinical quality standards for prognosis research. Registration: PROSPERO CRD42020134867