Background
The number of publications using machine learning (ML) to predict cardiovascular outcomes and identify clusters of patients at greater risk has risen dramatically in recent years. However, research papers that use ML often fail to provide sufficient information about their algorithms to allow others to replicate the results in the same or different datasets.
Aim
To test the reproducibility of results from ML algorithms given three different levels of information commonly found in publications: model type alone, a description of the model, and the complete algorithm.
Methods
MIMIC-III is a healthcare dataset comprising detailed information from over 60,000 intensive care unit (ICU) admissions to the Beth Israel Deaconess Medical Center between 2001 and 2012. Access is available to all researchers upon approval and completion of a short training course.
Using this dataset, three models for predicting all-cause in-hospital mortality were created: two by a PhD student working in ML, and one taken from an existing research paper that used the same dataset and provided complete model information. A second researcher (a PhD student in ML and cardiology) was given the same dataset and tasked with reproducing these results. Initially, this second researcher was told only what type of model had been created in each case; a brief description of each algorithm followed. Finally, the complete algorithm for each model was provided. In all three scenarios, the recreated models were compared with the original models using the Area Under the Receiver Operating Characteristic Curve (AUC).
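For illustration only, the sketch below shows how an "original" and a "reproduced" model of the same type can be compared by AUC on a shared held-out set. This is not the authors' code: the synthetic data, the use of scikit-learn's GradientBoostingClassifier as a stand-in for the boosted tree classifier, and all hyperparameters are assumptions made for the example.

```python
# Minimal illustrative sketch (assumed Python/scikit-learn, not the study code):
# compare an "original" and a "reproduced" boosted-tree model by AUC.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in for the extracted cohort (features + in-hospital mortality label);
# the class imbalance roughly mimics a mortality outcome.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.87, 0.13], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# "Original" model: hyperparameters fully specified (complete-algorithm scenario).
original = GradientBoostingClassifier(n_estimators=200, max_depth=3,
                                      learning_rate=0.05, random_state=0)
# "Reproduced" model: same model type, details guessed (model-type-only scenario).
reproduced = GradientBoostingClassifier(random_state=1)

for name, model in [("original", original), ("reproduced", reproduced)]:
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name} AUC: {auc:.3f}")
```

In this setup, the gap between the two AUCs indicates how far the reproduction falls short when only the model type is known; supplying the full specification would make the two models, and hence their AUCs, coincide.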
Results
After excluding patients younger than 18 years and events with missing or invalid entries, 21,139 ICU admissions remained from 18,094 patients between 2001 and 2012, including 2,797 in-hospital deaths. Three models were produced: two Recurrent Neural Networks (RNNs), which differed substantially in their internal weights and input variables, and a Boosted Tree Classifier (BTC). The AUC of the first reproduced RNN matched that of the original RNN (Figure 1); however, the second RNN and the BTC could not be reproduced from model type alone. As more information was provided about these algorithms, the results from the reproduced models matched the original results more closely.
Conclusions
To create clinically useful ML tools with results that are reproducible and consistent, it is vital that researchers share sufficient detail about their models. Model type alone is not enough to guarantee reproducibility. Although some models can be recreated with limited information, this is not always the case, and the best results are obtained when the complete algorithm is shared. These findings are highly relevant to the application of ML in clinical practice.
Funding Acknowledgement
Type of funding sources: None.