Informative presence and observation in routine health data: A review of methodology for clinical risk prediction

Sisk, Rose; Lin, Lijing; Sperrin, Matthew; Barrett, Jessica; Tom, Brian D. M.; Díaz-Ordaz, Karla; Peek, Niels; Martin, Glen P.

doi:10.1093/jamia/ocaa242

Cited by 27 publications

(29 citation statements)

References 56 publications

“…Our study predicts new onset of illness and utilizes 365 days prior observation time to apply a washout window to confirm the absence of the illness and therefore it is possible that our study could suffer from informative presence [15] and could include data from sicker patients. Therefore, we include results from two sensitivity analyses where the minimum required previous observation was set to 0 days and set to 730 days in order to assess the impact of informative presence.…”

Section: Discussion (mentioning)

confidence: 99%

Evaluating the impact of covariate lookback times on performance of patient-level prediction models

Hardin, Reps (2021), BMC Med Res Methodol

Background: The goal of our study is to examine the impact of the lookback length when engineering features to use in developing predictive models using observational healthcare data. Using a longer lookback for feature engineering gives more insight about patients but increases the issue of left-censoring. Methods: We used five US observational databases to develop patient-level prediction models. A target cohort of subjects with hypertensive drug exposures and outcome cohorts of subjects with acute (stroke and gastrointestinal bleeding) and chronic outcomes (diabetes and chronic kidney disease) were developed. Candidate predictors that exist on or prior to the target index date were derived within the following lookback periods: 14, 30, 90, 180, 365, 730, and all days prior to index. We predicted the risk of outcomes occurring 1 day until 365 days after index. Ten lasso logistic models for each lookback period were generated to create a distribution of area under the curve (AUC) metrics to evaluate the discriminative performance of the models. Calibration intercept and slope were also calculated. Impact on external validation performance was investigated across five databases. Results: The maximum difference in AUC between models developed using different lookback periods within a database was < 0.04 for diabetes (in MDCR, AUC of 0.593 with 14-day lookback vs. AUC of 0.631 with all-time lookback) and 0.012 for renal impairment (in MDCR, AUC of 0.675 with 30-day lookback vs. AUC of 0.687 with 365-day lookback). For the acute outcomes, the maximum difference in AUC across lookbacks within a database was 0.015 for stroke (in MDCD, AUC of 0.767 with 14-day lookback vs. AUC of 0.782 with 365-day lookback) and < 0.03 for gastrointestinal bleeding (in CCAE, AUC of 0.631 with 14-day lookback vs. AUC of 0.660 with 730-day lookback). Conclusions: In general, the choice of covariate lookback had only a small impact on discrimination and calibration, with a short lookback (< 180 days) occasionally decreasing discrimination. Based on the results, when training a logistic regression model for prediction, using covariates with a 365-day lookback appears to be a good tradeoff between performance and interpretation.

“…While multiple imputation is often used in clinical prediction models because it gives unbiased estimates under the missing at random (MAR) assumption, it is unlikely that the MAR assumption holds in the routinely-collected EHR data that we use [45]. The missing indicator method that we adopt does not rely on the MAR assumption and has been found to lead to improved predictive performance in EHR data [43-45]. Furthermore, we do not seek to make prognostic predictions for patients after clinicians have identified them as entering the last few hours or days of life.…”

Section: Discussion (mentioning)

confidence: 99%

“…We handle missing data using the missingness indicator approach because the recording in the EHR of a clinical parameter, regardless of the value, is often indicative of the treating health professional’s contemporaneous view of the patient’s prognosis [43–44]. To do this we augment the set of potential predictors with binary variables that indicate whether, during the window of time we consider, any measurement of the corresponding parameter is available for that patient.…”

Section: Methods (mentioning)

confidence: 99%

Development and validation of a dynamic 48-hour in-hospital mortality risk stratification for COVID-19 in a UK teaching hospital: a retrospective cohort study

Wiegand et al. (2021), Preprint

We propose a prognostic dynamic risk stratification for 48-hour in-hospital mortality in patients with COVID-19, using demographics and routinely-collected observations and laboratory tests: age, Clinical Frailty Scale score, heart rate, respiratory rate, SpO2/FiO2 ratio, white cell count, acidosis (pH < 7.35) and Interleukin-6. We train and validate the model using data from a UK teaching hospital, adopting a landmarking approach that accounts for competing risks and informative missingness. Internal validation of the model on the first wave of patients presenting between March 1 and September 12, 2020 achieves an AUROC of 0.90 (95% CI 0.87-0.93). Temporal validation on patients presenting between September 13, 2020 and January 1, 2021 gives an AUROC of 0.91 (95% CI 0.88-0.95). The resulting mortality stratification tool has the potential to provide physicians with an assessment of a patient's evolving prognosis throughout the course of active hospital treatment.

“…We created missingness indicators for each predictor with 1 or more missing values, which marked the observations that were missing a value. Inclusion of missingness indicators often improves predictive performance (Agor et al 2019; Sperrin et al 2020), in part because it can reflect the information-seeking behavior of clinicians stemming from medical diagnosis and evaluation (Agniel et al 2018; Groenwold 2020; Sisk et al 2021). The set of missingness indicators was analyzed for perfect collinearity, and duplicate indicators were dropped.…”

Section: Missing Data (mentioning)

confidence: 99%

“…Additional details are provided in the supplemental information. Multiple imputation was not necessary because our scientific goal was to characterize predictive performance for the unimputed outcome variable, rather than to estimate statistical parameters for covariates that were imputed, such as linear regression coefficients (Sisk et al 2021; Sperrin et al 2020).…”

Section: Missing Data (mentioning)

confidence: 99%

Development of an ensemble machine learning prognostic model to predict 60-day risk of major adverse cardiac events in adults with chest pain

Kennedy, Mark, Huang, et al. (2021), Preprint

Background: Chest pain is the second leading reason for emergency department (ED) visits and is commonly identified as a leading driver of low-value health care. Accurate identification of patients at low risk of major adverse cardiac events (MACE) is important to improve resource allocation and reduce over-treatment. Objectives: We sought to assess machine learning (ML) methods and electronic health record (EHR) covariate collection for MACE prediction. We aimed to maximize the pool of low-risk patients that are accurately predicted to have less than 0.5% MACE risk and may be eligible for reduced testing. Population Studied: 116,764 adult patients presenting with chest pain in the ED and evaluated for potential acute coronary syndrome (ACS). 60-day MACE rate was 1.9%. Methods: We evaluated ML algorithms (lasso, splines, random forest, extreme gradient boosting, Bayesian additive regression trees) and SuperLearner stacked ensembling. We tuned ML hyperparameters through nested ensembling, and imputed missing values with generalized low-rank models (GLRM). We benchmarked performance to key biomarkers, validated clinical risk scores, decision trees, and logistic regression. We explained the models through variable importance ranking and accumulated local effect visualization. Results: The best discrimination (area under the precision-recall [PR-AUC] and receiver operating characteristic [ROC-AUC] curves) was provided by SuperLearner ensembling (0.148, 0.867), followed by random forest (0.146, 0.862). Logistic regression (0.120, 0.842) and decision trees (0.094, 0.805) exhibited worse discrimination, as did risk scores [HEART (0.064, 0.765), EDACS (0.046, 0.733)] and biomarkers [serum troponin level (0.064, 0.708), electrocardiography (0.047, 0.686)]. The ensemble's risk estimates were miscalibrated by 0.2 percentage points. The ensemble accurately identified 50% of patients to be below a 0.5% 60-day MACE risk threshold. 
The most important predictors were age, peak troponin, HEART score, EDACS score, and electrocardiogram. GLRM imputation achieved 90% reduction in root mean-squared error compared to median-mode imputation. Conclusion: Use of ML algorithms, combined with broad predictor sets, improved MACE risk prediction compared to simpler alternatives, while providing calibrated predictions and interpretability. Standard risk scores may neglect important health information available in other characteristics and combined in nuanced ways via ML.
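The missingness-indicator step described in the citation statements above (one binary indicator per predictor with at least one missing value, followed by dropping duplicate indicators) can be illustrated with a minimal sketch. This is not the citing authors' code: the function name `add_missingness_indicators`, the `_missing` suffix, and the toy predictors are all hypothetical, and the sketch only removes indicators that are exact copies of an earlier one, which is a narrower check than a full collinearity analysis.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def add_missingness_indicators(rows: List[Row]) -> List[Row]:
    """Add a 0/1 '<name>_missing' column for every predictor that has at
    least one missing (None) value, keeping only the first of any set of
    indicators that are identical across all rows."""
    if not rows:
        return []

    # Build one indicator column per predictor: 1 where the value is missing.
    indicators = {}
    for name in rows[0]:
        column = tuple(1 if row[name] is None else 0 for row in rows)
        if any(column):  # skip predictors that are fully observed
            indicators[f"{name}_missing"] = column

    # Drop exact duplicates (hypothetical stand-in for the collinearity check).
    seen = set()
    kept = {}
    for name, column in indicators.items():
        if column not in seen:
            seen.add(column)
            kept[name] = column

    # Return copies of the rows with the kept indicators appended.
    augmented = []
    for i, row in enumerate(rows):
        new_row = dict(row)
        for name, column in kept.items():
            new_row[name] = column[i]
        augmented.append(new_row)
    return augmented


# Toy data (hypothetical): ldl and hdl are always missing together,
# so their indicators are duplicates and only the first one is kept.
rows = [
    {"troponin": 0.02, "ldl": None, "hdl": None, "age": 61.0},
    {"troponin": None, "ldl": 3.1, "hdl": 1.2, "age": 54.0},
    {"troponin": 0.40, "ldl": None, "hdl": None, "age": 70.0},
]
result = add_missingness_indicators(rows)
```

In this toy example the original predictor values are left unimputed; a model that cannot accept missing values would still need a fill value alongside the indicators.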
