Objectives
To compare registry and electronic health record (EHR) data mining approaches for cohort ascertainment in pediatric pulmonary hypertension (PH), with the aim of overcoming some of the limitations of relying on registry enrollment alone to identify patients with disease phenotypes.
Study design
This study was a single-center retrospective analysis of EHR and registry data at Boston Children’s Hospital. The local Informatics for Integrating Biology and the Bedside (i2b2) data warehouse was queried for billing codes, prescriptions, and narrative data related to pediatric PH. Computable phenotype algorithms were developed by fitting penalized logistic regression models to a physician-annotated training set. The algorithms were applied to a candidate patient cohort, and performance was evaluated using a separate set of 136 records and 179 registry patients. We compared the clinical and demographic characteristics of patients identified by the computable phenotype with those of patients enrolled in the registry.
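As a rough illustration of what fitting a penalized logistic regression to an annotated training set involves, the sketch below fits a model with an L2 (ridge) penalty by batch gradient descent. It is not the study's implementation: the penalty type, the three count features, the toy data, and the function names `fit_penalized_logistic` and `predict_proba` are all assumptions made for the example.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_penalized_logistic(X, y, penalty=0.1, lr=0.1, epochs=2000):
    """Fit logistic regression with an L2 penalty by batch gradient descent.

    X: list of feature rows; y: list of 0/1 annotator labels.
    Returns (weights, intercept). The intercept is left unpenalized.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, label in zip(X, y):
            # Prediction error for this record under the current model.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) - label
            for j in range(d):
                grad_w[j] += err * row[j]
            grad_b += err
        for j in range(d):
            # Mean log-loss gradient plus the shrinkage term from the penalty.
            w[j] -= lr * (grad_w[j] / n + penalty * w[j])
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, row):
    # Estimated probability that the record belongs to a true case.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b)


# Made-up training rows (hypothetical features):
# [PH billing codes, PH prescriptions, PH mentions in notes]
X_train = [[0, 0, 0], [1, 0, 0], [0, 0, 1], [1, 0, 1],
           [3, 1, 2], [4, 2, 3], [2, 1, 4], [5, 2, 2]]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_penalized_logistic(X_train, y_train)
print(predict_proba(w, b, [4, 2, 3]))  # record with many PH-related signals
print(predict_proba(w, b, [0, 0, 0]))  # record with none
```

The penalty shrinks the coefficients toward zero, which limits overfitting when the annotated set is small relative to the number of candidate features. An L1 penalty would additionally drop uninformative features; the source does not state which was used.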
Results
The computable phenotype had an area under the ROC curve of 90% (95% CI 85% – 95%) and a positive predictive value of 85% (95% CI 77% – 93%), and it identified 413 patients with pediatric PH who were not enrolled in the registry (an additional 231%). Patients identified by the computable phenotype were clinically distinct from registry patients, with a greater prevalence of diagnoses related to perinatal distress and left heart disease.
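The "additional 231%" figure can be checked with simple arithmetic, assuming it is expressed relative to the 179 registry patients mentioned in the study design. That denominator is inferred from the numbers and is not stated outright in the text.

```python
registry_patients = 179      # registry patients (assumed denominator)
additional_patients = 413    # found by the computable phenotype, not in the registry

# Additional patients as a fraction of registry enrollment.
relative_increase = additional_patients / registry_patients
combined_cohort = registry_patients + additional_patients

print(f"{relative_increase:.0%}")  # 231%
print(combined_cohort)             # 592
```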
Conclusions
Mining of EHRs using computable phenotypes identified a large cohort of patients not recruited using a classic registry. Fusion of EHR and registry data can improve cohort ascertainment for the study of rare diseases.
Trial Registration
ClinicalTrials.gov: NCT02249923