Introduction
Clinical notes, biomarkers, and neuroimaging have proven valuable in dementia prediction models. Whether commonly available structured clinical data can predict dementia is an emerging area of research. We aimed to predict gold-standard, research-based diagnoses of dementia including Alzheimer’s disease (AD) and/or Alzheimer’s disease related dementias (ADRD), in addition to ICD-based AD and/or ADRD diagnoses, in a well-phenotyped, population-based cohort using a machine learning approach.
Methods
Administrative healthcare data (k = 163 diagnostic features), in addition to census/vital record sociodemographic data (k = 6 features), were linked to the Cache County Study (CCS, 1995–2008).
Results
Among successfully linked UPDB-CCS participants (
n
= 4206), 522 (12.4%) had incident dementia (AD alone, AD comorbid with ADRD, or ADRD alone) as per the CCS “gold standard” assessments. Random Forest models, with a 1-year prediction window, achieved the best performance with an Area Under the Curve (AUC) of 0.67. Accuracy declined for dementia subtypes: AD/ADRD (AUC = 0.65); ADRD (AUC = 0.49). Accuracy improved when using ICD-based dementia diagnoses (AUC = 0.77).
Discussion
Commonly available structured clinical data (without labs, notes, or prescription information) demonstrate modest ability to predict “gold-standard” research-based AD/ADRD diagnoses, corroborated by prior research. Using ICD diagnostic codes to identify dementia as done in the majority of machine learning dementia prediction models, as compared to “gold-standard” dementia diagnoses, can result in higher accuracy, but whether these models are predicting true dementia warrants further research.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12911-024-02728-4.