Objective: Combining population-based health registries and electronic health records offers the opportunity to create large, phenotypically detailed patient cohorts of high quality. In this study, we used text mining of clinical notes to confirm ICD-10-registered epilepsy diagnoses and classify patients according to focal and generalized epilepsy types.

Methods: Using the Danish National Patient Registry, we identified patients who received an ICD-10 diagnosis of epilepsy between 2006 and 2016. To validate the epilepsy diagnosis and stratify patients into focal and generalized epilepsy types, we constructed dictionaries for text mining-based extraction from clinical notes. Two physicians manually reviewed the clinical notes for a total of 527 patients and assigned epilepsy diagnoses, which were compared with the text-mined diagnoses.

Results: We identified 23,632 patients with an ICD-10 diagnosis of epilepsy, of whom 50% were registered with an unspecified epilepsy diagnosis. In total, text mining classified 11,211 patients as likely to have epilepsy, with an F1 measure ranging from 82% to 90%. Manual review of the electronic health records for 310 patients revealed a false discovery rate of 29%. The text mining algorithm reduced this rate to 4%. The weighted average F1 measure for text mining-assigned epilepsy types was 79% (82% for focal and 76% for generalized epilepsy). Text mining successfully assigned a focal or generalized epilepsy type to 92% of the text mining-eligible patients registered with unspecified epilepsy.

Significance: Text mining of electronic health records can be used to establish a patient cohort with a much higher likelihood of having a diagnosis of epilepsy and a focal or generalized epilepsy type than a cohort created from ICD-10 epilepsy codes alone. We believe the concept will be essential for future genome-wide and phenome-wide association studies and, subsequently, for the development of precision medicine for epilepsy patients.
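
For readers less familiar with the evaluation measures reported above, the following is a minimal sketch of how an F1 measure, a false discovery rate, and a class-weighted average F1 are conventionally computed from counts of true positives, false positives, and false negatives. The function names and all counts are hypothetical and chosen only for illustration; they are not the study's data. The abstract does not state how the weights for the average were derived, so weighting by the number of patients per class is an assumption here.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 measure: harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn)


def false_discovery_rate(tp: int, fp: int) -> float:
    """Share of positive calls that are wrong: FP / (TP + FP)."""
    return fp / (tp + fp)


def weighted_f1(f1_by_class: dict, support_by_class: dict) -> float:
    """Average of per-class F1 values, weighted by class size (assumed weighting)."""
    total = sum(support_by_class.values())
    return sum(f1_by_class[c] * support_by_class[c] for c in f1_by_class) / total


if __name__ == "__main__":
    # Hypothetical counts from comparing text-mined labels with manual review.
    counts = {
        "focal": {"tp": 90, "fp": 20, "fn": 15},
        "generalized": {"tp": 40, "fp": 12, "fn": 14},
    }
    f1s = {c: f1_score(**v) for c, v in counts.items()}
    # Support = number of manually confirmed patients in each class (TP + FN).
    support = {c: v["tp"] + v["fn"] for c, v in counts.items()}
    print("per-class F1:", f1s)
    print("weighted F1:", weighted_f1(f1s, support))
    print("FDR (focal):", false_discovery_rate(90, 20))
```

The false discovery rate is the complement of precision, so lowering it from 29% to 4% corresponds to raising the share of correctly identified epilepsy patients among those flagged from 71% to 96%.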