Stroke is the second leading cause of death in the world and the top cause of disability in the US. The abundance of electronic health records (EHRs) provides an opportunity to study this disease in situ. Doing so requires accurately identifying stroke patients from medical records. So-called "EHR phenotyping" algorithms, however, are difficult and time-consuming to create and often must rely on incomplete information. There is an opportunity to use machine learning to speed up and ease the process of cohort and feature identification. We systematically compared and evaluated the ability of several machine learning algorithms to automatically phenotype acute ischemic stroke patients. We found that these algorithms can achieve high performance (e.g., average AUROC = 0.955) with little to no manual feature curation, and that other performance metrics differentiate the models' ability to generalize. We also found that commonly available data such as diagnosis codes can be used as noisy proxies for training when a reference panel of stroke patients is unavailable. Additionally, we found some limitations when the algorithms are used to place patients into stroke risk classes. We used these models to identify previously unidentified stroke patients from our patient population of 6.4 million and found expected rates of stroke across the population.
Stroke is a highly heterogeneous and complex disease that is a leading cause of death and severe disability for millions of survivors worldwide [1]. It is characterized by an acute focal loss of neurological function and is primarily caused by loss of blood flow to a specific area of the brain. There are many identifiable risk factors for stroke, which include various metabolic, cardiovascular, and coagulative diseases, medications, lifestyle, and demographics [2]. Triggers such as pollution, infection, and inflammatory disorders further complicate the etiology of the disease [2]. Most of the unidentified risk, up to 40%, is thought to be genetic [3]. Accurate determination of the etiology of disease is essential for risk stratification and optimal treatment, but this can be difficult, as up to 35% of strokes are of undetermined cause [4,5].
Traditionally, identifying a stroke patient requires neurologists to integrate multiple facets of data, including medical notes, labs, and imaging reports, with their clinical experience. This requires time-consuming manual review. Stroke diagnoses have also been missed or falsely assigned [6]. Stroke is often coded in outpatient follow-up, so a hospital EHR may not have the ICD9 or ICD10 data to identify stroke with high sensitivity. Given that stroke-specific ICD9 codes identify patients incompletely and that structured multimodal data are available in EHR settings, there is a need to move beyond laborious manual chart review and to automate identification of stroke patients with commonly accessible EHR data.
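To make the evaluation setting concrete, the following minimal Python sketch computes AUROC from model scores and contrasts scoring against true case status with scoring against an incomplete diagnosis-code proxy. It is an illustration under stated assumptions, not the pipeline used in this study: the function names `auroc` and `drop_codes`, the simulated cohort, the 10% case prevalence, and the 30% missed-code rate are all hypothetical.

```python
import random


def auroc(labels, scores):
    """AUROC via the rank-sum (Mann-Whitney U) identity.

    labels: sequence of 0/1 (1 = case); scores: sequence of floats,
    higher meaning more likely to be a case. Tied scores share the
    average rank.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one case and one control")
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the block of tied scores starting at index i.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1..j
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_stat / (n_pos * n_neg)


def drop_codes(labels, miss_rate, rng):
    """Hypothetical noise model: each true case loses its diagnosis
    code with probability miss_rate (under-coding only)."""
    return [0 if (y == 1 and rng.random() < miss_rate) else y for y in labels]


# Simulated cohort (assumed numbers, for illustration only).
rng = random.Random(0)
true_labels = [1 if rng.random() < 0.10 else 0 for _ in range(2000)]
scores = [rng.gauss(2.0 if y else 0.0, 1.0) for y in true_labels]
proxy_labels = drop_codes(true_labels, miss_rate=0.30, rng=rng)

print("AUROC vs. true status:   ", round(auroc(true_labels, scores), 3))
print("AUROC vs. code-based proxy:", round(auroc(proxy_labels, scores), 3))
```

In this kind of simulation, cases that lack a code are counted as controls, so the AUROC measured against the proxy can differ from the AUROC measured against true status even though the scores are unchanged. This is one reason a manually reviewed reference set remains useful for evaluation even when proxy labels are used for training.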
18Phenotyping algorithms must address two tasks: curating features to define the pheno-19 type, and identifying ca...