Background: Electronic medical records (EMRs) contain a wealth of information that can support data-driven decision making in healthcare policy design and service planning. Although research using EMRs has become increasingly prevalent, challenges such as coding inconsistency, questionable data validity and a lack of suitable measures in important domains still hinder progress.
Objective: To design a structured way to process records in administrative EMR systems for health services research and to assess validity in selected areas.
Methods: Based on a local hospital EMR system in Singapore, we developed a structured framework for EMR data processing, including standardization and phenotyping of diagnosis codes, construction of cohorts with multi-level views, and generation of variables and proxy measures to supplement primary data. Disease complexity was estimated by the Charlson Comorbidity Index (CCI) and the Polypharmacy Score (PPS), while socioeconomic status (SES) was estimated by housing type. The validity of the modified diagnosis codes and of the derived measures was investigated.
Results: Visit-level (N=7,778,761) and patient-level (n=549,109) records were generated. Diagnosis codes were standardized to ICD-9-CM with a mapping rate of 97.5%, and 97.4% of the ICD-9-CM codes were phenotyped successfully using Clinical Classifications Software (CCS). Diagnosis codes that were modified (by truncation or zero-addition) during standardization and phenotyping had the modification validated by physicians, with validity rates of more than 90%. The disease complexity measures (CCI and PPS) and SES were found to be valid and robust in a correlation analysis and a multivariate regression analysis. CCI and PPS were correlated with each other and positively correlated with healthcare utilization measures. Larger housing type was associated with lower government subsidies received, supporting housing type as a proxy for SES. Profiles of the constructed cohorts showed differences in disease prevalence, disease complexity and hospital utilization between those aged above 65 and those below.
Conclusion: The framework proposed in this study should be useful to other researchers working with EMR data for health services research. Further analyses are needed to better understand the differences observed between the cohorts.