BACKGROUND
Secondary use of data routinely recorded in electronic health records (EHRs) is increasingly common, for example in epidemiological research. However, the data must be processed and prepared before they can be reused. Different choices can be made during this process, and these choices could have significant consequences for research outcomes.
OBJECTIVE
This study aimed to investigate how the data processing steps involved in the secondary use of EHR data influence research outcomes.
METHODS
This study used 2019 EHR data from eight Dutch general practices. These practices contributed data to two research databases: the Academic General Practitioner Development Network (AHON) registry and the Nivel Primary Care Database (Nivel-PCD). Each database extracted and processed the data using its own data processing pipeline. This allowed us to evaluate the impact of different processing methods by comparing the two datasets in a three-step approach: 1) patient demographics, 2) epidemiology of concordant patients, and 3) health service utilization of patients with three diagnoses (diabetes mellitus (DM), urinary tract infection (UTI), and cough). We compared several indicators of similarity between the two databases, including the number of contacts, regular consultations and visits, prescriptions, and episodes. Subsequently, for each of the three diagnoses we calculated the prevalence, the number of prescriptions, and the number of regular consultations and visits per 1000 patient years. The outcomes were compared using two-sample t tests with 99% confidence intervals.
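The rate calculation and the comparison described above can be sketched in a few lines. This is a minimal illustration, not the study's analysis code: the function names, the choice of an unequal-variance (Welch-type) standard error, the large-sample normal approximation for the 99% interval, and all numbers are assumptions made for the example.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def rate_per_1000_patient_years(events: int, patient_years: float) -> float:
    """Events per 1000 patient years of follow-up."""
    if patient_years <= 0:
        raise ValueError("patient_years must be positive")
    return 1000.0 * events / patient_years


def compare_means(a, b, confidence=0.99):
    """Compare two samples using an unequal-variance standard error.

    Assumption: the interval uses the normal quantile, which is only
    reasonable for large samples (for example, tens of thousands of patients).
    Returns (difference in means, lower bound, upper bound, test statistic).
    """
    diff = mean(a) - mean(b)
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 2.576 for 99%
    return diff, diff - z * se, diff + z * se, diff / se


# Made-up contacts per patient in two hypothetical databases.
# The tiny sample is for illustration only; it is not data from the study.
contacts_db_a = [4, 6, 5, 7, 3, 6, 5, 4]
contacts_db_b = [5, 6, 5, 8, 3, 7, 5, 4]

diff, low, high, stat = compare_means(contacts_db_a, contacts_db_b)
# The difference is flagged only when the 99% interval excludes zero.
significant = not (low <= 0.0 <= high)

# Hypothetical example: 250 prescriptions over 5000 patient years.
example_rate = rate_per_1000_patient_years(250, 5000.0)
```

With small samples, a t quantile with the appropriate degrees of freedom would replace the normal quantile used here.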
RESULTS
The number of enrolled patients differed between the two datasets (AHON registry N=47,517; Nivel-PCD N=44,247). However, the patient demographics were similar. Among concordant patients, all indicator outcomes (the number of contacts, prescriptions, and episodes per patient) differed between the two databases, except for the number of regular consultations and visits (P=.46). Differences in the indicator outcomes varied between the three diagnosis groups, whereas the number of regular consultations and visits was similar between databases for all diagnoses (DM P=<.55, UTI P=.73, cough P=.73).
CONCLUSIONS
The results show that researchers and other users of routine health data need to be aware of the steps taken to process these data and make them available for research. Data processors should share their knowledge about the choices made, and researchers and policymakers should invest in their understanding of this type of metadata. This transparency is all the more important in light of the European Health Data Space and the ever-increasing secondary use of routinely recorded health data. Future research should focus on the role of transparency and joint decision making in minimizing the effects of data processing steps, and on gaining insight into the influence of individual processing steps on research outcomes. This could stimulate a common approach among data processors and researchers, resulting in increased data interoperability.