A key requirement for longitudinal studies using routinely collected health data is the ability to determine which individuals are present in the datasets used, and over what time period. Individuals can enter and leave the covered population of administrative datasets for a variety of reasons, including both life events and characteristics of the datasets themselves. An automated, customizable method of determining individuals' presence was developed for the primary care dataset in Swansea University's SAIL Databank. The primary care dataset covers only a portion of Wales, with 76% of practices participating, and the start and end dates of the data vary by practice. Additionally, individuals can change practices or leave Wales. To address these issues, a two-step process was developed. First, the period for which each practice had data available was calculated by measuring changes in the rate of events recorded over time. Second, the registration records for each individual were simplified, with anomalies such as short gaps and overlaps resolved by applying a set of rules. The result of these two analyses was a cleaned set of records indicating the start and end dates of available primary care data for each individual. Analysis of GP records showed that 91.0% of events occurred within periods that the algorithm identified as having available data, and 98.4% of those events were recorded at the practice of registration computed by the algorithm. A standardized method for solving this common problem has enabled faster development of studies using this dataset. Using a rigorous, tested, standardized method of verifying presence in the study population will also positively influence the quality of research.
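The second step of this process, simplifying each individual's registration records by merging overlaps and closing short gaps, can be illustrated with a brief sketch. The interval representation, the gap threshold and the function name below are assumptions made for illustration only; they are not the SAIL implementation.

```python
# Hypothetical sketch of cleaning one person's registration intervals:
# overlapping periods are merged and short gaps are closed, leaving a
# simplified set of non-overlapping periods of available data.
from datetime import date, timedelta

MAX_GAP = timedelta(days=14)  # assumed threshold for a "short" gap


def clean_registrations(intervals):
    """Merge overlapping registration periods and close short gaps.

    `intervals` is a list of (start_date, end_date) tuples for one person.
    Returns a simplified list of non-overlapping periods.
    """
    cleaned = []
    for start, end in sorted(intervals):
        if cleaned and start <= cleaned[-1][1] + MAX_GAP:
            # Overlap or short gap: extend the previous period.
            cleaned[-1] = (cleaned[-1][0], max(cleaned[-1][1], end))
        else:
            cleaned.append((start, end))
    return cleaned


print(clean_registrations([
    (date(2010, 1, 1), date(2012, 6, 30)),
    (date(2012, 7, 5), date(2015, 3, 1)),    # short gap -> merged
    (date(2016, 1, 1), date(2018, 12, 31)),  # long gap -> kept separate
]))
```

In practice the rules and thresholds would be configurable, matching the customizable design described above.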
Background
Electronic health record (EHR) data are available for research in all UK nations, and cross-nation comparative studies are becoming more common. All UK inpatient EHRs are based around episodes, but episode-based analysis may not sufficiently capture the patient journey. There is no UK-wide method for aggregating episodes into standardised person-based spells. This study identifies two data quality issues affecting the creation of person-based spells and tests four methods of creating these spells, for implementation across all UK nations.

Methods
Welsh inpatient EHRs from 2013 to 2017 were analysed. Phase one described two data quality issues: transfers of care and episode sequencing. Phase two compared four methods for creating person spells. Measures were mean length of stay (LOS, expressed in days) and number of episodes per person spell for each method.

Results
3.5% of total admissions were transfers-in and 3.1% of total discharges were transfers-out. 68.7% of total transfers-in and 48.7% of psychiatric transfers-in had an identifiable preceding transfer-out, and 78.2% of total transfers-out and 59.0% of psychiatric transfers-out had an identifiable subsequent transfer-in. 0.2% of total episodes and 4.0% of psychiatric episodes overlapped with at least one other episode of any specialty. Method one (no evidence of transfer required; overlapping episodes grouped together) resulted in the longest mean LOS (4.0 days for all specialties; 48.5 days for psychiatric specialties) and the fewest single-episode person spells (82.4% for all specialties; 69.7% for psychiatric specialties). Method three (evidence of transfer required; overlapping episodes separated) resulted in the shortest mean LOS (3.7 days for all specialties; 45.8 days for psychiatric specialties) and the most single-episode person spells (86.9% for all specialties; 86.3% for psychiatric specialties).

Conclusions
Transfers-in appear better recorded than transfers-out. Transfer coding is incomplete, particularly for psychiatric specialties. The proportion of episodes that overlap is small, but psychiatric episodes are disproportionately affected. The most successful method for grouping episodes into person spells aggregated overlapping episodes and required no evidence of transfer from admission source/method or discharge destination codes. The least successful method treated overlapping episodes as distinct and required transfer coding. The impact of all four methods was greater for psychiatric specialties.
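The grouping logic behind these methods can be sketched briefly. The example below illustrates the general idea of method one (overlapping or contiguous episodes grouped into one person spell, with no transfer coding required); the episode representation and field names are assumptions for illustration and are not taken from the study.

```python
# Illustrative sketch: group a person's episodes into spells whenever their
# dates overlap or one begins on the day the previous one ends.
from datetime import date


def episodes_to_spells(episodes):
    """Group (admission_date, discharge_date) episodes into person spells."""
    spells = []
    for adm, dis in sorted(episodes):
        if spells and adm <= spells[-1]["discharge"]:
            # Overlapping or contiguous episode: extend the current spell.
            spell = spells[-1]
            spell["episodes"] += 1
            spell["discharge"] = max(spell["discharge"], dis)
        else:
            spells.append({"admission": adm, "discharge": dis, "episodes": 1})
    return spells


spells = episodes_to_spells([
    (date(2015, 3, 1), date(2015, 3, 4)),
    (date(2015, 3, 4), date(2015, 3, 10)),  # starts on prior discharge day -> same spell
    (date(2015, 6, 1), date(2015, 6, 2)),
])
for s in spells:
    print(s, "LOS:", (s["discharge"] - s["admission"]).days, "days")
```

A method that instead requires evidence of transfer would add a check of admission source/method and discharge destination codes before extending a spell, which is where incomplete transfer coding begins to matter.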
Introduction
When datasets are collected mainly for administrative rather than research purposes, data quality checks are necessary to ensure robust findings and to avoid biased results due to incomplete or inaccurate data. When done manually, data quality checks are time-consuming. We introduced automation to speed up the process and save effort.

Objectives and Approach
We have devised a set of automated generic quality checks and reports, which can be run on any dataset in a relational database without any dataset-specific knowledge or configuration. The code is written in Python. Checks include linkage quality, agreement with a population data source, comparison with the previous data version, duplicate checks, null counts, value distributions and ranges, and more. Where dataset metadata are available, checks for validity against lookup tables are included, and the output report includes documentation of the data contents. An HTML report with dynamic data tables and interactive graphs, allowing easy exploration of the results, is produced using RMarkdown.

Results
Automating the generic data quality checks provides an easy and quick way to report on data issues with minimal effort. It allows comparison with reference tables, lookups and previous versions of the same table to highlight differences. Moreover, the tool can be provided to researchers as a means of gaining a more detailed understanding of their data. While other research data quality tools exist, this tool is distinguished by its features specific to linked-data research and by its implementation in a relational database environment. It has been successfully tested on datasets of over two billion rows. The tool was designed for use within the SAIL Databank, but could easily be adapted for use in other settings.

Conclusion/Implications
The effort spent automating generic testing and reporting on the data quality of research datasets is more than repaid by its outputs. Benefits include quick detection and scrutiny of many sources of invalid and incomplete data. The process can easily be expanded to accommodate more standard tests.
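As an illustration of how checks can be generated without dataset-specific configuration, the sketch below emits generic SQL for a few of the listed checks (row counts, null counts, duplicate rows) from nothing more than a table name and a column list. The function name and the example table are hypothetical; the actual tool implements many more checks and reports the results via RMarkdown.

```python
# Hypothetical sketch: generate generic data quality SQL for any table,
# given only its name and column list.
def generic_check_queries(table, columns):
    """Return a dict of check-name -> SQL statement for one table."""
    col_list = ", ".join(columns)
    null_sums = ", ".join(
        f"SUM(CASE WHEN {c} IS NULL THEN 1 ELSE 0 END) AS {c}_nulls"
        for c in columns
    )
    return {
        "row_count": f"SELECT COUNT(*) AS n FROM {table}",
        "null_counts": f"SELECT {null_sums} FROM {table}",
        "duplicate_rows": (
            f"SELECT COUNT(*) AS duplicate_groups FROM "
            f"(SELECT {col_list}, COUNT(*) AS k FROM {table} "
            f"GROUP BY {col_list} HAVING COUNT(*) > 1) d"
        ),
    }


# Example with an invented table and columns.
for name, sql in generic_check_queries(
    "gp_event", ["patient_id", "event_code", "event_date"]
).items():
    print(name, "->", sql)
```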
Objectives
The SAIL Databank brings together a range of datasets gathered primarily for administrative rather than research purposes. These datasets contain information on different aspects of an individual's contact with services, which when combined form a detailed health record for individuals living (or deceased) in Wales. Understanding the quality of data in SAIL supports the research process by providing a level of assurance about the robustness of the data, and by identifying and describing potential sources of bias due to invalid, incomplete, inconsistent or inaccurate data, thereby helping to increase the accuracy of research using these data. Designing processes to investigate and report on data quality within and between multiple datasets can be time-consuming; it requires a high degree of effort to ensure the output is genuinely meaningful and useful to SAIL users, and may require a range of different approaches.

Approach
Data quality tests for each dataset were written, considering a range of data quality dimensions including validity, consistency, accuracy and completeness. Tests were designed to capture not just the quality of data within each dataset, but also the consistency of data items between datasets. SQL scripts were written to test each of these aspects; to minimise repetition, automated processes were implemented where appropriate. Batch automation was used to call SQL stored procedures, which utilise metadata to generate dynamic SQL. The metadata (created as part of the data quality process) describe each dataset and the measurement parameters used to assess each field within it. However, automation on its own is insufficient, and data quality process outputs require scrutiny and oversight to ensure they are actually capturing what they set out to do. SAIL users were consulted on the development of the data quality reports to ensure usability and appropriateness for supporting data utilisation in research.

Results
The data quality reporting process benefits the SAIL Databank by providing additional information to support the research process and, in some cases, by acting as a diagnostic tool that detects problems with data which can then be rectified.

Conclusion
The development of data quality processes in SAIL is ongoing, and changes or developments in each dataset lead to new requirements for data quality measurement and reporting. A vital component of the process is the production of output that is genuinely meaningful and useful.
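A minimal sketch of the metadata-driven idea is given below: each metadata record describes a field and the parameters used to assess it, and a dynamic SQL check is generated from that description. The metadata layout, field names and check types are invented for the example, and the SAIL process itself uses SQL stored procedures rather than Python.

```python
# Hypothetical metadata describing how each field should be assessed.
FIELD_METADATA = [
    {"table": "gp_event", "field": "event_date", "check": "range",
     "low": "'2000-01-01'", "high": "CURRENT_DATE"},
    {"table": "gp_event", "field": "event_code", "check": "lookup",
     "lookup_table": "read_code_lookup", "lookup_field": "read_code"},
]


def build_check_sql(meta):
    """Turn one metadata record into a SQL statement counting invalid values."""
    if meta["check"] == "range":
        condition = f"{meta['field']} NOT BETWEEN {meta['low']} AND {meta['high']}"
    elif meta["check"] == "lookup":
        condition = (f"{meta['field']} NOT IN "
                     f"(SELECT {meta['lookup_field']} FROM {meta['lookup_table']})")
    else:
        raise ValueError(f"unknown check type: {meta['check']}")
    return f"SELECT COUNT(*) AS invalid_rows FROM {meta['table']} WHERE {condition}"


for m in FIELD_METADATA:
    print(build_check_sql(m))
```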