Objective: To synthesize data quality (DQ) dimensions and assessment methods for real-world data, especially electronic health records, through a systematic scoping review, and to assess the practice of DQ assessment in the national Patient-Centered Clinical Research Network (PCORnet).

Materials and Methods: We started with 3 widely cited DQ publications (2 reviews, from Chan et al [2010] and Weiskopf et al [2013a], and 1 DQ framework, from Kahn et al [2016]) and systematically expanded our review to cover relevant articles published up to February 2020. We extracted DQ dimensions and assessment methods from these studies, mapped their relationships, and produced a synthesized summary of existing DQ dimensions and assessment methods. We also reviewed the data checks employed by PCORnet and mapped them to the synthesized DQ dimensions and methods.

Results: We analyzed a total of 3 reviews, 20 DQ frameworks, and 226 DQ studies, from which we extracted 14 DQ dimensions and 10 assessment methods. Completeness, concordance, and correctness/accuracy were the most commonly assessed dimensions. Element presence, validity check, and conformance were commonly used assessment methods and were the main focus of the PCORnet data checks.

Discussion: Definitions of DQ dimensions and methods were inconsistent across the literature, and DQ assessment practice was unevenly distributed (eg, usability and ease of use were rarely discussed). Given the complex and heterogeneous nature of real-world data, challenges in DQ assessment remain.

Conclusion: The practice of DQ assessment is still limited in scope. Future work is warranted to generate understandable, executable, and reusable DQ measures.
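The assessment methods named above (element presence, validity check, and conformance) can be illustrated with a short sketch. The records, field names, pattern, and plausibility range below are hypothetical; this is not PCORnet's actual check suite, only a minimal illustration of the three methods:

```python
# Illustrative DQ checks on hypothetical EHR records: element presence
# (completeness), conformance (expected format), and validity (plausible range).
import re

records = [
    {"patient_id": "P001", "birth_date": "1985-04-12", "heart_rate": 72},
    {"patient_id": "P002", "birth_date": None,         "heart_rate": 310},
    {"patient_id": "P003", "birth_date": "12/04/1985", "heart_rate": 68},
]

def element_presence(records, field):
    """Completeness: fraction of records with a non-missing value."""
    present = sum(1 for r in records if r.get(field) is not None)
    return present / len(records)

def conformance(records, field, pattern):
    """Conformance: fraction of present values matching an expected format."""
    values = [r[field] for r in records if r.get(field) is not None]
    return sum(1 for v in values if re.fullmatch(pattern, v)) / len(values)

def validity(records, field, lo, hi):
    """Validity: fraction of present values within a plausible range."""
    values = [r[field] for r in records if r.get(field) is not None]
    return sum(1 for v in values if lo <= v <= hi) / len(values)

print(element_presence(records, "birth_date"))                   # 2 of 3 present
print(conformance(records, "birth_date", r"\d{4}-\d{2}-\d{2}"))  # 1 of 2 conform
print(validity(records, "heart_rate", 30, 250))                  # 2 of 3 plausible
```

Each check returns a simple proportion, which is how such measures are typically reported and compared across fields or sites.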
Background: De-identification is a critical technology for facilitating the use of unstructured clinical text while protecting patient privacy and confidentiality. The clinical natural language processing (NLP) community has invested great effort in developing methods and corpora for de-identification of clinical notes. These annotated corpora are valuable resources for developing automated systems to de-identify clinical text at local hospitals. However, existing studies have often used training and test data collected from the same institution, and few studies have explored automated de-identification in cross-institute settings. The goal of this study is to examine deep learning-based de-identification methods in a cross-institute setting, identify the bottlenecks, and provide potential solutions.

Methods: We created a de-identification corpus using a total of 500 clinical notes from University of Florida (UF) Health, developed deep learning-based de-identification models using the 2014 i2b2/UTHealth corpus, and evaluated their performance on the UF corpus. We compared five word embeddings pre-trained on general English text, clinical text, and biomedical literature; explored lexical and linguistic features; and compared two strategies for customizing the deep learning models using UF notes and resources.

Results: Pre-trained word embeddings from a general English corpus achieved better performance than embeddings from de-identified clinical text and biomedical literature. The performance of deep learning models trained using only the i2b2 corpus dropped significantly (strict and relaxed F1 scores fell from 0.9547 and 0.9646 to 0.8568 and 0.8958) when applied to another corpus annotated at UF Health. Linguistic features further improved de-identification performance in cross-institute settings. After customizing the models using UF notes and resources, the best model achieved strict and relaxed F1 scores of 0.9288 and 0.9584, respectively.

Conclusions: It is necessary to customize de-identification models using local clinical text and other resources when applying them in cross-institute settings. Fine-tuning is a potential solution for reusing pre-trained parameters and reducing the time needed to customize deep learning-based de-identification models trained on a clinical corpus from a different institution.
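The strict and relaxed F1 scores reported above reflect two span-matching criteria commonly used in de-identification evaluation. The sketch below is an illustrative implementation under the usual assumption that strict matching requires exact character offsets plus entity type, while relaxed matching accepts any character overlap with the same type; the paper's exact matching rules may differ:

```python
# Strict vs. relaxed span matching for entity-level F1, on (start, end, type)
# tuples. Strict: exact offsets and type. Relaxed: any overlap, same type.

def matches(gold, pred, strict=True):
    g_start, g_end, g_type = gold
    p_start, p_end, p_type = pred
    if g_type != p_type:
        return False
    if strict:
        return (g_start, g_end) == (p_start, p_end)
    return p_start < g_end and g_start < p_end  # any character overlap

def f1(gold_spans, pred_spans, strict=True):
    tp_pred = sum(1 for p in pred_spans
                  if any(matches(g, p, strict) for g in gold_spans))
    precision = tp_pred / len(pred_spans) if pred_spans else 0.0
    tp_gold = sum(1 for g in gold_spans
                  if any(matches(g, p, strict) for p in pred_spans))
    recall = tp_gold / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [(0, 10, "NAME"), (20, 28, "DATE")]
pred = [(0, 10, "NAME"), (19, 28, "DATE")]   # DATE boundary off by one
print(f1(gold, pred, strict=True))   # 0.5: only NAME matches exactly
print(f1(gold, pred, strict=False))  # 1.0: the overlap suffices
```

The gap between the two scores quantifies how many errors are mere boundary disagreements rather than missed or spurious entities, which is why both are reported.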
Transgender and gender nonconforming (TGNC) individuals face significant marginalization, stigma, and discrimination. Under-reporting of TGNC individuals is common because they are often unwilling to self-identify. Meanwhile, the rapid adoption of electronic health record (EHR) systems has made large-scale, longitudinal real-world clinical data available for research, providing a unique opportunity to identify TGNC individuals from their EHRs and contributing to a promising approach to routine health surveillance. Building on existing work, we developed and validated a computable phenotype (CP) algorithm for identifying TGNC individuals and their natal sex (i.e., male-to-female or female-to-male) using both structured EHR data and unstructured clinical notes. Our CP algorithm achieved a 0.955 F1-score on the training data and a perfect F1-score on the independent testing data. Consistent with the literature, we observed an increasing percentage of TGNC individuals over time and a disproportionate burden of adverse health outcomes in this population, especially sexually transmitted infections and mental health distress.
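A computable phenotype of this kind typically combines structured diagnosis codes with keyword evidence from clinical notes. Below is a minimal rule-based sketch of that pattern; the code set and keyword list are illustrative placeholders, not the validated algorithm from the study:

```python
# Toy computable-phenotype rule: flag a patient if structured codes OR
# note keywords match. Codes/keywords below are illustrative only.

TGNC_CODES = {"F64.0", "F64.9", "Z87.890"}       # hypothetical code set
TGNC_KEYWORDS = ("transgender", "gender dysphoria", "gender nonconforming")

def is_tgnc(diagnosis_codes, notes):
    """Return True if any structured code or note keyword matches."""
    if TGNC_CODES & set(diagnosis_codes):
        return True
    text = " ".join(notes).lower()
    return any(kw in text for kw in TGNC_KEYWORDS)

print(is_tgnc(["E11.9"], ["Patient identifies as transgender."]))  # True
print(is_tgnc(["F64.0"], []))                                      # True
print(is_tgnc(["E11.9"], ["Routine follow-up visit."]))            # False
```

Combining both data types is what lets such algorithms catch patients whose status appears only in narrative text, which is one reason the study used unstructured notes alongside structured data.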
In this study, we examined a deep learning method for de-identification of clinical notes at UF Health under a cross-institute setting. We developed deep learning models using the 2014 i2b2/UTHealth corpus and evaluated their performance using clinical notes collected from UF Health. We compared four pre-trained word embeddings, including two from the general domain and two from the clinical domain. We also explored linguistic features (i.e., word shape and part-of-speech) to further improve de-identification performance. The experimental results show that the performance of deep learning models trained using the i2b2/UTHealth corpus dropped significantly (strict and relaxed F1 scores fell from 0.9547 and 0.9646 to 0.8360 and 0.8870) when applied to a corpus from a different institution (UF Health). Linguistic features, including word shape and part-of-speech, further improved cross-institute de-identification performance (to 0.8527 and 0.9052).
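The word-shape feature mentioned above can be sketched as follows. The idea is to map characters to class symbols so that surface patterns (e.g., ZIP-code-like or ID-like tokens) generalize across institutions; the exact mapping used in the study is not specified here, so the character classes and run compression below are assumptions:

```python
# Word-shape feature: uppercase -> X, lowercase -> x, digit -> d,
# other characters kept as-is; optionally collapse repeated runs.
import re

def word_shape(token, collapse=True):
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    s = "".join(shape)
    if collapse:  # compress runs: "Xxxxxxxxxxx" -> "Xx"
        s = re.sub(r"(.)\1+", r"\1", s)
    return s

print(word_shape("Gainesville"))   # "Xx"
print(word_shape("32610-0219"))    # "d-d"
print(word_shape("UF"))            # "X"
```

Because shapes abstract away institution-specific vocabulary, they are a plausible explanation for why such features help models transfer between hospitals.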
With vast amounts of patient medical information, electronic health records (EHRs) have become one of the most important data sources in biomedical and health care research. Effectively integrating data from multiple clinical sites can provide more generalizable, clinically meaningful real-world evidence. To analyze clinical data from multiple sites, distributed algorithms have been developed that protect patient privacy by avoiding the sharing of individual-level medical information. In this paper, we applied the One-shot Distributed Algorithm for the Cox proportional hazards model (ODAC) to longitudinal data from the OneFlorida Clinical Research Consortium to demonstrate the feasibility of implementing distributed algorithms in large research networks. We studied the associations between clinical risk factors and the onset of Alzheimer's disease and related dementias (ADRD) to advance our understanding of the complex risk factors of ADRD and ultimately improve the care of ADRD patients.
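The one-shot pattern behind distributed algorithms such as ODAC (each site shares only aggregate derivative summaries evaluated at a common initial value, never patient-level records) can be illustrated on a toy problem. ODAC itself constructs a surrogate Cox partial likelihood; the squared-error loss below is a stand-in chosen only so the example stays self-contained:

```python
# Toy one-shot distributed estimation: sites send (gradient, hessian) of a
# local loss at a shared initial value; the coordinator takes one Newton step.
# This is NOT ODAC's surrogate likelihood, just the communication pattern.

def local_summaries(data, beta0):
    """Each site returns (gradient, hessian) of its squared-error loss."""
    grad = sum(2 * (beta0 - x) for x in data)
    hess = 2 * len(data)
    return grad, hess

site_data = [[1.0, 2.0, 3.0], [4.0, 5.0], [0.5, 1.5, 2.5, 3.5]]
beta0 = 0.0  # shared initial estimate (e.g., from one site or a meta-analysis)

grads, hesses = zip(*(local_summaries(d, beta0) for d in site_data))
beta = beta0 - sum(grads) / sum(hesses)  # single Newton step at the coordinator

print(round(beta, 4))  # matches the pooled estimate, with no records shared
```

A single round of aggregate statistics suffices here, which is the appeal of one-shot methods in research networks where repeated communication across sites is costly.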