Objective COVID-19 poses societal challenges that require expeditious data and knowledge sharing. Though organizational clinical data are abundant, these are largely inaccessible to outside researchers. Statistical, machine learning, and causal analyses are most successful with large-scale data beyond what is available in any given organization. Here, we introduce the National COVID Cohort Collaborative (N3C), an open science community focused on analyzing patient-level data from many centers. Methods The Clinical and Translational Science Award (CTSA) Program and scientific community created N3C to overcome technical, regulatory, policy, and governance barriers to sharing and harmonizing individual-level clinical data. We developed solutions to extract, aggregate, and harmonize data across organizations and data models, and created a secure data enclave to enable efficient, transparent, and reproducible collaborative analytics. Organized in inclusive workstreams, in two months we created: legal agreements and governance for organizations and researchers; data extraction scripts to identify and ingest positive, negative, and possible COVID-19 cases; a data quality assurance and harmonization pipeline to create a single harmonized dataset; population of the secure data enclave with data, machine learning, and statistical analytics tools; dissemination mechanisms; and a synthetic data pilot to democratize data access. Discussion The N3C has demonstrated that a multi-site collaborative learning health network can overcome barriers to rapidly build a scalable infrastructure incorporating multi-organizational clinical data for COVID-19 analytics. We expect this effort to save lives by enabling rapid collaboration among clinicians, researchers, and data scientists to identify treatments and specialized care and thereby reduce the immediate and long-term impacts of COVID-19. 
LAY SUMMARY COVID-19 poses societal challenges that require expeditious data and knowledge sharing. Though medical records are abundant, they are largely inaccessible to outside researchers. Statistical, machine learning, and causal research are most successful with large datasets beyond what is available in any given organization. Here, we introduce the National COVID Cohort Collaborative (N3C), an open science community focused on analyzing patient-level data from many clinical centers to reveal patterns in COVID-19 patients. To create N3C, the community had to overcome technical, regulatory, policy, and governance barriers to sharing patient-level clinical data. In less than 2 months, we developed solutions to acquire and harmonize data across organizations and created a secure data environment to enable transparent and reproducible collaborative research. We expect the N3C to help save lives by enabling collaboration among clinicians, researchers, and data scientists to identify treatments and specialized care needs and thereby reduce the immediate and long-term impacts of COVID-19.
Background In response to COVID-19, the informatics community united to aggregate as much clinical data as possible to characterize this new disease and reduce its impact through collaborative analytics. The National COVID Cohort Collaborative (N3C) is now the largest publicly available HIPAA limited dataset in US history with over 6.4 million patients and is a testament to a partnership of over 100 organizations. Methods We developed a pipeline for ingesting, harmonizing, and centralizing data from 56 contributing data partners using four federated Common Data Models. N3C data quality (DQ) review involves both automated and manual procedures. In the process, several DQ heuristics were discovered in our centralized context, both within the pipeline and during downstream project-based analysis. Feedback to the sites led to many local and centralized DQ improvements. Results Beyond well-recognized DQ findings, we discovered 15 heuristics relating to source CDM conformance, demographics, COVID tests, conditions, encounters, measurements, observations, coding completeness, and fitness for use. Of 56 sites, 37 (66%) demonstrated issues through these heuristics. These 37 sites demonstrated improvement after receiving feedback. Discussion We encountered site-to-site differences in DQ which would have been challenging to discover using federated checks alone. We have demonstrated that centralized DQ benchmarking reveals unique opportunities for data quality improvement that will support improved research analytics locally and in aggregate. Conclusion By combining rapid, continual assessment of DQ with a large volume of multi-site data, it is possible to support more nuanced scientific questions with the scale and rigor that they require.
Objectives To define pregnancy episodes and estimate gestational age within electronic health record (EHR) data from the National COVID Cohort Collaborative (N3C). Materials and Methods We developed a comprehensive approach, named Hierarchy and rule-based pregnancy episode Inference integrated with Pregnancy Progression Signatures (HIPPS), and applied it to EHR data in the N3C (1/1/2018-4/7/2022). HIPPS combines: 1) an extension of a previously published pregnancy episode algorithm, 2) a novel algorithm to detect gestational age-specific signatures of a progressing pregnancy for further episode support, and 3) pregnancy start date inference. Clinicians performed validation of HIPPS on a subset of episodes. We then generated pregnancy cohorts based on gestational age precision and pregnancy outcomes for assessment of accuracy and comparison of COVID-19 and other characteristics. Results We identified 628,165 pregnant persons with 816,471 pregnancy episodes, of which 52.3% were live births, 24.4% were other outcomes (stillbirth, ectopic pregnancy, abortions), and 23.3% had unknown outcomes. Clinician validation agreed 98.8% with HIPPS-identified episodes. We were able to estimate start dates within one week of precision for 475,433 (58.2%) episodes. 62,540 (7.7%) episodes had incident COVID-19 during pregnancy. Discussion HIPPS provides measures of support for pregnancy-related variables such as gestational age and pregnancy outcomes based on N3C data. Gestational age precision allows researchers to find time to events with reasonable confidence. Conclusion We have developed a novel and robust approach for inferring pregnancy episodes and gestational age that addresses data inconsistency and missingness in EHR data.
Lay Summary The National COVID Cohort Collaborative (N3C) provides researchers a unique opportunity to use electronic health record data from more than 12 million individuals from over seventy healthcare systems across the U.S. to study the impact of COVID-19 on pregnancy and women’s health. However, doing research with electronic health record data from different sources can be challenging as data can often be reported in many ways and formats. To address this challenge, we developed an approach known as Hierarchy and rule-based pregnancy episode Inference integrated with Pregnancy Progression Signatures (HIPPS) that can 1) find the start and end of a pregnancy, 2) infer whether the pregnancy resulted in a live birth or pregnancy loss, and 3) determine the gestational age at the end of pregnancy. We observed from a subset of data that our approach had high agreement with how clinicians would collect this information from electronic health records. When applying our approach on all the data in N3C, we identified 816K pregnancies from 628K individuals. Of these individuals, 62K had COVID-19 during pregnancy. Our research demonstrates that our HIPPS approach can enable COVID-19-related research in pregnancy with electronic health record data.
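The abstract above does not spell out how HIPPS infers pregnancy start dates, so the following is only a minimal sketch of the general idea of backing a start date out of a gestational age estimate. The function name `infer_start_window`, its parameters, and the notion of "precision" as the width of the plausible start window are assumptions made for illustration, not the published algorithm.

```python
from datetime import date, timedelta


def infer_start_window(end_date, ga_weeks_min, ga_weeks_max):
    """Hypothetical helper: bound the pregnancy start date.

    end_date      -- date of the pregnancy outcome (e.g. delivery)
    ga_weeks_min  -- smallest plausible gestational age at end_date, in weeks
    ga_weeks_max  -- largest plausible gestational age at end_date, in weeks

    Returns (earliest_start, latest_start, width_in_days).
    """
    if ga_weeks_min > ga_weeks_max:
        raise ValueError("ga_weeks_min must not exceed ga_weeks_max")
    # A larger gestational age implies an earlier start date.
    earliest_start = end_date - timedelta(weeks=ga_weeks_max)
    latest_start = end_date - timedelta(weeks=ga_weeks_min)
    width_in_days = (latest_start - earliest_start).days
    return earliest_start, latest_start, width_in_days


# Example: an outcome on 2021-06-01 with gestational age between 39 and 40 weeks.
earliest, latest, width = infer_start_window(date(2021, 6, 1), 39, 40)
within_one_week = width <= 7  # assumed reading of "within one week of precision"
```

Under this assumed definition, an episode whose gestational age is known to within one week yields a start-date window at most seven days wide.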
Institutions must decide how to manage the use of clinical data to support research while ensuring appropriate protections are in place. Questions about data use and sharing often go beyond what the Health Insurance Portability and Accountability Act of 1996 (HIPAA) considers. In this article, we describe our institution’s governance model and approach. Common questions we consider include: (1) Is a request limited to the minimum data necessary to carry the research forward? (2) What plans are there for sharing data externally? (3) What impact will the proposed use of data have on patients and the institution? In 2020, 302 of the 319 requests reviewed were approved. The majority of requests were approved in less than 2 weeks, with few or no stipulations. For the remaining requests, the governance committee works with researchers to find solutions to meet their needs while also addressing our collective goal of protecting patients.
Background Identifying individuals with a higher risk of developing severe COVID-19 outcomes will inform targeted or more intensive clinical monitoring and management. To date, there is mixed evidence regarding the impact of pre-existing autoimmune disease (AID) diagnosis and/or immunosuppressant (IS) exposure on developing severe COVID-19 outcomes. Methods A retrospective cohort of adults diagnosed with COVID-19 was created in the National COVID Cohort Collaborative enclave. Two outcomes, life-threatening disease and hospitalization, were evaluated using logistic regression models with and without adjustment for demographics and comorbidities. Results Of the 2,453,799 adults diagnosed with COVID-19, 191,520 (7.81%) had a pre-existing AID diagnosis and 278,095 (11.33%) had a pre-existing IS exposure. Logistic regression models adjusted for demographics and comorbidities demonstrated that individuals with a pre-existing AID (OR = 1.13, 95% CI 1.09 - 1.17; P < 0.001), IS (OR = 1.27, 95% CI 1.24 - 1.30; P < 0.001), or both (OR = 1.35, 95% CI 1.29 - 1.40; P < 0.001) were more likely to have life-threatening COVID-19 disease. These results were consistent when evaluating hospitalization. A sensitivity analysis evaluating specific IS revealed that TNF inhibitors were protective against life-threatening disease (OR = 0.80, 95% CI 0.66 - 0.96; P = 0.017) and hospitalization (OR = 0.80, 95% CI 0.73 - 0.89; P < 0.001). Conclusions Patients with pre-existing AID, exposure to IS, or both are more likely to have life-threatening disease or hospitalization. These patients may thus require tailored monitoring and preventative measures to minimize negative consequences of COVID-19.
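For readers less familiar with how odds ratios and their intervals relate to a logistic regression fit, the sketch below shows the standard Wald construction: the odds ratio is the exponential of the model coefficient, and the 95% interval is the exponential of the coefficient plus or minus 1.96 standard errors. The abstract does not state which interval method the authors used, so treating the reported intervals as Wald intervals is an assumption, and the helper names `wald_ci` and `se_from_ci` are hypothetical.

```python
import math

Z_95 = 1.96  # normal quantile for a two-sided 95% interval


def wald_ci(beta, se, z=Z_95):
    """Return (odds_ratio, lower, upper) for a log-odds coefficient `beta`
    with standard error `se`, using a Wald interval on the log scale."""
    return (
        math.exp(beta),
        math.exp(beta - z * se),
        math.exp(beta + z * se),
    )


def se_from_ci(lower, upper, z=Z_95):
    """Back out the log-scale standard error implied by a reported interval,
    assuming the interval was built as exp(beta +/- z * se)."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


# Round trip using the adjusted AID estimate quoted above
# (OR = 1.13, 95% CI 1.09 - 1.17).
se_aid = se_from_ci(1.09, 1.17)
or_aid, lo_aid, hi_aid = wald_ci(math.log(1.13), se_aid)
print(f"OR={or_aid:.2f}, CI=({lo_aid:.2f}, {hi_aid:.2f})")
```

Because a Wald interval is symmetric on the log scale, the geometric mean of the two reported bounds should sit close to the reported odds ratio, which gives a quick consistency check on published figures.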