Word embeddings are a popular approach to unsupervised learning of word relationships and are widely used in natural language processing. In this article, we present a new set of embeddings for medical concepts learned using an extremely large collection of multimodal medical data. Leaning on recent theoretical insights, we demonstrate how an insurance claims database of 60 million members, a collection of 20 million clinical notes, and 1.7 million full-text biomedical journal articles can be combined to embed concepts into a common space, resulting in the largest-ever set of embeddings, covering 108,477 medical concepts. To evaluate our approach, we present a new benchmark methodology based on statistical power specifically designed to test embeddings of medical concepts. Our approach, called cui2vec, attains state-of-the-art performance relative to previous methods in most instances. Finally, we provide a downloadable set of pre-trained embeddings for other researchers to use, as well as an online tool for interactive exploration of the cui2vec embeddings.
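As a minimal sketch of how embeddings in a common space are typically compared, the snippet below ranks concepts by cosine similarity. The concept names and three-dimensional vectors are hypothetical toy values chosen for illustration; they are not taken from the cui2vec release, and the abstract does not state that this is the authors' evaluation procedure.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (assumption: real concept vectors would be
# loaded from a pre-trained file and have many more dimensions).
toy_embeddings = {
    "concept_a": [0.9, 0.1, 0.0],
    "concept_b": [0.8, 0.2, 0.1],
    "concept_c": [0.0, 0.1, 0.9],
}


def nearest(name, embeddings):
    """Return the other concept whose vector is most similar to `name`."""
    query = embeddings[name]
    scored = [
        (other, cosine_similarity(query, vector))
        for other, vector in embeddings.items()
        if other != name
    ]
    return max(scored, key=lambda pair: pair[1])[0]
```

Calling `nearest("concept_a", toy_embeddings)` picks the concept whose vector points in the most similar direction, which is the usual way a nearest-neighbor lookup over concept embeddings is performed.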
Background Admissions are generally classified as COVID-19 hospitalizations if the patient has a positive SARS-CoV-2 polymerase chain reaction (PCR) test. However, because 35% of SARS-CoV-2 infections are asymptomatic, patients admitted for unrelated indications with an incidentally positive test could be misclassified as a COVID-19 hospitalization. Electronic health record (EHR)–based studies have been unable to distinguish between a hospitalization specifically for COVID-19 and an incidental SARS-CoV-2 hospitalization. Although the need to improve classification of COVID-19 versus incidental SARS-CoV-2 is well understood, the magnitude of the problem has only been characterized in small, single-center studies. Furthermore, there have been no peer-reviewed studies evaluating methods for improving classification.
Objective The aims of this study are, first, to quantify the frequency of incidental hospitalizations over the first 15 months of the pandemic in multiple hospital systems in the United States and, second, to apply electronic phenotyping techniques to automatically improve COVID-19 hospitalization classification.
Methods From a retrospective EHR-based cohort in 4 US health care systems in Massachusetts, Pennsylvania, and Illinois, a random sample of 1123 SARS-CoV-2 PCR-positive patients hospitalized from March 2020 to August 2021 was manually chart-reviewed and classified as either “admitted with COVID-19” (incidental) or “for COVID-19” (admitted specifically for COVID-19). EHR-based phenotyping was used to find feature sets to filter out incidental admissions.
Results EHR-based phenotyped feature sets filtered out incidental admissions, which occurred in an average of 26% of hospitalizations (although this varied widely over time, from 0% to 75%). The top site-specific feature sets had 79%-99% specificity with 62%-75% sensitivity, while the best-performing across-site feature sets had 71%-94% specificity with 69%-81% sensitivity.
Conclusions A large proportion of SARS-CoV-2 PCR-positive admissions were incidental. Straightforward EHR-based phenotypes differentiated admissions, which is important to assure accurate public health reporting and research.
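The specificity and sensitivity figures reported above follow the standard definitions, which can be sketched as below. The choice of positive class ("for COVID-19" admissions, as determined by chart review) and the counts in the usage example are illustrative assumptions, not values from the study.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts.

    Assumed labeling for this sketch (not stated in the abstract):
      tp: "for COVID-19" admissions the feature set keeps
      fn: "for COVID-19" admissions the feature set wrongly filters out
      tn: incidental admissions the feature set filters out
      fp: incidental admissions the feature set wrongly keeps
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("each class needs at least one chart-reviewed case")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical counts for illustration only.
example_sensitivity, example_specificity = sensitivity_specificity(
    tp=80, fn=20, tn=45, fp=5
)
```

With these toy counts, 80 of 100 true "for COVID-19" admissions are kept and 45 of 50 incidental admissions are filtered out.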
The risk profiles of post-acute sequelae of COVID-19 (PASC) have not been well characterized in multi-national settings with appropriate controls. We leveraged electronic health record (EHR) data from 277 international hospitals representing 414,602 patients with COVID-19, 2.3 million control patients without COVID-19 in the inpatient and outpatient settings, and over 221 million diagnosis codes to systematically identify new-onset conditions enriched among patients with COVID-19 during the post-acute period. Compared to inpatient controls, inpatient COVID-19 cases were at significant risk for angina pectoris (RR 1.30, 95% CI 1.09–1.55), heart failure (RR 1.22, 95% CI 1.10–1.35), cognitive dysfunctions (RR 1.18, 95% CI 1.07–1.31), and fatigue (RR 1.18, 95% CI 1.07–1.30). Relative to outpatient controls, outpatient COVID-19 cases were at risk for pulmonary embolism (RR 2.10, 95% CI 1.58–2.76), venous embolism (RR 1.34, 95% CI 1.17–1.54), atrial fibrillation (RR 1.30, 95% CI 1.13–1.50), type 2 diabetes (RR 1.26, 95% CI 1.16–1.36) and vitamin D deficiency (RR 1.19, 95% CI 1.09–1.30). Outpatient COVID-19 cases were also at risk for loss of smell and taste (RR 2.42, 95% CI 1.90–3.06), inflammatory neuropathy (RR 1.66, 95% CI 1.21–2.27), and cognitive dysfunction (RR 1.18, 95% CI 1.04–1.33). The incidence of post-acute cardiovascular and pulmonary conditions decreased across time among inpatient cases while the incidence of cardiovascular, digestive, and metabolic conditions increased among outpatient cases. Our study, based on a federated international network, systematically identified robust conditions associated with PASC compared to control groups, underscoring the multifaceted cardiovascular and neurological phenotype profiles of PASC.
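For readers unfamiliar with how a relative risk (RR) and its 95% CI are read, the following is a minimal sketch of the textbook unadjusted calculation using the log-scale (Wald) interval. The abstract does not describe the estimation procedure actually used, so this is an assumption for illustration and will not reproduce the reported values; the counts in the usage example are hypothetical.

```python
import math


def relative_risk(cases_exposed, n_exposed, cases_control, n_control, z=1.96):
    """Unadjusted relative risk with a log-scale Wald confidence interval.

    Returns (rr, lower, upper). z=1.96 corresponds to a 95% interval.
    Requires at least one event in each group.
    """
    if cases_exposed == 0 or cases_control == 0:
        raise ValueError("log-scale interval needs a non-zero count per group")
    risk_exposed = cases_exposed / n_exposed
    risk_control = cases_control / n_control
    rr = risk_exposed / risk_control
    # Standard error of log(RR) for two independent binomial samples.
    se_log_rr = math.sqrt(
        1 / cases_exposed - 1 / n_exposed + 1 / cases_control - 1 / n_control
    )
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Hypothetical counts: 30 new-onset cases among 1000 COVID-19 patients
# versus 20 among 1000 controls.
rr, lower, upper = relative_risk(30, 1000, 20, 1000)
```

An interval whose lower bound exceeds 1, as in the results listed above, indicates that the elevated risk is distinguishable from no difference at the chosen confidence level; in this toy example the interval spans 1, so it would not be.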
The International Classification of Diseases (ICD)-10 code (U09.9) for post-acute sequelae of COVID-19 (PASC) was introduced in October 2021. As researchers seek to leverage this billing code in large-scale real-world studies of PASC, it is of utmost importance to understand how healthcare providers use the code in practice and the clinical characteristics of patients who have been assigned it. To this end, we operationalized clinical case definitions of PASC using World Health Organization (WHO) and Centers for Disease Control and Prevention guidelines. We then chart-reviewed 300 patients with COVID-19 from three participating healthcare systems of the 4CE Consortium who were assigned the U09.9 code. Chart review results showed that the average positive predictive value (PPV) of the U09.9 code ranged from 40.2% to 65.4% depending on which definition of PASC was used in the evaluation. The PPV of the U09.9 code also fluctuated significantly between calendar time periods. We demonstrated the potential utility of textual data extracted using natural language processing techniques to more comprehensively capture symptoms associated with PASC from electronic health record data. Finally, we investigated the utilization of long COVID clinics in the cohort of patients. We observed that, on average, only 24.0% of patients with the U09.9 code visited a long COVID clinic. Among patients who met the WHO PASC definition, only 35.6% on average visited a long COVID clinic.
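The PPV reported above is the share of code-assigned patients whom chart review confirms as true cases under a given definition. A minimal sketch of that arithmetic follows; the definition labels and counts are hypothetical and are not the study's data.

```python
def positive_predictive_value(true_positives, false_positives):
    """PPV = confirmed cases / all patients flagged by the code."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("no flagged patients to evaluate")
    return true_positives / flagged


# Hypothetical chart-review tallies for 100 patients carrying the code,
# evaluated against two illustrative case definitions.
toy_reviews = {
    "definition_1": {"confirmed": 40, "not_confirmed": 60},
    "definition_2": {"confirmed": 65, "not_confirmed": 35},
}

toy_ppv = {
    name: positive_predictive_value(counts["confirmed"], counts["not_confirmed"])
    for name, counts in toy_reviews.items()
}
```

Because the numerator depends on which case definition the reviewer applies, the same set of coded patients can yield different PPVs, which is why the abstract reports a range.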