McMurry et al. assess the real-world safety of the BNT162b2 and mRNA-1273 COVID-19 vaccines. Using natural language processing, they compare the rates of specified adverse effects between 68,266 vaccinated individuals and 68,266 matched unvaccinated individuals. They find that both vaccines are safe and tolerated in clinical practice.
Technology to generate single cell RNA-sequencing (scRNA-seq) datasets and tools to annotate them have advanced rapidly in the past several years. Such tools generally rely on existing transcriptomic datasets or curated databases of cell type-defining genes, while the application of scalable natural language processing (NLP) methods to enhance analysis workflows has not been adequately explored. Here we deployed an NLP framework to objectively quantify associations between a comprehensive set of over 20,000 human protein-coding genes and over 500 cell type terms across over 26 million biomedical documents. The resultant gene-cell type associations (GCAs) are significantly stronger between a curated set of matched cell type-marker pairs than the complementary set of mismatched pairs (Mann-Whitney p = 6.15 × 10^-76, r = 0.24; Cohen’s d = 2.6). Building on this, we developed an augmented annotation algorithm (single cell Annotation via Literature Encoding, or scALE) that leverages GCAs to categorize cell clusters identified in scRNA-seq datasets, and we tested its ability to predict the cellular identity of 133 clusters from nine datasets of human breast, colon, heart, joint, ovary, prostate, skin, and small intestine tissues. With the optimized settings, the true cellular identity matched the top prediction in 59% of tested clusters and was present among the top five predictions for 91% of clusters. scALE slightly outperformed an existing method for reference data-driven automated cluster annotation, and we demonstrate that integration of scALE can meaningfully improve the annotations derived from such methods. Further, contextualization of differential expression analyses with these GCAs highlights poorly characterized markers of well-studied cell types, such as CLIC6 and DNASE1L3 in retinal pigment epithelial cells and endothelial cells, respectively.
Taken together, this study illustrates for the first time how the systematic application of a literature-derived knowledge graph can expedite and enhance the annotation and interpretation of scRNA-seq data.
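The top-prediction and top-five figures quoted above are top-k accuracies over annotated clusters. The following is a minimal sketch of how such a metric can be computed, not the authors' evaluation code: the function name `top_k_accuracy`, the cluster labels, and the ranked predictions are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    true_labels: Sequence[str],
    k: int,
) -> float:
    """Fraction of clusters whose true label is among the first k ranked predictions."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(ranked_predictions) != len(true_labels):
        raise ValueError("need exactly one ranked prediction list per cluster")
    if not true_labels:
        raise ValueError("need at least one cluster")
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Hypothetical ranked cell type predictions for four clusters (best guess first).
ranked = [
    ["T cell", "NK cell", "B cell"],
    ["hepatocyte", "cholangiocyte", "endothelial cell"],
    ["B cell", "plasma cell", "T cell"],
    ["fibroblast", "pericyte", "endothelial cell"],
]
# Hypothetical manually assigned identities for the same clusters.
truth = ["T cell", "cholangiocyte", "T cell", "endothelial cell"]

for k in (1, 2, 3):
    print(f"top-{k} accuracy: {top_k_accuracy(ranked, truth, k):.2f}")
```

Reporting both k = 1 and k = 5 separates cases where the method is simply wrong from cases where the correct identity is ranked close to the top.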
As the COVID-19 vaccination campaign unfolds as one of the most rapid and widespread in history, it is important to continuously assess the real-world safety of the FDA-authorized vaccines. Curation from large-scale electronic health records (EHRs) allows for near real-time safety evaluations that were not previously possible. Here, we advance context- and sentiment-aware deep neural networks over the multi-state Mayo Clinic enterprise (Minnesota, Arizona, Florida, Wisconsin) for automatically curating the adverse effects mentioned by physicians in over 108,000 EHR clinical notes between December 1, 2020 and February 8, 2021. We retrospectively compared the clinical notes of 31,069 individuals who received at least one dose of the Pfizer/BioNTech or Moderna vaccine to those of 31,069 unvaccinated individuals who were propensity matched by demographics, residential location, and history of prior SARS-CoV-2 testing. We find that vaccinated and unvaccinated individuals were seen in the clinic at similar rates within 21 days of the first or second actual or assigned vaccination dose (first dose Odds Ratio = 1.13, 95% CI: 1.09-1.16; second dose Odds Ratio = 0.89, 95% CI: 0.84-0.93). Further, the incidence rates of all surveyed adverse effects were similar or lower in vaccinated individuals compared to unvaccinated individuals after either vaccine dose. Finally, the most frequently documented adverse effects within 7 days of each vaccine dose were fatigue (Dose 1: 1.77%; Dose 2: 1.2%), nausea (Dose 1: 1.05%; Dose 2: 0.84%), myalgia (Dose 1: 0.67%; Dose 2: 0.66%), diarrhea (Dose 1: 0.67%; Dose 2: 0.46%), arthralgia (Dose 1: 0.64%; Dose 2: 0.57%), erythema (Dose 1: 0.59%; Dose 2: 0.46%), vomiting (Dose 1: 0.45%; Dose 2: 0.29%) and fever (Dose 1: 0.29%; Dose 2: 0.23%).
These remarkably low frequencies of adverse effects recorded in EHRs versus those derived from active solicitation during clinical trials (arthralgia: 24-46%; erythema: 9.5-14.7%; myalgia: 38-62%; fever: 14.2-15.5%) emphasize the rarity of vaccine-associated adverse effects requiring clinical attention. This rapid and timely analysis of vaccine-related adverse effects from contextually rich EHR notes of 62,138 individuals, which was enabled through a large scale Artificial Intelligence (AI)-powered platform, reaffirms the safety and tolerability of the FDA-authorized COVID-19 vaccines in practice.
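For readers unfamiliar with the odds ratios and 95% confidence intervals reported above, the sketch below shows one common way to derive them from a 2×2 table of matched cohorts. The abstract does not say how the intervals were computed, so the Wald (log-scale) interval used here is an assumption, and the function name `odds_ratio_wald_ci` and all counts are hypothetical.

```python
import math
from typing import Tuple


def odds_ratio_wald_ci(
    exposed_events: int,
    exposed_nonevents: int,
    unexposed_events: int,
    unexposed_nonevents: int,
    z: float = 1.959964,  # standard normal quantile for a two-sided 95% interval
) -> Tuple[float, float, float]:
    """Return (odds ratio, lower bound, upper bound) using a Wald interval on the log scale."""
    cells = (exposed_events, exposed_nonevents, unexposed_events, unexposed_nonevents)
    if min(cells) <= 0:
        raise ValueError("all four cells must be positive for the Wald interval")
    odds_ratio = (exposed_events * unexposed_nonevents) / (
        exposed_nonevents * unexposed_events
    )
    # Standard error of log(OR) is the square root of the summed reciprocal counts.
    se_log_or = math.sqrt(sum(1.0 / c for c in cells))
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: clinic visits among 1,000 vaccinated and 1,000 unvaccinated people.
or_, lo, hi = odds_ratio_wald_ci(200, 800, 100, 900)
print(f"OR = {or_:.2f}, 95% CI: {lo:.2f}-{hi:.2f}")
```

An interval that excludes 1 indicates a difference in odds between the two cohorts at the chosen confidence level; whether that difference is clinically meaningful is a separate question.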
Technology to generate single cell RNA-sequencing (scRNA-seq) datasets and tools to annotate them have advanced rapidly in the past several years. Such tools generally rely on existing transcriptomic datasets or curated databases of cell type-defining genes, while the application of scalable natural language processing (NLP) methods to enhance analysis workflows has not been adequately explored. Here we deployed an NLP framework to objectively quantify associations between a comprehensive set of over 20,000 human protein-coding genes and over 500 cell type terms across over 26 million biomedical documents. The resultant gene-cell type associations (GCAs) are significantly stronger between a curated set of matched cell type-marker pairs than the complementary set of mismatched pairs (Mann-Whitney p < 6.15 × 10^-76, r = 0.24; Cohen's d = 2.6). Building on this, we developed an augmented annotation algorithm that leverages GCAs to categorize cell clusters identified in scRNA-seq datasets, and we tested its ability to predict the cellular identity of 185 clusters in 13 datasets from human blood, pancreas, lung, liver, kidney, retina, and placenta. With the optimized settings, the true cellular identity matched the top prediction in 66% of tested clusters and was present among the top five predictions for 94% of clusters. Further, contextualization of differential expression analyses with these GCAs highlights poorly characterized markers of established cell types, such as CLIC6 and DNASE1L3 in retinal pigment epithelial cells and endothelial cells, respectively. Taken together, this study illustrates for the first time how the systematic application of a literature-derived knowledge graph can expedite and enhance the annotation and interpretation of scRNA-seq data.
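The comparison of matched versus mismatched cell type-marker pairs rests on a rank-based test and a standardized effect size. As a rough illustration only, the sketch below computes the Mann-Whitney U statistic and Cohen's d (pooled standard deviation) for two small groups of scores. The function names and the association scores are hypothetical, the p-value is not computed, and nothing here reproduces the authors' pipeline.

```python
import math
from statistics import mean, variance
from typing import Sequence


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> float:
    """U statistic for x: pairs where x exceeds y, with ties counted as one half."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


def cohens_d(x: Sequence[float], y: Sequence[float]) -> float:
    """Standardized mean difference using the pooled sample standard deviation."""
    nx, ny = len(x), len(y)
    if nx < 2 or ny < 2:
        raise ValueError("each group needs at least two observations")
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; effect size is undefined")
    return (mean(x) - mean(y)) / math.sqrt(pooled_var)


# Hypothetical association scores for matched and mismatched gene-cell type pairs.
matched = [3.0, 4.0, 5.0]
mismatched = [1.0, 2.0, 3.0]

u = mann_whitney_u(matched, mismatched)
print(f"U = {u}, share of pairs favoring matched = {u / (len(matched) * len(mismatched)):.2f}")
print(f"Cohen's d = {cohens_d(matched, mismatched):.2f}")
```

The pairwise loop is quadratic in group size and is written for clarity; for real data a library routine such as SciPy's `scipy.stats.mannwhitneyu` would be the usual choice.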