Publicly available RNA-sequencing (RNA-seq) data are a rich resource for elucidating the mechanisms of human disease; however, preprocessing these data requires considerable bioinformatic expertise and computational infrastructure. Analyzing multiple datasets with a consistent computational workflow increases the accuracy of downstream meta-analyses. This collection of datasets represents the human intracellular transcriptional response to disorders and diseases such as acute lymphoblastic leukemia (ALL), B-cell lymphomas, chronic obstructive pulmonary disease (COPD), colorectal cancer, and lupus erythematosus, as well as to infection with pathogens including Borrelia burgdorferi, hantavirus, influenza A virus, Middle East respiratory syndrome coronavirus (MERS-CoV), Streptococcus pneumoniae, respiratory syncytial virus (RSV), severe acute respiratory syndrome coronavirus (SARS-CoV), and severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). We calculated the statistically significant differentially expressed genes and Gene Ontology terms for all datasets. In addition, a subset of the datasets includes results from splice variant analyses, intracellular signaling pathway enrichments, and read mapping and quantification. All analyses were performed using well-established algorithms and are provided to facilitate future data mining activities and wet lab studies, and to accelerate collaboration and discovery.
Defining the human factors associated with severe versus mild SARS-CoV-2 infection is of increasing interest. Mining large numbers of public gene expression datasets is an effective way to identify genes that contribute to a given phenotype. Combining RNA-sequencing data with the associated clinical metadata describing disease severity can enable earlier identification of patients who are at higher risk of developing severe COVID-19. We consequently identified 356 public RNA-seq human transcriptome samples from the Gene Expression Omnibus database that had disease severity metadata. We then subjected these samples to a robust RNA-seq data processing workflow to quantify gene expression in each patient. This workflow used Salmon to map the reads to the reference transcriptomes, edgeR to calculate significant differential expression levels, and Camera to perform Gene Ontology enrichment. We then applied a machine learning algorithm to the read count data to identify the features that best differentiated samples by COVID-19 severity phenotype. Ultimately, we produced a list of genes ranked by their Gini importance values; it includes GIMAP7 and S1PR2, which are associated with immunity and inflammation, respectively. Our results show that these two genes can potentially identify people with severe COVID-19 with up to ~90% accuracy. We expect that our findings can contribute to the development of improved prognostics for severe COVID-19.
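The abstract above does not name the machine learning algorithm, only that genes were ranked by Gini importance, a measure usually obtained from tree-based models such as random forests. As a rough illustration of the underlying idea, and not the authors' method, the sketch below scores each gene by the largest Gini impurity decrease that a single expression threshold can achieve when separating mild from severe samples. The function names (`gini_impurity`, `best_split_gain`, `rank_genes`), the gene labels `gene_a` and `gene_b`, and all count values are hypothetical and chosen only for demonstration.

```python
from typing import Dict, List, Sequence, Tuple


def gini_impurity(labels: Sequence[int]) -> float:
    """Gini impurity of binary labels (assumed coding: 0 = mild, 1 = severe)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_split_gain(values: Sequence[float], labels: Sequence[int]) -> float:
    """Largest Gini impurity decrease from one threshold on a gene's counts."""
    n = len(values)
    parent = gini_impurity(labels)
    best = 0.0
    # Candidate thresholds: midpoints between consecutive distinct values.
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2.0
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (
            len(left) * gini_impurity(left) + len(right) * gini_impurity(right)
        ) / n
        best = max(best, parent - weighted)
    return best


def rank_genes(
    counts: Dict[str, Sequence[float]], labels: Sequence[int]
) -> List[Tuple[str, float]]:
    """Rank genes by their single-split Gini gain, highest first."""
    gains = {gene: best_split_gain(vals, labels) for gene, vals in counts.items()}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up toy data: six samples, three mild (0) and three severe (1).
    severity = [0, 0, 0, 1, 1, 1]
    toy_counts = {
        "gene_a": [5, 7, 6, 50, 60, 55],     # separates the two groups cleanly
        "gene_b": [10, 40, 12, 11, 42, 13],  # barely informative
    }
    for gene, gain in rank_genes(toy_counts, severity):
        print(f"{gene}\t{gain:.3f}")
```

A random forest averages impurity decreases of this kind over many splits and many trees, so the ranking from a full model can differ from this single-split score. With balanced binary labels, the largest possible gain is 0.5.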