We present single-cell interpretation via multikernel learning (SIMLR), an analytic framework and software which learns a similarity measure from single-cell RNA-seq data in order to perform dimension reduction, clustering and visualization. On seven published data sets, we benchmark SIMLR against state-of-the-art methods. We show that SIMLR is scalable and greatly enhances clustering performance while improving the visualization and interpretability of single-cell sequencing data.
Supplementary data are available at Bioinformatics online.
Recurrent successions of genomic changes, both within and between patients, reflect repeated evolutionary processes that are valuable for the anticipation of cancer progression. Multi-region sequencing allows the temporal order of some genomic changes in a tumor to be inferred, but the robust identification of repeated evolution across patients remains a challenge. We developed a machine-learning method based on transfer learning that allowed us to overcome the stochastic effects of cancer evolution and noise in data and identified hidden evolutionary patterns in cancer cohorts. When applied to multi-region sequencing datasets from lung, breast, renal, and colorectal cancer (768 samples from 178 patients), our method detected repeated evolutionary trajectories in subgroups of patients, which were reproduced in single-sample cohorts (n = 2,935). Our method provides a means of classifying patients on the basis of how their tumor evolved, with implications for the anticipation of disease progression.
Single-cell RNA-seq technologies enable high throughput gene expression measurement of individual cells, and allow the discovery of heterogeneity within cell populations. Measurement of cell-to-cell gene expression similarity is critical to identification, visualization and analysis of cell populations. However, single-cell data introduce challenges to conventional measures of gene expression similarity because of the high level of noise, outliers and dropouts. Here, we propose a novel similarity-learning framework, SIMLR (single-cell interpretation via multikernel learning), which learns an appropriate distance metric from the data for dimension reduction, clustering and visualization applications. Benchmarking against state-of-the-art methods for these applications, we used SIMLR to re-analyse seven representative single-cell data sets, including high-throughput droplet-based data sets with tens of thousands of cells. We show that SIMLR greatly improves clustering sensitivity and accuracy, as well as the visualization and interpretability of the data. Background Single-cell RNA sequencing (scRNA-seq) technologies have recently emerged as a powerful means to measure gene expression levels of individual cells and to reveal previously unknown heterogeneity and functional diversity among cell populations [1]. Quantifying the variation across gene expression profiles of individual cells is key to the identification and analysis of complex cell populations that arise in neurology [2], immunology [3], oncology [4] and developmental biology [5]. The heterogeneity identified across individual cells can answer questions irresolvable by traditional ensemble-based methods, where gene expression measurements are averaged over a population of cells pooled together [6], [7]. Recent studies have demonstrated that de novo cell type discovery and identification of functionally distinct cell subpopulations are possible via unbiased analysis of all transcriptomic information provided by scRNA-seq data [8]. Therefore, unsupervised clustering of individual cells using scRNA-seq data is critical to developing new biological insights and validating prior knowledge. Many existing single-cell studies employ computational and statistical methods that have been developed primarily for analysis of data from traditional bulk RNA-seq methods [8]-[10]. These methods do not address the unique characteristics that make single-cell expression data especially challenging to analyze: outlier cell populations, transcript amplification noise, and biological effects such as the cell cycle [11]. In addition, it has been shown that many statistical methods fail to alleviate other underlying challenges, such as dropout events, where zero expression measurements occur due to sampling or stochastic transcriptional activities [12]. Recently, new single-cell platforms such as DropSeq [13], InDrop [14] and GemCode single-cell technology [15] have enabled a dramatic increase in throughput to thousands of cells. These platforms have adapted recent sequencing p...
To dissect the mechanisms underlying the inflation of variants in the Severe Acute Respiratory Syndrome CoronaVirus 2 (SARS-CoV-2) genome, we present a largescale analysis of intra-host genomic diversity, which reveals that most samples exhibit heterogeneous genomic architectures, due to the interplay between host-related mutational processes and transmission dynamics. The decomposition of minor variants profiles unveils three non-overlapping mutational signatures related to nucleotide substitutions and likely ruled by APOlipoprotein B Editing Complex (APOBEC), Reactive Oxygen Species (ROS), and Adenosine Deaminase Acting on RNA (ADAR), highlighting heterogeneous host responses to SARS-CoV-2 infections. A corrected-for-signatures dN/dS analysis demonstrates that such mutational processes are affected by purifying selection, with important exceptions. In fact, several mutations appear to transit toward clonality, defining new clonal genotypes that increase the overall genomic diversity. Furthermore, the phylogenomic analysis shows the presence of homoplasies and supports the hypothesis of transmission of minor variants. This study paves the way for the integrated analysis of intra-host genomic diversity and clinical outcomes of SARS-CoV-2 infections.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.