The high-dimensional data created by high-throughput technologies require visualization tools that reveal data structure and patterns in an intuitive form. We present PHATE, a visualization method that captures both local and global nonlinear structure using an information-geometric distance between datapoints. We compared PHATE to other tools on a variety of artificial and biological *
Single-cell RNA-sequencing (scRNA-seq) is a powerful tool to quantify transcriptional states in thousands to millions of cells. It is increasingly common for scRNA-seq data to be collected in multiple conditions to measure the effect of an experimental perturbation. However, quantifying differences between scRNA-seq datasets remains an analytical challenge. Previous efforts at quantifying such differences focus on discrete regions of the transcriptional state space such as clusters of cells. Here, we describe a continuous measure of the effect of an experiment across the transcriptomic space with single-cell resolution. First, we use the manifold assumption to model the cellular state space as a graph with cells as nodes and edges connecting cells with similar transcriptomic profiles. Next, we calculate an Enhanced Experimental Signal (EES) that estimates the likelihood of observing cells from each condition at every point in the manifold. We show that the EES has useful properties for analysis of single-cell perturbation studies. We show that we can use the magnitude and frequency of the EES, using an algorithm we call vertex frequency clustering, to identify specific populations of cells that are or are not affected by an experimental treatment at the appropriate level of granularity. Using these selected populations we can derive gene signatures of affected populations of cells. We demonstrate both algorithms using a combination of biological and synthetic datasets. Implementations are provided in the MELD Python package, which is available at https://github.com/KrishnaswamyLab/MELD.

Introduction

As single-cell RNA-sequencing (scRNA-seq) has become more accessible, the design of single-cell experiments has become increasingly complex. Researchers regularly use scRNA-seq to quantify the effect of a drug, gene knockout, or other experimental perturbation on a biological system. However, quantifying the...
In the era of 'Big Data' there is a pressing need for tools that provide human-interpretable visualizations of emergent patterns in high-throughput, high-dimensional data. Further, to enable insightful data exploration, such visualizations should faithfully capture and emphasize emergent structures and patterns without enforcing prior assumptions on the shape or form of the data. In this paper, we present PHATE (Potential of Heat-diffusion for Affinity-based Transition Embedding), an unsupervised low-dimensional embedding for visualization of data that is aimed at solving these issues. Unlike previous methods that are commonly used for visualization, such as PCA and tSNE, PHATE is able to capture and highlight both local and global structure in the data. In particular, in addition to clustering patterns, PHATE also uncovers and emphasizes progression and transitions (when they exist) in the data, which are often missed in other visualization-capable methods. Such patterns are especially important in biological data that contain, for example, single-cell phenotypes at different phases of differentiation, patients at different stages of disease progression, and gut microbial compositions that vary gradually between individuals, even of the same enterotype. The embedding provided by PHATE is based on a novel informational distance that captures long-range nonlinear relations in the data by computing energy potentials of data-adaptive diffusion processes.

bioRxiv preprint first posted online Mar. 24, 2017; doi: http://dx.doi.org/10.1101/120378. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder.
We demonstrate the effectiveness of the produced visualization in revealing insights on a wide variety of biomedical data, including single-cell RNA-sequencing, mass cytometry, gut microbiome sequencing, human SNP data, and Hi-C data, as well as non-biomedical data, such as Facebook network and facial image data. In order to validate the capability of PHATE to enable exploratory analysis, we generate a new dataset of 31,000 single cells from a human embryoid body differentiation system. Here, PHATE provides a comprehensive picture of the differentiation process, while visualizing major and minor branching trajectories in the data. We validate that all known cell types are recapitulated in the PHATE embedding in proper organization. Furthermore, the global picture of the system offered by PHATE allows us to connect parts of the developmental progression and characterize novel regulators associated with developmental lineages.
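The idea of a distance built from "energy potentials" of a diffusion process can be illustrated with a small sketch. This is not the PHATE package's implementation: the fixed-bandwidth Gaussian kernel, the parameter names (`sigma`, `t`, `eps`), and the use of classical MDS for the final embedding are all assumptions chosen to keep the example short.

```python
import numpy as np

def potential_distance_embedding(X, sigma=1.0, t=8, n_components=2, eps=1e-7):
    """Illustrative diffusion-potential embedding (assumed, simplified steps)."""
    # Pairwise squared Euclidean distances between rows of X
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    # Gaussian affinities (assumption: a single fixed bandwidth sigma)
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    # Row-normalize affinities into a Markov transition matrix
    P = K / K.sum(axis=1, keepdims=True)
    # Diffuse for t steps
    Pt = np.linalg.matrix_power(P, t)
    # Potential representation: negative log of the diffused probabilities
    U = -np.log(Pt + eps)
    # Squared distances between potential rows
    D2 = ((U[:, None, :] - U[None, :, :]) ** 2).sum(axis=-1)
    # Classical MDS: double-center the squared distances, then eigendecompose
    n = X.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D2 @ J
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:n_components]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))
```

Calling `potential_distance_embedding(X)` on an `(n_samples, n_features)` array returns an `(n_samples, 2)` array of coordinates. The log transform is what lets small transition probabilities between distant points contribute to the distance, which is one way to read the abstract's emphasis on long-range structure.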
Single-cell RNA-sequencing (scRNA-seq) is a powerful tool to quantify transcriptional states in thousands to millions of cells. It is increasingly common for scRNA-seq data to be collected in multiple experimental conditions, yet quantifying differences between scRNA-seq datasets remains an analytical challenge. Previous efforts at quantifying such differences focus on discrete regions of the transcriptional state space such as clusters of cells. Here, we describe a continuous measure of the effect of an experiment across the transcriptomic space. First, we use the manifold assumption to model the cellular state space as a graph (or network) with cells as nodes and edges connecting cells with similar transcriptomic profiles. Next, we create an Enhanced Experimental Signal (EES) that estimates the likelihood of observing cells from each condition at every point in the manifold. We show that the EES has useful properties and information that can be extracted. The EES can be used to identify how gene expression is affected by a given perturbation, including identifying non-monotonic changes from only two conditions. We also show that we can use both the magnitude and frequency of the EES, using an algorithm we call vertex frequency clustering, to derive subsets of cells at appropriate levels of granularity (tailored to areas that change) that are enriched in the experimental or control conditions or that are unaffected between conditions. We demonstrate both algorithms using a combination of biological and synthetic datasets. Implementations are provided in the MELD Python package, which is available at https://github.com/KrishnaswamyLab/MELD.

As single-cell RNA-sequencing (scRNA-seq) has become more accessible, the design of single-cell experiments has become increasingly complex. Researchers regularly use scRNA-seq to quantify the effect of a drug, gene knockout, or other experimental perturbation on a biological system. However, quantifying the compositional differences between single-cell datasets collected from multiple experimental conditions remains an analytical challenge [1] because of the heterogeneity and noise in both the data and the effects of a given perturbation.

Previous work has shown the utility of modelling the transcriptomic state space as a continuous low-dimensional manifold, or set of manifolds, to characterize cellular heterogeneity and dynamic biological processes [2][3][4][5][6][7][8]. In the manifold model, the biologically valid combinations of gene expression are represented as a smooth, low-dimensional surface in a high dimensional space, such as a two-dimensional sheet embedded in three dimensions. The main challenge in developing tools to quantify compositional differences between single-cell datasets is that each dataset comprises several intrinsic structures of heterogeneous cells, and the effect of the experimental condition could be diffuse or isolated to particular areas of...
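The two steps named in the abstract above (a graph over cells, then a per-cell estimate of how likely each condition is at that point) can be sketched in a few lines. This is a minimal illustration and not the MELD package's implementation: the unweighted kNN graph, the Laplacian-regularized smoothing filter, and the names `knn_affinity`, `smooth_condition_signal`, `k`, and `beta` are all assumptions made for the example.

```python
import numpy as np

def knn_affinity(X, k=5):
    """Symmetric, unweighted k-nearest-neighbor adjacency matrix (assumed graph)."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances between cells
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # a cell is not its own neighbor
    idx = np.argsort(d2, axis=1)[:, :k]
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    return np.maximum(A, A.T)  # keep an edge if either cell chose the other

def smooth_condition_signal(X, is_treated, k=5, beta=1.0):
    """Per-cell relative likelihood of [control, treated] after graph smoothing.

    Assumes both conditions contain at least one cell.
    """
    is_treated = np.asarray(is_treated, dtype=bool)
    A = knn_affinity(X, k)
    L = np.diag(A.sum(axis=1)) - A  # combinatorial graph Laplacian
    # One indicator column per condition, each normalized to sum to 1
    indicators = np.stack([~is_treated, is_treated], axis=1).astype(float)
    indicators /= indicators.sum(axis=0, keepdims=True)
    # Low-pass filter: solve (I + beta * L) Y = indicators (assumed filter)
    smoothed = np.linalg.solve(np.eye(len(X)) + beta * L, indicators)
    # Normalize across conditions so each cell's row sums to 1
    return smoothed / smoothed.sum(axis=1, keepdims=True)
```

With `X` as a cells-by-features array and `is_treated` as a boolean vector, column 1 of the result is a continuous, per-cell score for the treated condition. Larger `beta` smooths the signal over a wider neighborhood of the graph, which trades resolution for noise reduction.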
Systematic variation in the methylation of cytosines at CpG sites plays a critical role in early development of humans and other mammals. Of particular interest are regions of differential methylation between parental alleles, as these often dictate monoallelic gene expression, resulting in parent of origin specific control of the embryonic transcriptome and subsequent development, in a phenomenon known as genomic imprinting. Using long-read nanopore sequencing we show that, with an average genomic coverage of ∼10, it is possible to determine both the level of methylation of CpG sites and the haplotype from which each read arises. The long-read property is exploited to characterize, using novel methods, both methylation and haplotype for reads that have reduced basecalling precision compared to Sanger sequencing. We validate the analysis both through comparison of nanopore-derived methylation patterns with those from Reduced Representation Bisulfite Sequencing data and through comparison with previously reported data. Our analysis successfully identifies known imprinting control regions (ICRs) as well as some novel differentially methylated regions which, due to their proximity to hitherto unknown monoallelically expressed genes, may represent new ICRs.