Summary: The topological landscape of molecular or functional interaction networks provides a rich source of information for inferring functional patterns of genes or proteins. However, a pressing yet unsolved challenge is how to combine multiple heterogeneous networks, each having different connectivity patterns, to achieve more accurate inference. Here we describe the Mashup framework for scalable and robust network integration. In Mashup, the diffusion in each network is first analyzed to characterize the topological context of each node. Next, the high-dimensional topological patterns in individual networks are canonically represented using low-dimensional vectors, one per gene or protein. These vectors can then be plugged into off-the-shelf machine learning methods to derive functional insights about genes or proteins. We present tools based on Mashup that achieve state-of-the-art performance in three diverse functional inference tasks: protein function prediction, gene ontology reconstruction, and genetic interaction prediction. Mashup enables deeper insights into the structure of rapidly accumulating, diverse biological network data and can be broadly applied to other network science domains.
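The diffusion step described above can be illustrated with a minimal sketch. It assumes random walk with restart as the diffusion process, which the summary does not name, and it uses a hypothetical four-node toy network. The function name, the restart probability, and the iteration count are illustrative choices, not part of Mashup. The later step that compresses these profiles into low-dimensional vectors is not shown.

```python
def random_walk_with_restart(adjacency, seed, restart_prob=0.5, n_iter=100):
    """Approximate the stationary distribution of a walk that restarts at `seed`."""
    n = len(adjacency)
    # Column-normalize so transition[i][j] is the probability of stepping j -> i.
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    transition = [
        [adjacency[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
    start = [1.0 if i == seed else 0.0 for i in range(n)]
    state = start[:]
    for _ in range(n_iter):
        # One diffusion step: follow an edge, or jump back to the seed node.
        state = [
            (1 - restart_prob) * sum(transition[i][j] * state[j] for j in range(n))
            + restart_prob * start[i]
            for i in range(n)
        ]
    return state


# Hypothetical toy network: a path graph 0 - 1 - 2 - 3.
toy_network = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

# One diffusion profile per node; each profile describes that node's topological context.
diffusion_profiles = [random_walk_with_restart(toy_network, s) for s in range(4)]
```

Each profile is a probability distribution over nodes, so nodes with similar neighborhoods receive similar profiles. With several networks, one such set of profiles would be computed per network before any joint low-dimensional representation is derived.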
Most sequenced genomes are currently stored in strict access-controlled repositories [1–3]. Free access to these data could improve the power of genome-wide association studies (GWAS) to identify disease-causing genetic variants and may aid in the discovery of new drug targets [4,5]. However, concerns over genetic data privacy [6–9] may deter individuals from contributing their genomes to scientific studies [10] and, in many cases, prevent researchers from sharing data with the scientific community [11]. Although several cryptographic techniques for secure data analysis exist [12–14], none scales to computationally intensive analyses, such as GWAS. Here we describe an end-to-end protocol for large-scale genome-wide analysis that facilitates quality control and population stratification correction in 9K, 13K, and 23K individuals while maintaining the confidentiality of underlying genotypes and phenotypes. We show the protocol could feasibly scale to a million individuals. This approach may help to make currently restricted data available to the scientific community and could potentially enable ‘secure genome crowdsourcing,’ allowing individuals to contribute their genomes to a study without compromising their privacy.
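To make the idea of computing on confidential genotypes concrete, here is a minimal sketch of additive secret sharing, one general technique for secure computation. The abstract does not state which cryptographic primitives the protocol uses, so this is an assumption for illustration only. The modulus, the number of parties, and the genotype values are all hypothetical.

```python
import secrets

PRIME = 2**61 - 1  # hypothetical modulus; all arithmetic is done modulo this prime


def share(value, n_parties=3):
    """Split `value` into random shares that sum to it modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def add_shared(a, b):
    """Add two shared values; each party only adds its own two shares."""
    return [(x + y) % PRIME for x, y in zip(a, b)]


def reconstruct(shares):
    """Recombine all shares to reveal the underlying value."""
    return sum(shares) % PRIME


# Hypothetical minor-allele counts at one variant for four individuals.
genotype_counts = [2, 0, 1, 1]
shared_genotypes = [share(g) for g in genotype_counts]

# Sum the shared values without reconstructing any individual genotype.
running_total = shared_genotypes[0]
for s in shared_genotypes[1:]:
    running_total = add_shared(running_total, s)

allele_total = reconstruct(running_total)
```

Only the aggregate is reconstructed here; any single party's shares are uniformly random on their own. Operations beyond addition, such as the multiplications needed for quality control or stratification correction, require additional machinery that this sketch does not cover.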
Motivation: Systematically predicting gene (or protein) function based on molecular interaction networks has become an important tool in refining and enhancing the existing annotation catalogs, such as the Gene Ontology (GO) database. However, functional labels with only a few (<10) annotated genes, which constitute about half of the GO terms in yeast, mouse and human, pose a unique challenge in that any prediction algorithm that independently considers each label faces a paucity of information and thus is prone to capture non-generalizable patterns in the data, resulting in poor predictive performance. There exist a variety of algorithms for function prediction, but none properly address this ‘overfitting’ issue of sparsely annotated functions, or do so in a manner scalable to tens of thousands of functions in the human catalog. Results: We propose a novel function prediction algorithm, clusDCA, which transfers information between similar functional labels to alleviate the overfitting problem for sparsely annotated functions. Our method is scalable to datasets with a large number of annotations. In a cross-validation experiment in yeast, mouse and human, our method greatly outperformed previous state-of-the-art function prediction algorithms in predicting sparsely annotated functions, without sacrificing the performance on labels with sufficient information. Furthermore, we show that our method can accurately predict genes that will be assigned a functional label that has no known annotations, based only on the ontology graph structure and genes associated with other labels, which further suggests that our method effectively utilizes the similarity between gene functions. Availability and implementation: https://github.com/wangshenguiuc/clusDCA. Contact: jianpeng@illinois.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Highlights:
- Method to subsample massive scRNA-seq datasets while preserving rare cell states
- Resulting "sketch" accelerates clustering, visualization, and integration analyses
- Highlighting rare cells helps uncover a rare subtype of inflammatory macrophages
- Sketches can boost the utility of single-cell data for labs with limited resources
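A minimal sketch of how a subsample can preserve rare states follows. It assumes a simple approach in which the data space is covered with equal-sized grid boxes and one cell is drawn per occupied box. The highlights do not describe the actual mechanism, so the function name, the box size, and the synthetic two-dimensional "cells" are hypothetical.

```python
import random


def grid_sketch(points, cell_size, rng):
    """Pick one point index per occupied grid box, so sparse regions are not drowned out."""
    boxes = {}
    for idx, point in enumerate(points):
        # Identify the grid box containing this point.
        key = tuple(int(coord // cell_size) for coord in point)
        boxes.setdefault(key, []).append(idx)
    # Every occupied box contributes exactly one point, regardless of its density.
    return sorted(rng.choice(members) for members in boxes.values())


rng = random.Random(0)
# Hypothetical data: 1000 cells in a dense common state, 5 cells in a distant rare state.
common_cells = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(1000)]
rare_cells = [(rng.uniform(5, 5.5), rng.uniform(5, 5.5)) for _ in range(5)]
cells = common_cells + rare_cells

sketch_indices = grid_sketch(cells, cell_size=0.5, rng=rng)
rare_in_sketch = [i for i in sketch_indices if i >= len(common_cells)]
```

Uniform random subsampling of a few cells from this dataset would usually miss the rare group entirely, whereas sampling per occupied box keeps at least one rare cell because the rare group occupies its own region of the space.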
Nonlinear data-visualization methods, such as t-SNE and UMAP, summarize the complex transcriptomic landscape of single cells in 2D or 3D, but they neglect the local density of data points in the original space, often resulting in misleading visualizations where densely populated subsets of cells are given more visual space than warranted by their transcriptional diversity in the dataset. We present den-SNE and densMAP, density-preserving visualization tools based on t-SNE and UMAP, respectively, and demonstrate their ability to accurately incorporate information about transcriptomic variability into the visual interpretation of single-cell RNA-seq data. Applied to recently published datasets, our methods reveal significant changes in transcriptomic variability in a range of biological processes, including heterogeneity in transcriptomic variability of immune cells in blood and tumor, human immune cell specialization, and the developmental trajectory of C. elegans. Our methods are readily applicable to visualizing high-dimensional data in other scientific domains.