While measurement advances now allow extensive surveys of gene activity (large numbers of genes across many samples), interpretation of these data is often confounded by noise: expression counts can differ strongly across samples due to variation of both biological and experimental origin. Complementary to perturbation approaches, we extract functionally related groups of genes by analyzing the standing variation within a sampled population. To distinguish biologically meaningful patterns from uninterpretable noise, we focus on correlated variation and develop a novel density-based clustering approach that takes advantage of a percolation transition generically arising in random, uncorrelated data. We apply our approach to two contrasting RNA sequencing data sets that sample individual variation (across single cells of fission yeast and whole animals of C. elegans worms) and demonstrate robust applicability and versatility in revealing correlated gene clusters of diverse biological origin, including cell cycle phase, development/reproduction, tissue-specific functions, and feeding history. Our technique exploits generic features of noisy high-dimensional data and is applicable, beyond gene expression, to feature-rich data that sample population-level variability in the presence of noise.
standing variation | RNAseq | clustering | random networks | criticality
A cornerstone of experimental biology is the perturbation-response paradigm, in which targeted manipulations are carefully designed to yield functional and mechanistic insights. With the recent advent of high-throughput techniques, however, the analysis of naturally occurring patterns of variation is emerging as a powerful complementary approach, and has been successfully applied to a variety of problems including protein structure-function mappings (1), gene-network prediction (2), transgenerational memory (3), and aging (4).
For studies of gene regulatory interactions, a key high-throughput technology is RNA sequencing (RNAseq), which allows transcription-level profiling of gene expression on a genome-wide scale. RNAseq experiments conforming to the perturbation-response paradigm (differential analysis of gene expression between manipulated and control conditions) have already transformed our understanding of a wide range of biological processes (5-7). With advances in single-cell techniques, RNAseq studies increasingly exploit, beyond perturbation-response, information carried by natural variation across individuals within unperturbed populations. A major success has been in classifying cells within a heterogeneous population into distinct cell types according to transcriptomic differences (8-13).
In this study, we address the complementary challenge of identifying the underlying regulatory relationships among genes from the standing variation in expression across sampled individuals. Rather than seeking to fully infer the unde...