The availability of gene expression data has dramatically increased in recent years. This data deluge could result in detailed inference of underlying regulatory networks, but the diversity of experimental platforms and protocols introduces critical biases that could hinder scalable analysis of existing data. Here, we show that the underlying structure of the E. coli transcriptome, as determined by Independent Component Analysis (ICA), is conserved across multiple independent datasets, including both RNA-seq and microarray datasets. We also show that echoes of this structure remain in the proteome, accelerating biological discovery through multi-omics analysis. We subsequently combined five transcriptomics datasets into a large compendium containing over 800 expression profiles and discovered that its underlying ICA-based structure was still comparable to that of the individual datasets. ICA thus enables deep analysis of disparate data to uncover new insights that were not visible in the individual datasets.

Publicly available datasets, such as the NCBI Gene Expression Omnibus (GEO) 1 and ArrayExpress 2, contain thousands of transcriptomics datasets that are often designed and analyzed for a specific study. Historically, microarrays were the platform of choice for transcriptomic interrogation. Over the past decade, usage of RNA sequencing (RNA-seq) has surpassed microarrays due to its higher sensitivity and ability to detect new transcripts 3. Public repositories for proteomics datasets have introduced additional reusable expression datasets 4,5.
Multiple consortia have performed extensive comparisons of expression levels across different microarray and RNA-seq platforms 6-8. These studies showed that neither expression profiling technique can accurately measure absolute gene expression levels, whereas relative abundances are consistent across a wide range of transcriptomics platforms, given appropriate quality controls. However, transcript levels alone cannot predict protein expression levels 9. To further complicate matters, batch effects and technical heterogeneity continue to present significant challenges to successful integration of omics datasets 10. Differential expression analysis is the most common analytical method applied to transcriptomics datasets. However, differential expression analysis is limited in dimensionality, interpretability, and reproducibility: it can only be applied to pairs of experimental conditions, requires additional analysis to interpret large swaths of differentially expressed genes 11,12, and is highly dependent on the quantification pipeline 13,14. Alternatively, machine learning methods, especially matrix factorization 15,16, have provided new tools for extracting low-dimensional biological information from large omics data.

In particular, independent component analysis (ICA) has been shown to extract biologically significant gene sets from many transcriptomics datasets 17-22 and two proteomics datasets 23. ICA outperformed 42 module detect...