Multicellular organisms employ concurrent gene regulatory programs to control the development and physiology of cells and tissues. The Drosophila melanogaster model system has a remarkable history of revealing the genes and mechanisms underlying fundamental biology, yet much remains unclear. In particular, brain xenobiotic protection and endobiotic regulatory systems, which require transcriptional coordination across different cell types and operate in parallel with the primary nervous system and metabolic functions of each cell type, are still poorly understood. Here we apply the unsupervised machine learning method independent component analysis (ICA) to predominantly fresh-frozen, bulk-tissue microarrays to define biologically pertinent gene expression signatures that are sparse, i.e., each involves only a fraction of all fly genes. We optimize the gene expression signature definitions partly through repeated application of a stochastic ICA algorithm to a compendium of 3,346 microarrays from 221 experiments provided by the Drosophila research community. Our optimized ICA model of pan-fly gene expression consists of 850 modules of co-regulated genes that map to tissue developmental stages, disease states, cell-autonomous pathways and presumably novel processes. Importantly, we show biologically relevant gene modules expressed at varying amplitudes in whole brain and in isolated adult blood-brain barrier cells. Thus, whole-tissue-derived ICA transcriptional signatures that transcend single-cell-type boundaries provide a window into the transcriptional states of difficult-to-isolate cell ensembles that maintain delicate brain physiologies. Because ICA succeeds at inferring robust, often low-amplitude patterns across large datasets, and because the input samples are of high quality, we believe the fly ICA gene expression signature set to be an important asset for analyzing compendium and newly generated microarray or RNA-seq expression datasets.
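Because the ICA algorithm is stochastic, repeated runs can return the same component in a different order or with its sign flipped. The following is a minimal sketch of one way to check whether a component recurs across two runs, by matching on absolute Pearson correlation. It is an illustration under stated assumptions and not the authors' optimization procedure: the function names `pearson` and `best_matches` and the toy gene loadings are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of gene loadings."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def best_matches(run_a, run_b):
    """For each component in run_a, find its closest component in run_b.

    The absolute correlation is used because the sign of an ICA
    component is arbitrary. Returns (index_a, index_b, |r|) tuples.
    """
    matches = []
    for i, comp_a in enumerate(run_a):
        scores = [abs(pearson(comp_a, comp_b)) for comp_b in run_b]
        j = max(range(len(scores)), key=scores.__getitem__)
        matches.append((i, j, scores[j]))
    return matches

# Hypothetical loadings: 2 components over 5 genes, from two ICA runs.
# Run B holds the same patterns, reordered and sign-flipped.
run_a = [[0.1, 2.0, -0.2, 0.0, 1.8],
         [1.5, 0.0, 0.1, -1.6, 0.2]]
run_b = [[-1.4, 0.1, 0.0, 1.7, -0.1],
         [0.0, -2.1, 0.3, 0.1, -1.9]]

matches = best_matches(run_a, run_b)
for i, j, score in matches:
    print(f"run A component {i} matches run B component {j} (|r| = {score:.2f})")
```

Components that find a high-correlation partner in most runs could be treated as reproducible; any cutoff for "high" would be a further assumption.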
Annotating genes with information describing their role in the cell is a fundamental goal in biology, and essential for interpreting data-rich assays such as microarray analysis and RNA-Seq. Gene annotation takes many forms, from Gene Ontology (GO) terms, to tissues or cell types of significant expression, to putative regulatory factors and DNA sequences. Almost invariably in gene databases, annotations are connected to genes by a Boolean relationship, e.g., a GO term either is or isn't associated with a particular gene. While useful for many purposes, Boolean-type annotations fail to capture the varying degrees to which some annotations describe their associated genes, and give no indication of the relevance of annotations to cellular logistical activities such as gene expression. We hypothesized that weighted annotations could prove useful for understanding gene function and for interpreting gene expression data, and developed a method to generate these from Boolean annotations and a large compendium of gene expression data. The method uses an independent component analysis-based approach to find gene modules in the compendium, and then assigns gene-specific weights to annotations in proportion to the degree to which they are shared among members of the module. The reasoning is that the more an annotation is shared by genes in a module, the more likely it is to be relevant to their function and, therefore, the higher it should be weighted. In this paper, we show that analysis of expression data with module-weighted annotations appears to be more resistant to the confounding effect of gene-gene correlations than non-weighted annotation enrichment analysis, and show several examples in which module-weighted annotations provide biological insights not revealed by Boolean annotations. We also show that application of the method to a simple form of genetic regulatory annotation, namely, the presence or absence of putative regulatory words (oligonucleotides) in gene promoters, leads to module-weighted words that closely match known regulatory sequences, and that these can be used to quickly determine key regulatory sequences in differential expression data.
bioRxiv preprint first posted online Dec. 24, 2016; doi: http://dx.doi.org/10.1101/096677. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY 4.0 International license.
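The weighting idea described above, in which an annotation counts for more the more widely it is shared within a gene's module, can be sketched as follows. This is a minimal illustration under stated assumptions, not the published method: the weight is taken to be the fraction of module members carrying the annotation, a gene in several modules keeps its highest weight, and the function name `module_weighted_annotations`, the gene names, and the terms are all hypothetical.

```python
def module_weighted_annotations(modules, annotations):
    """Turn Boolean gene annotations into module-based weights.

    modules:     dict mapping module name -> list of member genes
    annotations: dict mapping gene -> set of Boolean annotation terms
    Returns a dict mapping (gene, term) -> weight in (0, 1].
    """
    weights = {}
    for members in modules.values():
        size = len(members)
        # Every term carried by at least one member of this module.
        terms = {t for g in members for t in annotations.get(g, ())}
        for term in terms:
            carriers = [g for g in members if term in annotations.get(g, ())]
            fraction = len(carriers) / size  # assumed weighting rule
            for gene in carriers:
                key = (gene, term)
                # Assumption: keep the highest weight across modules.
                weights[key] = max(weights.get(key, 0.0), fraction)
    return weights

# Hypothetical module of four genes with Boolean annotations.
modules = {"m1": ["geneA", "geneB", "geneC", "geneD"]}
annotations = {
    "geneA": {"transport", "signaling"},
    "geneB": {"transport"},
    "geneC": {"transport"},
    "geneD": {"metabolism"},
}

weights = module_weighted_annotations(modules, annotations)
for (gene, term), weight in sorted(weights.items()):
    print(gene, term, weight)
```

In this toy case "transport" is shared by three of the four module members, so it receives a higher weight for geneA than "signaling", which only geneA carries. A Boolean annotation would treat the two terms identically.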