Motivation: Although seldom acknowledged explicitly, count data generated by sequencing platforms exist as compositions for which the abundance of each component (e.g. gene or transcript) is only coherently interpretable relative to other components within that sample. This property arises from the assay technology itself, whereby the number of counts recorded for each sample is constrained by an arbitrary total sum (i.e. library size). Consequently, sequencing data, as compositional data, exist in a non-Euclidean space that, without normalization or transformation, renders invalid many conventional analyses, including distance measures, correlation coefficients and multivariate statistical models. Results: The purpose of this review is to summarize the principles of compositional data analysis (CoDA), provide evidence for why sequencing data are compositional, discuss compositionally valid methods available for analyzing sequencing data, and highlight future directions with regard to this field of study. Supplementary information: Supplementary data are available at Bioinformatics online.
Background: Next-generation sequencing (NGS) has made it possible to determine the sequence and relative abundance of all nucleotides in a biological or environmental sample. A cornerstone of NGS is the quantification of RNA or DNA presence as counts. However, these counts are not counts per se: their magnitude is determined arbitrarily by the sequencing depth, not by the input material. Consequently, counts must undergo normalization prior to use. Conventional normalization methods require a set of assumptions: they assume that the majority of features are unchanged and that all environments under study have the same carrying capacity for nucleotide synthesis. These assumptions are often untestable and may not hold when heterogeneous samples are compared. Results: Methods developed within the field of compositional data analysis offer a general solution that is assumption-free and valid for all data. Herein, we synthesize the extant literature to provide a concise guide on how to apply compositional data analysis to NGS count data. Conclusions: In highlighting the limitations of total library size, effective library size, and spike-in normalizations, we propose the log-ratio transformation as a general solution to answer the question, “Relative to some important activity of the cell, what is changing?”
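The log-ratio idea in the abstract above can be illustrated with a minimal sketch. It uses the centered log-ratio (CLR) transform, one common member of the log-ratio family; the abstract does not say which transform the authors use, so this choice is an assumption. The function name `clr` and the toy counts are hypothetical, and zero counts are rejected here rather than handled, since zero replacement is a separate modelling decision.

```python
import math


def clr(counts):
    """Centered log-ratio transform of one sample's counts.

    Each component is expressed as the log of its ratio to the geometric
    mean of the sample. Hypothetical helper for illustration only.
    """
    if any(c <= 0 for c in counts):
        # Assumption: the caller has already dealt with zeros.
        raise ValueError("CLR needs strictly positive values")
    logs = [math.log(c) for c in counts]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [x - mean_log for x in logs]


# Toy data: the same composition sequenced at two different depths.
shallow = [10, 20, 70]     # library size 100
deep = [100, 200, 700]     # library size 1000

clr_shallow = clr(shallow)
clr_deep = clr(deep)

# The transformed values agree (up to floating-point error) because
# multiplying every count by a constant cancels inside each ratio.
same = all(math.isclose(a, b, abs_tol=1e-9)
           for a, b in zip(clr_shallow, clr_deep))
print(same)
```

Because the arbitrary library size cancels, the transformed values describe each component only relative to the other components in the same sample, which is the interpretation the abstract argues for.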
Budding yeast telomeres and cryptic mating-type loci are enriched at the nuclear envelope, forming foci that sequester silent information regulators (SIR factors), much as heterochromatic chromocenters in higher eukaryotes sequester HP1. Here we examine the impact of such subcompartments for regulating transcription genome-wide. We show that the efficiency of subtelomeric reporter gene repression depends not only on the strength of SIR factor recruitment by cis-acting elements, but also on the accumulation of SIRs in such perinuclear foci. To monitor the effects of disrupting this subnuclear compartment, we performed microarray analyses under conditions that eliminate telomere anchoring, while preserving SIR complex integrity. We found 60 genes reproducibly misregulated. Among those with increased expression, 22% were within 20 kb of a telomere, confirming that the nuclear envelope (NE) association of telomeres helps repress natural subtelomeric genes. In contrast, loci that were down-regulated were distributed over all chromosomes. Half of this ectopic repression was SIR complex dependent. We conclude that released SIR factors can promiscuously repress transcription at nontelomeric genes despite the presence of "anti-silencing" mechanisms. Bioinformatic analysis revealed that promoters bearing the PAC (RNA Polymerase A and C promoters) or Abf1 binding consensus sequences are consistently down-regulated by mislocalization of SIR factors. Thus, the normal telomeric sequestration of SIRs both favors subtelomeric repression and prevents promiscuous effects at a distinct subset of promoters. This demonstrates that patterns of gene expression can be regulated by changing the spatial distribution of repetitive DNA sequences that bind repressive factors.
Background: Genomic studies of endangered species provide insights into their evolution and demographic history, reveal patterns of genomic erosion that might limit their viability, and offer tools for their effective conservation. The Iberian lynx (Lynx pardinus) is the most endangered felid and a unique example of a species on the brink of extinction. Results: We generate the first annotated draft of the Iberian lynx genome and carry out genome-based analyses of lynx demography, evolution, and population genetics. We identify a series of severe population bottlenecks in the history of the Iberian lynx that predate its known demographic decline during the 20th century and have greatly impacted its genome evolution. We observe drastically reduced rates of weak-to-strong substitutions associated with GC-biased gene conversion and increased rates of fixation of transposable elements. We also find multiple signatures of genetic erosion in the two remnant Iberian lynx populations, including a high frequency of potentially deleterious variants and substitutions, as well as the lowest genome-wide genetic diversity reported so far in any species. Conclusions: The genomic features observed in the Iberian lynx genome may hamper short- and long-term viability through reduced fitness and adaptive potential. The knowledge and resources developed in this study will boost the research on felid evolution and conservation genomics and will benefit the ongoing conservation and management of this emblematic species. Electronic supplementary material: The online version of this article (doi:10.1186/s13059-016-1090-1) contains supplementary material, which is available to authorized users.