Many complex systems are modular. Such systems can be represented as "component systems," i.e., sets of elementary components, such as LEGO bricks in LEGO sets. The bricks found in a LEGO set reflect a target architecture, which can be built following a set-specific list of instructions. In other component systems, instead, the underlying functional design and constraints are not obvious a priori, and their detection is often a challenge of both scientific and practical importance, requiring a clear understanding of component statistics. Importantly, some quantitative invariants appear to be common to many component systems, most notably a common broad distribution of component abundances, which often resembles the well-known Zipf's law. Such "laws" affect in a general and nontrivial way the component statistics, potentially hindering the identification of system-specific functional constraints or generative processes. Here, we specifically focus on the statistics of shared components, i.e., the distribution of the number of components shared by different system realizations, such as the common bricks found in different LEGO sets. To account for the effects of component heterogeneity, we consider a simple null model, which builds system realizations by random draws from a universe of possible components. Under general assumptions on abundance heterogeneity, we provide analytical estimates of component occurrence, which quantify exhaustively the statistics of shared components. Surprisingly, this simple null model can positively explain important features of empirical component-occurrence distributions obtained from large-scale data on bacterial genomes, LEGO sets, and book chapters. Specific architectural features and functional constraints can be detected from occurrence patterns as deviations from these null predictions, as we show for the illustrative case of the "core" genome in bacteria.
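The random-draw null model described above can be illustrated with a minimal sketch. It assumes a Zipf-like abundance distribution over a finite universe of components and builds each realization by independent draws with replacement. The function names, the exponent, the universe size, and the realization size are all hypothetical choices made for illustration, not values taken from the paper.

```python
import random
from collections import Counter


def zipf_weights(n_components, exponent=1.0):
    """Unnormalized Zipf-like abundances: weight of rank r is r**(-exponent)."""
    return [1.0 / (rank ** exponent) for rank in range(1, n_components + 1)]


def draw_realization(weights, size, rng):
    """Draw `size` components with replacement and keep the distinct ones."""
    drawn = rng.choices(range(len(weights)), weights=weights, k=size)
    return set(drawn)


def occurrence_distribution(realizations):
    """Map k -> number of components present in exactly k realizations."""
    occurrence = Counter()
    for components in realizations:
        occurrence.update(components)
    return Counter(occurrence.values())


# Illustrative parameters (assumptions, not taken from the source).
rng = random.Random(0)
weights = zipf_weights(1000)
realizations = [draw_realization(weights, 200, rng) for _ in range(50)]
hist = occurrence_distribution(realizations)

# Components found in every realization play the role of a "core".
core_size = hist.get(len(realizations), 0)
```

In such a sketch, the occurrence histogram `hist` is the quantity one would compare against empirical data; deviations from it would then point to system-specific constraints rather than to abundance heterogeneity alone.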
Complex natural and technological systems can be considered, on a coarse-grained level, as assemblies of elementary components: for example, genomes as sets of genes or texts as sets of words. On one hand, the joint occurrence of components emerges from architectural and specific constraints in such systems. On the other hand, general regularities may unify different systems, such as the broadly studied Zipf and Heaps laws, respectively concerning the distribution of component frequencies and their number as a function of system size. Dependency structures (i.e., directed networks encoding the dependency relations between the components in a system) were proposed recently as a possible organizing principle underlying some of the regularities observed. However, the consequences of this assumption were explored only in binary component systems, where solely the presence or absence of components is considered, and multiple copies of the same component are not allowed. Here we consider a simple model that generates, from a given ensemble of dependency structures, a statistical ensemble of sets of components, allowing for components to appear with any multiplicity. Our model is a minimal extension that is memoryless and therefore accessible to analytical calculations. A mean-field analytical approach (analogous to the "Zipfian ensemble" in the linguistics literature) captures the relevant laws describing the component statistics, as we show by comparison with numerical computations. In particular, we recover a power-law Zipf rank plot, with a set of core components, and a Heaps law displaying three consecutive regimes (linear, sublinear, and saturating) that we characterize quantitatively.
Large-scale data on single-cell gene expression have the potential to unravel the specific transcriptional programs of different cell types. The structure of these expression datasets suggests a similarity with several other complex systems that can be analogously described through the statistics of their basic building blocks. Transcriptomes of single cells are collections of messenger RNA abundances transcribed from a common set of genes, just as books are different collections of words from a shared vocabulary, genomes of different species are specific compositions of genes belonging to evolutionary families, and ecological niches can be described by their species abundances. Following this analogy, we identify several emergent statistical laws in single-cell transcriptomic data closely similar to regularities found in linguistics, ecology or genomics. A simple mathematical framework can be used to analyze the relations between different laws and the possible mechanisms behind their ubiquity. Importantly, tractable statistical models can be useful tools in transcriptomics to disentangle the actual biological variability from general statistical effects present in most component systems and from the consequences of the sampling process inherent to the experimental technique.
Zipf's law is a hallmark of several complex systems with a modular structure, such as books composed of words or genomes composed of genes. In these component systems, Zipf's law describes the empirical power-law distribution of component frequencies. Stochastic processes based on a sample-space-reducing (SSR) mechanism, in which the number of accessible states reduces as the system evolves, have been recently proposed as a simple explanation for the ubiquitous emergence of this law. However, many complex component systems are characterized by other statistical patterns beyond Zipf's law, such as a sublinear growth of the component vocabulary with the system size, known as Heaps' law, and specific statistics of shared components. This work shows, with analytical calculations and simulations, that these statistical properties can emerge jointly from an SSR mechanism, thus making it an appropriate parameter-poor representation for component systems. Several alternative (and equally simple) models, for example based on the preferential attachment mechanism, can also reproduce Heaps' and Zipf's laws, suggesting that additional statistical properties should be taken into account to select the most likely generative process for a specific system. Along this line, we will show that the temporal component distribution predicted by the SSR model is markedly different from the one emerging from the popular rich-gets-richer mechanism. A comparison with empirical data from natural language indicates that the SSR process can be chosen as a better candidate model for text generation based on this statistical property. Finally, a limitation of the SSR model in reproducing the empirical "burstiness" of word appearances in texts will be pointed out, thus indicating a possible direction for extensions of the basic SSR process.
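As a rough illustration of the sample-space-reducing idea, the sketch below simulates a basic SSR process as it is commonly described: the process starts at the highest state, jumps uniformly to any strictly lower state, and restarts once it reaches state 1. This is an assumed textbook variant, not necessarily the exact model analyzed in the work above; the function names and parameter values are hypothetical.

```python
import random
from collections import Counter


def ssr_run(n_states, rng):
    """One SSR cascade: start at n_states, jump uniformly to a lower state
    until state 1 is reached. Returns the list of visited states."""
    state = n_states
    visited = []
    while state > 1:
        # The accessible sample space shrinks to {1, ..., state - 1}.
        state = rng.randint(1, state - 1)
        visited.append(state)
    return visited


def ssr_visit_counts(n_states, n_runs, seed=0):
    """Accumulate visit counts per state over many restarted cascades."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_runs):
        counts.update(ssr_run(n_states, rng))
    return counts


# Illustrative parameters (assumptions, not taken from the source).
counts = ssr_visit_counts(n_states=100, n_runs=20000)

# Under this basic process, state i is visited with probability about 1/i
# per cascade, so counts[i] * i should be roughly constant (Zipf's law).
ratios = {i: counts[i] * i / counts[1] for i in (2, 5, 10)}
```

Here the states stand in for components ranked by frequency; checking that `ratios` stays close to 1 is a quick way to see the Zipf-like visiting frequencies in this simplified setting.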
High-throughput sequencing of T- and B-cell receptors makes it possible to track immune repertoires across time, in different tissues, in acute and chronic diseases and in healthy individuals. However, quantitative comparison between repertoires is confounded by variability in the read count of each receptor clonotype due to sampling, library preparation, and expression noise. We review methods for accounting for both biological and experimental noise and present an easy-to-use Python package, NoisET, that implements and generalizes a previously developed Bayesian method. It can be used to learn experimental noise models for repertoire sequencing from replicates, and to detect responding clones following a stimulus. We test the package on different repertoire sequencing technologies and data sets. We review how such approaches have been used to identify responding clonotypes in vaccination and disease data. Availability: NoisET is freely available to use with source code at github.com/statbiophys/NoisET.