The recent dramatic increase in online data availability has allowed researchers to explore human culture in unprecedented detail, including the growth and diversification of language. In particular, it provides statistical tools to explore whether word use is similar across languages, and if so, whether these generic features appear at different scales of language structure. Here we use the Google Books N-grams dataset to analyze the temporal evolution of word usage in several languages. We apply recently proposed measures of rank dynamics, such as the diversity of N-grams in a given rank, the probability that an N-gram changes rank between successive time intervals, the rank entropy, and the rank complexity. Using different methods, we show that there are generic properties for different languages at different scales, such as a core of words necessary to minimally understand a language. We also propose a null model to explore the relevance of linguistic structure across multiple scales, concluding that N-gram statistics cannot be reduced to word statistics. We expect our results to be useful in improving text prediction algorithms, as well as in shedding light on the large-scale features of language use, beyond linguistic and cultural differences across human populations.
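As an illustration of two of the rank measures named above, the following minimal Python sketch computes a rank diversity and a rank change probability from a sequence of ranked lists. It assumes the usual reading of these measures (the number of distinct items seen at a rank divided by the number of time intervals, and the fraction of successive intervals in which the occupant of a rank changes); the abstract itself does not spell out the formulas, and the function names and toy data are hypothetical.

```python
def rank_diversity(rankings):
    """Assumed definition: distinct items seen at rank k, divided by the
    number of time intervals T. `rankings` holds one ranked list of items
    per time interval, most frequent item first."""
    if not rankings:
        raise ValueError("need at least one ranking")
    T = len(rankings)
    depth = min(len(r) for r in rankings)  # only ranks present in every interval
    return [len({r[k] for r in rankings}) / T for k in range(depth)]


def rank_change_probability(rankings):
    """Assumed definition: fraction of successive interval pairs in which
    the item occupying rank k differs."""
    if len(rankings) < 2:
        raise ValueError("need at least two rankings")
    T = len(rankings)
    depth = min(len(r) for r in rankings)
    return [
        sum(rankings[t][k] != rankings[t + 1][k] for t in range(T - 1)) / (T - 1)
        for k in range(depth)
    ]


# Toy data (hypothetical): three yearly rankings of 1-grams.
toy = [
    ["the", "of", "and"],
    ["the", "and", "of"],
    ["the", "of", "in"],
]
diversity = rank_diversity(toy)
change = rank_change_probability(toy)
```

In the toy data the top rank never changes, so its diversity is low, while the lower ranks are occupied by several different words. This is the kind of stable core that the abstract describes.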
Criticality has been proposed as a mechanism for the emergence of complexity, life, and computation, as it exhibits a balance between robustness and adaptability. In classic models of complex systems where structure and dynamics are considered homogeneous, criticality is restricted to phase transitions, leading either to robust (ordered) or adaptive (chaotic) phases in most of the parameter space. Many real-world complex systems, however, are not homogeneous. Some elements change in time faster than others, with slower elements (usually the most relevant) providing robustness, and faster ones being adaptive. Structural patterns of connectivity are also typically heterogeneous, characterized by few elements with many interactions and most elements with only a few. Here we take a few traditionally homogeneous dynamical models and explore their heterogeneous versions, finding evidence that heterogeneity extends criticality. Thus, parameter fine-tuning is not necessary to reach a phase transition and obtain the benefits of (homogeneous) criticality. Simply adding heterogeneity can extend criticality, making the search/evolution of complex systems faster and more reliable. Our results add theoretical support for the ubiquitous presence of heterogeneity in physical, social, and technological systems, as natural selection can exploit heterogeneity to evolve complexity "for free". In artificial systems and biological design, heterogeneity may also be used to extend the parameter range that allows for criticality.
Most models of complex systems have been homogeneous, i.e., all elements have the same properties (spatial, temporal, structural, functional). However, most natural systems are heterogeneous: a few elements are more relevant, larger, stronger, or faster than others. In homogeneous systems, criticality (a balance between change and stability, order and chaos) is usually found in a very narrow region of the parameter space, close to a phase transition. Using random Boolean networks, a general model of discrete dynamical systems, we show that heterogeneity in time, structure, and function can additively broaden the parameter region where criticality is found. Moreover, parameter regions where antifragility is found also increase with heterogeneity. However, maximum antifragility is found for particular parameters in homogeneous networks. Our work suggests that the “optimal” balance between homogeneity and heterogeneity is non-trivial, context-dependent, and in some cases, dynamic.
Statistical linguistics has advanced considerably in recent decades as data has become available. This has allowed researchers to study how statistical properties of languages change over time. In this work, we use data from Twitter to explore English and Spanish, considering the rank diversity at different scales: temporal (from 3 to 96 hour intervals), spatial (from 3 km to 3000+ km radii), and grammatical (from monograms to pentagrams). We find that all three scales are relevant. However, the greatest changes come from variations in the grammatical scale. At the lowest grammatical scale (monograms), the rank diversity curves are most similar, independently of the values of other scales, languages, and countries. As the grammatical scale grows, the rank diversity curves vary more depending on the temporal and spatial scales, as well as on the language and country. We also study the statistics of Twitter-specific tokens: emojis, hashtags, and user mentions. The rank diversity curves of these types of tokens show sigmoid-like behaviour. Our results help to quantify which aspects of language statistics seem universal and which factors may lead to variations.
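The grammatical scale described above can be sketched in Python as follows: a token sequence is split into N-grams for N from 1 (monograms) to 5 (pentagrams), and each set is ranked by frequency. Tokenization by whitespace, the tie-breaking rule, and the function names are assumptions for illustration; the abstract does not describe the preprocessing used.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous runs of n tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ranked_ngrams(tokens, n):
    """N-grams ordered by decreasing frequency. Ties are broken
    alphabetically so the ranking is deterministic (an assumption)."""
    counts = Counter(ngrams(tokens, n))
    return [g for g, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


# Toy text (hypothetical), split on whitespace.
tokens = "a b a b c".split()
rankings_by_scale = {n: ranked_ngrams(tokens, n) for n in range(1, 6)}
```

Computing one such ranking per time window or per geographic radius gives the sequences of ranked lists from which rank diversity curves at each scale could be derived.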