A sender wants to accurately convey information to a receiver who has some, possibly related, data. We study the expected number of bits the sender must transmit for one and for multiple instances in two communication scenarios and relate this number to the chromatic and Körner entropies of a naturally defined graph.
For a collection of distributions over a countable support set, the worst case universal compression formulation by Shtarkov attempts to assign a universal distribution over the support set. The formulation aims to ensure that the universal distribution does not underestimate the probability of any element in the support set relative to distributions in the collection. When the alphabet is uncountable and we instead have a collection P of Lebesgue continuous measures, we ask whether there is a corresponding universal probability density function (pdf) that does not underestimate the value of the density function at any point in the support relative to pdfs in P. An example of such a measure class is the set of all Gaussian distributions whose mean and variance are in a specified range. We quantify the formulation in the uncountable support case with the attenuation of the class, a quantity analogous to the worst case redundancy of a collection of distributions over a countable alphabet. An attenuation of A implies that the worst case optimal universal pdf at any point x in the support is always at least the value any pdf in the collection P assigns to x, divided by A. We analyze the attenuation of the worst case optimal universal pdf over length-n samples generated i.i.d. from a Gaussian distribution whose mean can be anywhere between −α/2 and α/2 and whose variance lies between σ_m² and σ_M². We show that this attenuation is finite and grows with the number of samples as O(n), and we specify the attenuation exactly, without approximations. When only one parameter is allowed to vary, we show that the attenuation grows as O(√n), again in line with results from prior literature that put the order of magnitude at a factor of √n per parameter.
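The attenuation defined above can be illustrated numerically for the simplest case, where only the mean varies. The sketch below rests on one assumption that is our reading of the definition rather than a formula quoted from the source: the smallest A for which some pdf q satisfies q(x) ≥ p(x)/A for every p in P is the integral of the pointwise supremum of the densities. For n i.i.d. samples with a known standard deviation, that integral reduces to a one-dimensional integral over the sample mean, whose standard deviation is σ/√n. The function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, sigma):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def sup_density(x, alpha, sigma):
    """Largest density any Gaussian with mean in [-alpha/2, alpha/2] gives x."""
    # The maximizing mean is the allowed mean closest to x.
    best_mean = min(max(x, -alpha / 2.0), alpha / 2.0)
    return gaussian_pdf(x, best_mean, sigma)


def attenuation_numeric(alpha, sigma, n=1, steps=20000):
    """Trapezoid-rule integral of the supremum density (mean-only class).

    Assumption: attenuation = integral of sup over the class, evaluated on
    the sample mean, which has standard deviation sigma / sqrt(n).
    """
    s = sigma / math.sqrt(n)
    lo = -alpha / 2.0 - 10.0 * s   # tails beyond 10 std devs are negligible
    hi = alpha / 2.0 + 10.0 * s
    h = (hi - lo) / steps
    total = 0.5 * (sup_density(lo, alpha, s) + sup_density(hi, alpha, s))
    for k in range(1, steps):
        total += sup_density(lo + k * h, alpha, s)
    return total * h


def attenuation_closed_form(alpha, sigma, n=1):
    """Flat region contributes alpha*sqrt(n)/(sigma*sqrt(2*pi)); tails add 1."""
    return 1.0 + alpha * math.sqrt(n) / (sigma * math.sqrt(2.0 * math.pi))


if __name__ == "__main__":
    for n in (1, 4, 16):
        print(n, attenuation_numeric(2.0, 1.0, n), attenuation_closed_form(2.0, 1.0, n))
```

Under these assumptions the excess over 1 doubles each time n quadruples, which matches the O(√n) growth stated for the case where a single parameter varies.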
In addition, we also specify the attenuation exactly without approximation when only the mean or only the variance is allowed to vary.
Keywords: infinitely divisible distributions, universal compression, uncountable support, Gaussian distributions.
Compression has been well studied since Shannon [1] not only formalized what it means to represent data or signals in a compact form, but also quantified how compact the representation can be. For data that come from a countable (discrete) alphabet, this lower bound on compression is essentially the entropy of the source. Furthermore, concrete schemes to represent discrete data in bits are also known, the Huffman coding scheme being the optimal one.
While the quantification of the limits of compression is elegant, it does not take into account one of the practicalities of compression: we do not know the underlying distribution. Instead,
Estimating the number of unseen species is an important problem in many scientific endeavors. Its most popular formulation, introduced by Fisher et al. [Fisher RA, Corbet AS, Williams CB (1943) J Animal Ecol 12(1):42−58], uses n samples to predict the number U of hitherto unseen species that would be observed if t · n new samples were collected. Of considerable interest is the largest ratio t between the number of new and existing samples for which U can be accurately predicted. In seminal work, Good and Toulmin [Good I, Toulmin G (1956) Biometrika 43(1-2):45−63] constructed an intriguing estimator that predicts U for all t ≤ 1. Subsequently, Efron and Thisted [Efron B, Thisted R (1976) Biometrika 63(3):435−447] proposed a modification that empirically predicts U even for some t > 1, but without provable guarantees. We derive a class of estimators that provably predict U all of the way up to t ∝ log n. We also show that this range is the best possible and that the estimator's mean-square error is near optimal for any t. Our approach yields a provable guarantee for the Efron−Thisted estimator and, in addition, a variant with stronger theoretical and experimental performance than existing methodologies on a variety of synthetic and real datasets. The estimators are simple, linear, computationally efficient, and scalable to massive datasets. Their performance guarantees hold uniformly for all distributions, and apply to all four standard sampling models commonly used across various scientific disciplines: multinomial, Poisson, hypergeometric, and Bernoulli product.
species estimation | extrapolation model | nonparametric statistics
Species estimation is an important problem in numerous scientific disciplines. Initially used to estimate ecological diversity (1-4), it was subsequently applied to assess vocabulary size (5, 6), database attribute variation (7), and password innovation (8).
Recently, it has found a number of bioscience applications, including estimation of bacterial and microbial diversity (9-12), immune receptor diversity (13), complexity of genomic sequencing (14), and unseen genetic variations (15).
All approaches to the problem incorporate a statistical model, with the most popular being the "extrapolation model" introduced by Fisher, Corbet, and Williams (16) in 1943. It assumes that n independent samples X^n ≜ X_1, ..., X_n were collected from an unknown distribution p, and calls for estimating U, the number of hitherto unseen symbols that would be observed if m additional samples X_{n+1}^{n+m} ≜ X_{n+1}, ..., X_{n+m} were collected from the same distribution.
In 1956, Good and Toulmin (17) predicted U by a fascinating estimator that has since intrigued statisticians and a broad range of scientists alike (18). For example, in the Stanford University Statistics Department brochure (19), published in the early 1990s and slightly abbreviated here, Bradley Efron credited the problem and its elegant solution with kindling his interest in statistics. As we shall soon see, Efron, along with Ronald Thisted, ...
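The extrapolation model above can be made concrete with a small sketch of the Good-Toulmin estimator. The excerpt does not spell out the formula; the code assumes the standard form U_GT = −Σ_{i≥1} (−t)^i Φ_i, where Φ_i is the number of species observed exactly i times among the n samples and t is the ratio of new to existing samples. The function names are hypothetical, and this is only the classical estimator, not the improved estimators the paper derives.

```python
from collections import Counter


def prevalences(samples):
    """Map i -> number of distinct species seen exactly i times."""
    species_counts = Counter(samples)          # species -> multiplicity
    return Counter(species_counts.values())    # multiplicity -> #species


def good_toulmin(samples, t):
    """Classical Good-Toulmin prediction of unseen species in t*n new samples.

    Assumed formula: U_GT = -sum_{i >= 1} (-t)**i * Phi_i.
    """
    phi = prevalences(samples)
    return -sum(((-t) ** i) * count for i, count in phi.items())


if __name__ == "__main__":
    observed = ["a", "a", "b", "c", "c", "c", "d"]
    # Phi_1 = 2 (b, d), Phi_2 = 1 (a), Phi_3 = 1 (c)
    print(good_toulmin(observed, 1.0))   # 2.0
    print(good_toulmin(observed, 0.5))   # 0.875
```

Because the i-th term scales as t^i, the terms grow geometrically once t exceeds 1. This is consistent with the abstract's remark that the estimator predicts U for t ≤ 1, and it is why later work modifies the series to reach larger t.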