A fundamental problem in Bayesian nonparametrics consists of selecting a prior distribution by assuming that the corresponding predictive probabilities obey certain properties. An early discussion of such a problem, although in a parametric framework, dates back to the seminal work by the English philosopher W. E. Johnson, who introduced a noteworthy characterization for the predictive probabilities of the symmetric Dirichlet prior distribution. This is typically referred to as Johnson's "sufficientness" postulate. In this paper we review some nonparametric generalizations of Johnson's postulate for a class of nonparametric priors known as species sampling models. In particular, we revisit and discuss the "sufficientness" postulate for the two-parameter Poisson-Dirichlet prior within the more general framework of Gibbs-type priors and their hierarchical generalizations.
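As an illustration of what a "sufficientness"-type property looks like, the sketch below computes the standard predictive probabilities of the two-parameter Poisson-Dirichlet (Pitman-Yor) prior. The function name and the parameter names `sigma` (discount) and `theta` (strength) are assumptions made for this example and are not taken from the paper. The probability of re-observing a species depends only on that species' count and the sample size, while the probability of a new species depends only on the sample size and the number of distinct species seen so far.

```python
def pitman_yor_predictive(counts, sigma, theta):
    """Predictive probabilities under a Pitman-Yor prior (illustrative sketch).

    counts: list of observed counts, one per distinct species.
    sigma:  discount parameter, assumed to satisfy 0 <= sigma < 1.
    theta:  strength parameter, assumed to satisfy theta > -sigma.
    Returns (p_new, p_old), where p_old[j] is the probability that the
    next observation belongs to species j.
    """
    if not (0 <= sigma < 1) or theta <= -sigma:
        raise ValueError("need 0 <= sigma < 1 and theta > -sigma")
    n = sum(counts)   # sample size
    k = len(counts)   # number of distinct species observed
    if n == 0:
        return 1.0, []  # the first observation is always a new species
    # Probability of a new species: depends on (n, k) only.
    p_new = (theta + k * sigma) / (theta + n)
    # Probability of species j: depends on (n, counts[j]) only.
    p_old = [(c - sigma) / (theta + n) for c in counts]
    return p_new, p_old


p_new, p_old = pitman_yor_predictive([3, 1], sigma=0.5, theta=1.0)
print(p_new, p_old)
```

Setting `sigma=0` in this sketch gives the predictive rule of the Dirichlet process.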
Let $(P_1, \ldots, P_J)$ denote $J$ populations of animals from distinct regions. A priori, it is unknown which species are present in each region and what their corresponding frequencies are. Species are shared among populations, and each species can be present in more than one region with its frequency varying across populations. In this paper we consider the problem of sequentially sampling these populations in order to observe the greatest number of different species. We adopt a Bayesian nonparametric approach and endow $(P_1, \ldots, P_J)$ with a Hierarchical Pitman-Yor process prior. As a consequence of the hierarchical structure, the $J$ unknown discrete probability measures share the same support, that of their common random base measure. Given this prior choice, we propose a sequential rule that, at every time step, given the information available up to that point, selects the population from which to collect the next observation. Rather than picking the population with the highest posterior estimate of producing a new value, the proposed rule includes a Thompson sampling step to better balance the exploration-exploitation trade-off. We also propose an extension of the algorithm to deal with incidence data, where multiple observations are collected in a time period. The performance of the proposed algorithms is assessed through a simulation study and compared to three other strategies. Finally, we compare these algorithms using a dataset of species of trees, collected from different plots in South America.
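The role of the Thompson sampling step can be illustrated with a deliberately simplified sketch. It does not implement the Hierarchical Pitman-Yor posterior used in the paper: as a stand-in, each population gets an independent Beta posterior on the probability that its next draw is a new species, and the rule samples from each posterior and picks the largest draw. This stand-in ignores both the sharing of species across populations and the fact that discovery probabilities decrease over time. All function names and the toy populations are assumptions made for this example.

```python
import random


def thompson_select(new_counts, old_counts, rng):
    """Pick a population by sampling each (stand-in) Beta posterior once."""
    draws = [rng.betavariate(1 + a, 1 + b)
             for a, b in zip(new_counts, old_counts)]
    return max(range(len(draws)), key=draws.__getitem__)


def sequential_sampling(populations, steps, seed=0):
    """Sequentially sample populations, trying to discover many species.

    populations: list of dicts mapping species label -> sampling weight.
    Returns (seen, new_counts, old_counts).
    """
    rng = random.Random(seed)
    num_pops = len(populations)
    new_counts = [0] * num_pops  # draws that revealed an unseen species
    old_counts = [0] * num_pops  # draws that repeated a known species
    seen = set()
    for _ in range(steps):
        j = thompson_select(new_counts, old_counts, rng)
        species, weights = zip(*populations[j].items())
        x = rng.choices(species, weights=weights)[0]
        if x in seen:
            old_counts[j] += 1
        else:
            seen.add(x)
            new_counts[j] += 1
    return seen, new_counts, old_counts


toy_populations = [
    {"a": 0.9, "b": 0.1},
    {"a": 0.2, "c": 0.4, "d": 0.4},
]
seen, new_counts, old_counts = sequential_sampling(toy_populations, steps=50)
print(sorted(seen), new_counts, old_counts)
```

Sampling from the posterior, rather than using its mean, is what lets a population with few observations still be chosen occasionally.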
We characterize the class of exchangeable feature allocations assigning probability $V_{n,k} \prod_{l=1}^{k} W_{m_l} U_{n-m_l}$ to a feature allocation of $n$ individuals, displaying $k$ features with counts $(m_1, \ldots, m_k)$ for these features. Each element of this class is parametrized by a countable matrix $V$ and two sequences $U$ and $W$ of non-negative weights. Moreover, a consistency condition is imposed to guarantee that the distribution for feature allocations of $n-1$ individuals is recovered from that of $n$ individuals, when the last individual is integrated out. In Theorem 1.1, we prove that the only members of this class satisfying the consistency condition are mixtures of the Indian Buffet Process over its mass parameter $\gamma$ and mixtures of the Beta-Bernoulli model over its dimensionality parameter $N$. Hence, we provide a characterization of these two models as the only, up to randomization of the parameters, consistent exchangeable feature allocations having the required product form.
Post Randomization Methods (PRAM) are among the most popular disclosure limitation techniques for both categorical and continuous data. In the categorical case, given a stochastic matrix $M$ and a specified variable, an individual belonging to category $i$ is changed to category $j$ with probability $M_{i,j}$. Every approach to choosing the randomization matrix $M$ has to balance two desiderata: 1) preserving as much statistical information from the raw data as possible; 2) guaranteeing the privacy of individuals in the dataset. This trade-off has generally been shown to be very challenging to solve. In this work, we use recent tools from the computer science literature and propose to choose $M$ as the solution of a constrained maximization problem, where we maximize the Mutual Information between raw and transformed data, given the constraint that the transformation satisfies the notion of Differential Privacy. For the general Categorical model, it is shown how this maximization problem reduces to a convex linear program and can therefore be solved with known optimization algorithms.
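The mechanics of PRAM can be sketched as follows. The matrix built here is the well-known k-ary randomized response matrix, used only as a convenient example of a stochastic matrix; it is not claimed to be the optimal $M$ derived in the paper. The privacy check assumes the local form of the Differential Privacy constraint, namely that within every column of $M$ the ratio between any two entries is at most $e^{\varepsilon}$. All function names are assumptions made for this example.

```python
import math
import random


def randomized_response_matrix(k, epsilon):
    """k-ary randomized response matrix (illustrative choice of M)."""
    denom = math.exp(epsilon) + k - 1
    stay = math.exp(epsilon) / denom  # probability of keeping the category
    move = 1.0 / denom                # probability of each other category
    return [[stay if i == j else move for j in range(k)] for i in range(k)]


def satisfies_dp(M, epsilon, rel_tol=1e-12):
    """Check the assumed constraint: M[i][j] <= exp(epsilon) * M[i2][j]."""
    bound = math.exp(epsilon) * (1 + rel_tol)
    for j in range(len(M[0])):
        column = [row[j] for row in M]
        if min(column) <= 0 or max(column) / min(column) > bound:
            return False
    return True


def apply_pram(data, M, rng=None):
    """Replace each category i in data by a draw from row i of M."""
    rng = rng or random.Random()
    categories = list(range(len(M)))
    return [rng.choices(categories, weights=M[i])[0] for i in data]


M = randomized_response_matrix(k=3, epsilon=1.0)
print(satisfies_dp(M, epsilon=1.0))
print(apply_pram([0, 1, 2, 0], M, rng=random.Random(0)))
```

A smaller `epsilon` pushes the rows of this matrix toward the uniform distribution, which strengthens privacy at the cost of information about the raw data.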
Feature allocation models generalize species sampling models by allowing every observation to belong to more than one species, now called features. Under the popular Bernoulli product model for feature allocation, given $n$ samples, we study the problem of estimating the missing mass $M_n$, namely the expected number of hitherto unseen features that would be observed if one additional individual was sampled. This is motivated by numerous applied problems where the sampling procedure is expensive, in terms of time and/or financial resources allocated, and further samples can only be motivated by the possibility of recording new unobserved features. We introduce a simple, robust and theoretically sound nonparametric estimator $\hat{M}_n$ of $M_n$. $\hat{M}_n$ turns out to have the same analytic form as the popular Good-Turing estimator of the missing mass in species sampling models, with the difference that the two estimators have different ranges. We show that $\hat{M}_n$ admits a natural interpretation both as a jackknife estimator and as a nonparametric empirical Bayes estimator, we give provable guarantees for the performance of $\hat{M}_n$ in terms of minimax rate optimality, and we provide an interesting connection between $\hat{M}_n$ and the Good-Turing estimator for species sampling. Finally, we derive non-asymptotic confidence intervals for $\hat{M}_n$, which are easily computable and do not rely on any asymptotic approximation. Our approach is illustrated with synthetic data and SNP data from the ENCODE sequencing genome project.
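A minimal sketch of an estimator with the Good-Turing analytic form is given below, assuming that form is the number of features observed in exactly one individual divided by the sample size $n$. The function name and the representation of each individual as a set of feature labels are assumptions made for this example.

```python
from collections import Counter


def good_turing_missing_mass(samples):
    """Good-Turing-form estimate of the missing mass (illustrative sketch).

    samples: list with one collection of feature labels per individual.
    Returns (number of features seen in exactly one individual) / n.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    # Count, for each feature, the number of individuals displaying it.
    counts = Counter(f for s in samples for f in set(s))
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n


samples = [{"a", "b", "c"}, {"a", "d"}]
print(good_turing_missing_mass(samples))  # 1.5
```

In this toy input, three features ("b", "c" and "d") appear in exactly one of the two individuals, so the estimate is 1.5. Because an individual can display several features, the value can exceed 1, unlike the species sampling case.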