Post Randomization Methods (PRAM) are among the most popular disclosure limitation techniques for both categorical and continuous data. In the categorical case, given a stochastic matrix M and a specified variable, an individual belonging to category i is changed to category j with probability $M_{i,j}$. Every approach to choosing the randomization matrix M has to balance two desiderata: 1) preserving as much statistical information from the raw data as possible; 2) guaranteeing the privacy of individuals in the dataset. This trade-off has generally proven very challenging to resolve. In this work, we use recent tools from the computer science literature and propose to choose M as the solution of a constrained maximization problem: we maximize the Mutual Information between raw and transformed data, subject to the constraint that the transformation satisfies the notion of Differential Privacy. For the general Categorical model, we show how this maximization problem reduces to a convex linear program and can therefore be solved with known optimization algorithms.
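The ingredients named in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the optimization procedure of the paper: the differential privacy constraint is taken in its usual local form (for every output category j, the entries of column j of M differ by at most a factor e^ε), and the matrix built here is the standard k-ary randomized response matrix, which is one feasible choice and not necessarily the maximizer. All function names are hypothetical.

```python
import math
import random


def randomized_response_matrix(k, epsilon):
    """One feasible PRAM matrix (assumption: k-ary randomized response).

    Keeps a category with probability e^eps / (e^eps + k - 1) and moves it
    to each other category with probability 1 / (e^eps + k - 1).
    """
    denom = math.exp(epsilon) + k - 1
    stay, move = math.exp(epsilon) / denom, 1.0 / denom
    return [[stay if i == j else move for j in range(k)] for i in range(k)]


def satisfies_dp(M, epsilon, tol=1e-12):
    """Check M[i][j] <= e^eps * M[i2][j] for all rows i, i2 and columns j."""
    bound = math.exp(epsilon)
    for j in range(len(M[0])):
        column = [row[j] for row in M]
        if max(column) > bound * min(column) + tol:
            return False
    return True


def mutual_information(prior, M):
    """Mutual information (in nats) between X ~ prior and Y | X=i ~ M[i]."""
    k, m = len(prior), len(M[0])
    marginal = [sum(prior[i] * M[i][j] for i in range(k)) for j in range(m)]
    total = 0.0
    for i in range(k):
        for j in range(m):
            joint = prior[i] * M[i][j]
            if joint > 0:
                total += joint * math.log(M[i][j] / marginal[j])
    return total


def apply_pram(categories, M, rng):
    """Replace each category index i by a draw from row i of M."""
    k = len(M)
    return [rng.choices(range(k), weights=M[c])[0] for c in categories]


if __name__ == "__main__":
    M = randomized_response_matrix(k=3, epsilon=1.0)
    print(satisfies_dp(M, 1.0))
    print(mutual_information([1 / 3] * 3, M))
    print(apply_pram([0, 1, 2, 2, 0], M, random.Random(0)))
```

Smaller values of ε force the rows of M closer together, which lowers the mutual information; the constrained maximization described above searches over all matrices that pass the `satisfies_dp` check instead of fixing one in advance.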
Feature allocation models generalize species sampling models by allowing every observation to belong to more than one species, now called features. Under the popular Bernoulli product model for feature allocation, given n samples, we study the problem of estimating the missing mass $M_n$, namely the expected number of hitherto unseen features that would be observed if one additional individual were sampled. This is motivated by numerous applied problems where the sampling procedure is expensive, in terms of time and/or financial resources, and further samples can only be justified by the possibility of recording new unobserved features. We introduce a simple, robust and theoretically sound nonparametric estimator $\hat{M}_n$ of $M_n$. $\hat{M}_n$ turns out to have the same analytic form as the popular Good-Turing estimator of the missing mass in species sampling models, with the difference that the two estimators have different ranges. We show that $\hat{M}_n$ admits a natural interpretation both as a jackknife estimator and as a nonparametric empirical Bayes estimator, we give provable guarantees for the performance of $\hat{M}_n$ in terms of minimax rate optimality, and we provide an interesting connection between $\hat{M}_n$ and the Good-Turing estimator for species sampling. Finally, we derive non-asymptotic confidence intervals for $\hat{M}_n$, which are easily computable and do not rely on any asymptotic approximation. Our approach is illustrated with synthetic data and SNP data from the ENCODE sequencing genome project.
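As a rough illustration of the estimator described above, the sketch below assumes that "the same analytic form as Good-Turing" means the number of features observed in exactly one of the n individuals, divided by n. That reading, the function name, and the toy data are assumptions made for this example; the confidence intervals mentioned in the abstract are not reproduced.

```python
from collections import Counter


def good_turing_missing_mass(samples):
    """Good-Turing-type estimate for feature allocation data.

    `samples` is a list with one collection of feature labels per
    individual. Returns (features seen in exactly one individual) / n.
    Unlike the species sampling case, the value can exceed 1, because an
    individual may display several features at once.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("at least one sample is required")
    # Count, for each feature, how many individuals display it.
    counts = Counter(f for sample in samples for f in set(sample))
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n


if __name__ == "__main__":
    # Toy data: three individuals; features b, c, d each appear once.
    toy = [{"a", "b"}, {"a", "c"}, {"d"}]
    print(good_turing_missing_mass(toy))  # 3 singletons / 3 individuals = 1.0
```

The toy example also hints at why the ranges differ: with several new features per individual, the estimate is not bounded by 1.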
We present a Bayesian nonparametric Poisson factorization model for network data with an unknown and potentially growing number of overlapping communities. The construction is based on completely random measures and allows the number of communities to either increase with the number of nodes at a specified logarithmic or polynomial rate, or be bounded. We develop asymptotics for the number of nodes and the degree distribution of the network and derive a Markov chain Monte Carlo algorithm for targeting the exact posterior distribution for this model. The usefulness of the approach is illustrated on various real networks.
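To make the idea of Poisson factorization with overlapping communities concrete, here is a heavily simplified finite sketch. It is not the completely random measure construction of the paper: the number of communities is fixed in advance, affiliation weights are drawn from a gamma distribution, and the interaction count between two nodes is Poisson with rate equal to the inner product of their weight vectors. All names, the gamma prior, and the parameter values are assumptions chosen for illustration.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw a Poisson variate with Knuth's method (fine for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_network(n_nodes, n_communities, shape=0.5, scale=1.0, seed=0):
    """Simulate symmetric interaction counts from a finite factorization.

    weights[i][k] is the (assumed gamma-distributed) affiliation of node i
    with community k; counts[i][j] ~ Poisson(sum_k weights[i][k] * weights[j][k]).
    """
    rng = random.Random(seed)
    weights = [
        [rng.gammavariate(shape, scale) for _ in range(n_communities)]
        for _ in range(n_nodes)
    ]
    counts = [[0] * n_nodes for _ in range(n_nodes)]
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            rate = sum(weights[i][k] * weights[j][k] for k in range(n_communities))
            counts[i][j] = counts[j][i] = sample_poisson(rate, rng)
    return weights, counts


if __name__ == "__main__":
    weights, counts = simulate_network(n_nodes=6, n_communities=3, seed=1)
    for row in counts:
        print(row)
```

Because a node can carry non-negligible weight in several communities at once, communities overlap by construction; the nonparametric model in the abstract additionally lets the number of communities be unknown and possibly grow with the number of nodes.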