A privacy-constrained information extraction problem is considered where for a pair of correlated discrete random variables (X, Y) governed by a given joint distribution, an agent observes Y and wants to convey to a potentially public user as much information about Y as possible while limiting the amount of information revealed about X. To this end, the so-called rate-privacy function is investigated to quantify the maximal amount of information (measured in terms of mutual information) that can be extracted from Y under a privacy constraint between X and the extracted information, where privacy is measured using either mutual information or maximal correlation. Properties of the rate-privacy function are analyzed and its information-theoretic and estimation-theoretic interpretations are presented for both the mutual information and maximal correlation privacy measures. It is also shown that the rate-privacy function admits a closed-form expression for a large family of joint distributions of (X, Y). Finally, the rate-privacy function under the mutual information privacy measure is considered for the case where (X, Y) has a joint probability density function by studying the problem where the extracted information is a uniform quantization of Y corrupted by additive Gaussian noise. The asymptotic behavior of the rate-privacy function is studied as the quantization resolution grows without bound and it is observed that not all of the properties of the rate-privacy function carry over from the discrete to the continuous case.
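In formula form, the rate-privacy function described above amounts to the following optimization (a reconstruction from this abstract; the paper's exact notation may differ): for an allowed leakage level ε ≥ 0,

\[
g_{\varepsilon}(X;Y) \;=\; \sup_{\substack{P_{Z \mid Y} \;:\; X \text{ -- } Y \text{ -- } Z \\ I(X;Z) \,\le\, \varepsilon}} I(Y;Z),
\]

where Z is the information extracted from Y. Replacing the constraint I(X;Z) ≤ ε with a bound on the maximal correlation between X and Z gives the second variant of the privacy measure mentioned above.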
We investigate the problem of estimating a random variable Y under a privacy constraint dictated by another correlated random variable X. When X and Y are discrete, we express the underlying privacy-utility tradeoff in terms of the privacy-constrained guessing probability h(PXY, ε), the maximum probability Pc(Y|Z) of correctly guessing Y given an auxiliary random variable Z, where the maximization is taken over all PZ|Y ensuring that Pc(X|Z) ≤ ε for a given privacy threshold ε ≥ 0. We prove that h(PXY, ·) is concave and piecewise linear, which allows us to derive its expression in closed form for any ε when X and Y are binary. In the non-binary case, we derive h(PXY, ε) in the high utility regime (i.e., for sufficiently large, but nontrivial, values of ε) under the assumption that Y and Z have the same alphabets. We also analyze the privacy-constrained guessing probability for two scenarios in which X, Y and Z are binary vectors. When X and Y are continuous random variables, we formulate the corresponding privacy-utility tradeoff in terms of sENSR(PXY, ε), the smallest normalized minimum mean-squared error (mmse) incurred in estimating Y from a Gaussian perturbation Z. Here the minimization is taken over a family of Gaussian perturbations Z for which the mmse of f(X) given Z is within a factor 1 − ε from the variance of f(X) for any non-constant real-valued function f. We derive tight upper and lower bounds for sENSR when Y is Gaussian. For general absolutely continuous random variables, we obtain a tight lower bound for sENSR(PXY, ε) in the high privacy regime, i.e., for small ε.
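Written out, the two tradeoff quantities described above take the following form (notation reconstructed from this abstract):

\[
P_c(Y \mid Z) \;=\; \sum_{z} \max_{y} P_{YZ}(y,z),
\qquad
h(P_{XY}, \varepsilon) \;=\; \sup_{P_{Z \mid Y} \;:\; P_c(X \mid Z) \,\le\, \varepsilon} P_c(Y \mid Z),
\]

and, for the continuous case,

\[
\mathrm{sENSR}(P_{XY}, \varepsilon) \;=\; \inf_{Z} \, \frac{\mathrm{mmse}(Y \mid Z)}{\mathrm{var}(Y)}
\quad \text{subject to} \quad
\mathrm{mmse}\big(f(X) \mid Z\big) \,\ge\, (1-\varepsilon)\,\mathrm{var}\big(f(X)\big) \ \text{for every non-constant } f,
\]

where the infimum is over the family of Gaussian perturbations Z of Y considered in the paper.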
Recently, a parameterized class of loss functions called α-loss, α ∈ [1, ∞], has been introduced for classification. This family, which includes the log-loss and the 0-1 loss as special cases, comes with compelling properties including an equivalent margin-based form which is classification-calibrated for all α. We introduce a generalization of this family to the entire range of α ∈ (0, ∞] and establish how the parameter α enables the practitioner to choose among a host of operating conditions that are important in modern machine learning tasks. We prove that smaller α values are more conducive to faster optimization; in fact, α-loss is convex for α ≤ 1 and quasi-convex for α > 1. Moreover, we establish bounds to quantify the degradation of the local quasi-convexity of the optimization landscape as α increases; we show that this directly translates to a computational slowdown. On the other hand, our theoretical results also suggest that larger α values lead to better generalization performance. This is a consequence of the ability of the α-loss to limit the effect of less likely data as α increases from 1, thereby facilitating robustness to outliers and noise in the training data. We provide strong evidence supporting this assertion with several experiments on benchmark datasets that establish the efficacy of α-loss for α > 1 in robustness to errors in the training data. Of equal interest is the fact that, for α < 1, our experiments show that the decreased robustness seems to counteract class imbalances in training data.
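As a concrete illustration of the tunable family discussed above, the sketch below evaluates the α-loss at the probability assigned to the correct label, using the commonly cited form ℓ_α(p) = (α/(α−1))(1 − p^(1−1/α)); this formula and the helper name alpha_loss are assumptions for illustration, since the abstract does not spell out the parameterization.

```python
import numpy as np

def alpha_loss(p_true, alpha):
    """alpha-loss at the probability p_true assigned to the correct class.

    Assumed form (not given in the abstract):
        l_alpha(p) = (alpha / (alpha - 1)) * (1 - p ** (1 - 1/alpha)),
    with alpha -> 1 recovering the log-loss -log(p) and alpha -> inf giving 1 - p.
    """
    p = np.asarray(p_true, dtype=float)
    if np.isinf(alpha):
        return 1.0 - p            # "soft" 0-1 loss limit
    if np.isclose(alpha, 1.0):
        return -np.log(p)         # log-loss limit
    return (alpha / (alpha - 1.0)) * (1.0 - p ** (1.0 - 1.0 / alpha))

# Smaller alpha penalizes low-confidence predictions more aggressively,
# while larger alpha caps the penalty, tempering the influence of outliers.
for a in (0.5, 1.0, 2.0, np.inf):
    print(a, alpha_loss([0.9, 0.5, 0.1], a))
```

The printed values show the qualitative behavior claimed above: as α grows past 1, badly misclassified points (small p) contribute a bounded loss rather than an unbounded one.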
Consider a data publishing setting for a dataset composed of both private and non-private features. The publisher uses an empirical distribution, estimated from n i.i.d. samples, to design a privacy mechanism which is then applied to fresh samples. In this paper, we study the discrepancy between the privacy-utility guarantees for the empirical distribution, used to design the privacy mechanism, and those for the true distribution, experienced by the privacy mechanism in practice. We first show that, for any privacy mechanism, these discrepancies vanish at rate O(1/√n) with high probability. These bounds follow from our main technical results regarding the Lipschitz continuity of the considered information leakage measures. Then we prove that the optimal privacy mechanisms for the empirical distribution approach the corresponding mechanisms for the true distribution as the sample size n increases, thereby establishing the statistical consistency of the optimal privacy mechanisms. Finally, we introduce and study uniform privacy mechanisms which, by construction, provide privacy to all the distributions within a neighborhood of the estimated distribution and, thereby, guarantee privacy for the true distribution with high probability.
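The numerical sketch below illustrates the kind of discrepancy studied above, using mutual information between the private and non-private features as a stand-in leakage measure on a toy binary pair; the actual leakage measures and mechanism optimization used in the paper are not specified in this abstract, so the distribution p_true and the helper names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint pmf over (private X, non-private Y), both binary.
p_true = np.array([[0.4, 0.1],
                   [0.1, 0.4]])

def mutual_information(pxy):
    """I(X;Y) in nats for a joint pmf given as a 2-D array."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

def empirical_pmf(n):
    """Empirical joint pmf estimated from n i.i.d. samples of (X, Y)."""
    idx = rng.choice(4, size=n, p=p_true.ravel())
    counts = np.bincount(idx, minlength=4).reshape(2, 2)
    return counts / n

# The gap between the leakage evaluated on the empirical distribution and on the
# true one shrinks roughly like 1/sqrt(n), mirroring the O(1/sqrt(n)) rate above.
for n in (100, 1_000, 10_000, 100_000):
    gap = abs(mutual_information(empirical_pmf(n)) - mutual_information(p_true))
    print(f"n={n:>6}  |I_emp - I_true| = {gap:.4f}")
```

In the same spirit, the uniform mechanisms mentioned above are designed against all distributions within a neighborhood of the empirical estimate, so that the true distribution is covered with high probability.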