Hypothesis testing is one of the most common types of data analysis and forms the backbone of scientific research in many disciplines. Analysis of variance (ANOVA) in particular is used to detect dependence between a categorical and a numerical variable. Here we show how one can carry out this hypothesis test under the restrictions of differential privacy. We show that the $F$-statistic, the optimal test statistic in the public setting, is no longer optimal in the private setting, and we develop a new test statistic $F_1$ with much higher statistical power. We show how to rigorously compute a reference distribution for the $F_1$ statistic and give an algorithm that outputs accurate p-values. We implement our test and experimentally optimize several parameters. We then compare our test to the only previous work on private ANOVA testing, using the same effect size as that work. We see an order of magnitude improvement, with our test requiring only 7% as much data to detect the effect.

... that this gene must indeed affect the given health outcome. (For more detail on how ANOVA is used in this setting, see [12].)

The analysis described above assumes that the researcher has full access to the database. However, there are many settings in medicine, psychology, education, and economics (not to mention private-sector data analysis) where the database is not available to the analyst due to privacy concerns. A well-established solution is to allow the researcher to issue queries to the data that are proven to satisfy differential privacy. Differential privacy requires the addition of random noise to statistical queries and guarantees that the results reveal very little about any individual's data.

In this paper we propose a new statistic for ANOVA, called $F_1$, that is specifically tailored to the differentially private setting.
This statistic measures the same variations as the $F$-statistic, but uses $|a - b|$ instead of $(a - b)^2$ to measure the distance between $a$ and $b$. In the public setting $F_1$ is a worse test statistic than the traditional $F$-statistic, but we show that in the private setting it has much higher power than the previously published differentially private $F$-statistic. That is, we show that it can detect effects with as little as 7% of the data that was previously required. (In one example, an effect that took 5300 data points to detect 90% of the time with $\epsilon = 1$ in the prior work takes only 350 data points to detect using our new hypothesis test.)
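To make the contrast between squared and absolute distances concrete, the following is a minimal, non-private sketch. It assumes that the absolute-value statistic keeps the same normalization by degrees of freedom ($k - 1$ and $n - k$) as the classical $F$-statistic; the precise definition of $F_1$ is the one given in Section 3, and the function name `variation_ratio` and its `power` parameter are hypothetical names chosen here for illustration. No noise is added, so this code provides no privacy guarantee.

```python
import numpy as np

def variation_ratio(values, groups, power):
    """Ratio of between-group to within-group variation.

    power=2 gives the classical one-way ANOVA F-statistic.
    power=1 replaces each squared distance with an absolute distance,
    in the spirit of F_1 (the normalization is an assumption of this sketch).
    """
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    n, k = len(values), len(labels)
    grand_mean = values.mean()

    between = 0.0  # variation of group means around the grand mean
    within = 0.0   # variation of data points around their own group mean
    for label in labels:
        y = values[groups == label]
        group_mean = y.mean()
        between += len(y) * abs(group_mean - grand_mean) ** power
        within += np.sum(np.abs(y - group_mean) ** power)

    # Normalize each term by its degrees of freedom, as in the F-statistic.
    return (between / (k - 1)) / (within / (n - k))

# Toy data: three groups of three observations each.
values = [1, 2, 3, 2, 3, 4, 5, 6, 7]
groups = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
f_classic = variation_ratio(values, groups, power=2)
f_absolute = variation_ratio(values, groups, power=1)
print(f_classic, f_absolute)
```

Both calls measure the same two sources of variation; only the distance function changes. A private version would additionally need noise calibrated to the sensitivity of each term, which this sketch does not attempt.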
Contributions and organization

We first review differential privacy, hypothesis testing, and the body of work that lies at the intersection of the two fields (Section 2). In Section 3 we then present a new test statistic, $F_1$, for ANOVA in the private setting. While there is some work on differentially private hypothesis testing, designing a new test statistic explicitly tailored for compatibility with differential privacy has been done by few others [14].