The problem of privacy-preserving data analysis has a long history spanning multiple disciplines. As electronic data about individuals becomes increasingly detailed, and as technology enables ever more powerful collection and curation of these data, the need increases for a robust, meaningful, and mathematically rigorous definition of privacy, together with a computationally rich class of algorithms that satisfy this definition. Differential privacy is such a definition.

After motivating and discussing the meaning of differential privacy, the preponderance of this monograph is devoted to fundamental techniques for achieving differential privacy, and to the application of these techniques in creative combinations, using the query-release problem as an ongoing example. A key point is that, by rethinking the computational goal, one can often obtain far better results than would be achieved by methodically replacing each step of a non-private computation with a differentially private implementation. Despite some astonishingly powerful computational results, there are still fundamental limitations, not just on what can be achieved with differential privacy but on what can be achieved with any method that protects against a complete breakdown in privacy. Virtually all the algorithms discussed herein maintain differential privacy against adversaries of arbitrary computational power; some are computationally intensive, others are efficient. Computational complexity for both the adversary and the algorithm is discussed.

We then turn from fundamentals to applications other than query release, discussing differentially private methods for mechanism design and machine learning. The vast majority of the literature on differentially private algorithms considers a single, static database that is subject to many analyses. Differential privacy in other models, including distributed databases and computations on data streams, is discussed.
Basic Terms

This section motivates and presents the formal definition of differential privacy, and enumerates some of its key properties.
Formalizing differential privacy

We will think of databases $x$ as being collections of records from a universe $\mathcal{X}$. It will often be convenient to represent databases by their histograms: $x \in \mathbb{N}^{|\mathcal{X}|}$, in which each entry $x_i$ represents the number of elements in the database $x$ of type $i \in \mathcal{X}$ (we abuse notation slightly, letting the symbol $\mathbb{N}$ denote the set of all non-negative integers, including zero). In this representation, a natural measure of the distance between two databases $x$ and $y$ will be their $\ell_1$ distance:

Definition 2.3 (Distance Between Databases). The $\ell_1$ norm of a database $x$ is denoted $\|x\|_1$ and is defined to be:
$$\|x\|_1 = \sum_{i=1}^{|\mathcal{X}|} |x_i|.$$
The $\ell_1$ distance between two databases $x$ and $y$ is $\|x - y\|_1$.

Note that $\|x\|_1$ is a measure of the size of a database $x$ (i.e., the number of records it contains), and $\|x - y\|_1$ is a measure of how many records differ between $x$ and $y$. Databases may also be represented by multisets of rows (elements of $\mathcal{X}$) or even ordered lists of rows, which is a special case o...
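To make the histogram representation and the $\ell_1$ distance concrete, the following minimal sketch (not part of the monograph; the universe, function names, and example records are illustrative assumptions) builds histograms over a small universe and computes $\|x\|_1$ and $\|x - y\|_1$:

```python
from collections import Counter

# Hypothetical universe X of record types, fixed for illustration.
UNIVERSE = ["A", "B", "C"]

def histogram(records):
    """Represent a database as a histogram x in N^{|X|}: one count per type in X."""
    counts = Counter(records)
    return [counts[t] for t in UNIVERSE]

def l1_norm(x):
    """||x||_1 = sum_i x_i, i.e., the number of records in the database."""
    return sum(x)

def l1_distance(x, y):
    """||x - y||_1 = sum_i |x_i - y_i|: how many records differ between x and y."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

x = histogram(["A", "A", "B"])   # histogram [2, 1, 0]
y = histogram(["A", "B", "B"])   # histogram [1, 2, 0]
print(l1_norm(x))       # 3: x contains three records
print(l1_distance(x, y))  # 2: the two databases differ in two records
```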