We present a Bayesian model for estimating the joint distribution of multivariate categorical data when units are nested within groups. Such data arise frequently in social science settings, for example, people living in households. The model assumes that (i) each group is a member of a group-level latent class, and (ii) each unit is a member of a unit-level latent class nested within its group-level latent class. This structure allows the model to capture dependence among units in the same group. It also facilitates simultaneous modeling of variables at both group and unit levels. We develop a version of the model that assigns zero probability to groups and units with physically impossible combinations of variables. We apply the model to estimate multivariate relationships in a subset of the American Community Survey. Using the estimated model, we generate synthetic household data that could be disseminated as redacted public use files. Supplementary materials for this article are available online.
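The two-level structure in (i) and (ii) can be illustrated with a small generative sketch. This is a toy illustration under stated assumptions, not the fitted model from the article: the function name `sample_household`, the probability tables, and the choice of one group-level and one unit-level variable are all hypothetical, and the zero-probability constraint on impossible combinations is not shown.

```python
import random

# Hypothetical toy parameters (not estimated from any data).
GROUP_PROBS = [0.6, 0.4]                      # P(group class g)
UNIT_PROBS = [[0.7, 0.3], [0.2, 0.8]]         # P(unit class m | g)
GROUP_VAR_PROBS = [[0.9, 0.1], [0.3, 0.7]]    # P(group variable | g)
UNIT_VAR_PROBS = [                            # P(unit variable | g, m)
    [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]],
    [[0.2, 0.2, 0.6], [0.1, 0.1, 0.8]],
]


def draw(rng, probs):
    """Draw one category index from a list of probabilities."""
    return rng.choices(range(len(probs)), weights=probs)[0]


def sample_household(rng, group_probs, unit_probs,
                     group_var_probs, unit_var_probs, size):
    """Draw one household from a toy nested latent class model."""
    # (i) The household is assigned a group-level latent class.
    g = draw(rng, group_probs)
    # The group-level variable depends only on the group class.
    group_value = draw(rng, group_var_probs[g])
    members = []
    for _ in range(size):
        # (ii) Each member gets a unit-level class nested within g.
        m = draw(rng, unit_probs[g])
        # The unit-level variable depends on both g and m.
        members.append(draw(rng, unit_var_probs[g][m]))
    return g, group_value, members


rng = random.Random(0)
print(sample_household(rng, GROUP_PROBS, UNIT_PROBS,
                       GROUP_VAR_PROBS, UNIT_VAR_PROBS, size=4))
```

Because every member's class is drawn conditionally on the shared group class, members of the same household are dependent once the latent classes are marginalized out, which is the within-group dependence the abstract refers to.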
Summary
Statistical agencies are increasingly adopting synthetic data methods for disseminating microdata without compromising the privacy of respondents. Crucial to the implementation of these approaches are flexible models that can capture the nuances of the multivariate structure in the original data. In the case of multivariate categorical data, preserving this structure often also involves respecting structural zeros: combinations of responses that cannot logically be present in any data set, such as married toddlers or pregnant men. Ignoring structural zeros can result in both logically inconsistent synthetic data and biased estimates. Here we propose the use of a Bayesian non-parametric method for generating discrete multivariate synthetic data subject to structural zeros. This method can preserve complex multivariate relationships between variables, can be applied to high dimensional data sets with massive collections of structural zeros, requires minimal tuning from the user and is computationally efficient. We demonstrate our approach by synthesizing an extract of 17 variables from the 2000 US census. Our method produces synthetic samples with high analytic utility and low disclosure risk.
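To make the role of structural zeros concrete, the sketch below draws synthetic records from a small mixture of independent categorical distributions and discards any draw that hits an impossible combination. It is a minimal illustration using plain rejection sampling, not the estimation method proposed in the abstract. The variable names, the rule in `is_structural_zero`, and all probabilities are assumptions made up for the example.

```python
import random

# Hypothetical mixture: two latent classes, each with independent
# categorical distributions over two variables.
CLASS_WEIGHTS = [0.5, 0.5]
CLASS_PARAMS = [
    {"age_group": {"0-14": 0.7, "15+": 0.3},
     "marital": {"single": 0.9, "married": 0.1}},
    {"age_group": {"0-14": 0.1, "15+": 0.9},
     "marital": {"single": 0.4, "married": 0.6}},
]


def is_structural_zero(record):
    """Hypothetical rule: a child under 15 cannot be married."""
    return record["age_group"] == "0-14" and record["marital"] == "married"


def sample_record(rng, class_weights, class_params):
    """Draw one record from the unconstrained mixture."""
    k = rng.choices(range(len(class_weights)), weights=class_weights)[0]
    record = {}
    for var, dist in class_params[k].items():
        levels = list(dist)
        weights = [dist[level] for level in levels]
        record[var] = rng.choices(levels, weights=weights)[0]
    return record


def sample_valid(rng, class_weights, class_params, n):
    """Keep drawing until n records avoid every structural zero."""
    records = []
    while len(records) < n:
        record = sample_record(rng, class_weights, class_params)
        if not is_structural_zero(record):
            records.append(record)
    return records


synthetic = sample_valid(random.Random(42), CLASS_WEIGHTS, CLASS_PARAMS, 200)
print(len(synthetic), "records, none violating the rule")
```

Rejecting impossible draws is equivalent to sampling from the mixture restricted to the feasible cells and renormalized. Fitting the mixture so that this truncation is accounted for during estimation is a separate problem that this sketch does not address.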
In this paper we investigate whether generating synthetic data can be a viable strategy for giving external researchers access to detailed geocoding information without compromising the confidentiality of the units included in the database. This research was motivated by a recent project at the Institute for Employment Research (IAB) in Germany that linked exact geocodes to the Integrated Employment Biographies, a large administrative database containing several million records. Based on these data, we evaluate how well several synthesizers address the trade-off between preserving analytical validity and limiting the risk of disclosure. We propose strategies for making the synthesizers scalable for such large files, present analytical validity measures for the generated data and provide general recommendations for statistical agencies considering the synthetic data