Synthesizing geocodes to facilitate access to detailed geographical information in large scale administrative data

Drechsler, Jörg; Hu, Jingchen

doi:10.48550/arxiv.1803.05874

Cited by 4 publications

(12 citation statements)

References 24 publications

“…One stream of work has treated the geographic location as variable(s) carrying little geographic information, therefore their proposed synthesizers do not incorporate spatial modeling. Wang and Reiter (2012); Drechsler and Hu (2018+) developed CART models (Reiter, 2005c) to synthesize continuous longitude and latitude. In addition, Drechsler and Hu (2018+) combined the continuous longitude and latitude variables into a single categorical geographic variable, and used versions of categorical CART models for its synthesis.…”

Section: Synthesis Of Locations (mentioning)

confidence: 99%

“…In addition to the expected match risks, measures such as the true match rate (the percentage of true unique matches among target records) and the false match rate (the percentage of false matches among unique matches) are also useful (Reiter and Mitra, 2009; Drechsler and Reiter, 2010; Hu and Hoshino, 2018; Hu, 2018+; Drechsler and Hu, 2018+).…”

Section: Disclosure Risks (mentioning)

confidence: 99%
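
The two rates defined in the statement above can be illustrated with a minimal sketch. It assumes that, for each target record, an intruder's matching step has already produced the number of best-matching candidates and a flag for whether the true record is among them. The function name `match_rates` and the inputs `c` and `t` are hypothetical and do not come from the cited papers.

```python
from typing import Sequence, Tuple


def match_rates(c: Sequence[int], t: Sequence[bool]) -> Tuple[float, float]:
    """Return (true match rate, false match rate) as fractions in [0, 1].

    c[i]: number of candidate records tied as the best match for target i
          (assumed to be at least 1).
    t[i]: True if the correct record is among those c[i] candidates.
    Both inputs are hypothetical outputs of an intruder's matching step.
    """
    if len(c) != len(t) or len(c) == 0:
        raise ValueError("c and t must be non-empty and of equal length")

    n_targets = len(c)
    # A unique match is a target with exactly one best-matching candidate.
    n_unique = sum(1 for ci in c if ci == 1)
    # A true unique match is a unique match that is also the correct record.
    n_true_unique = sum(1 for ci, ti in zip(c, t) if ci == 1 and ti)
    n_false_unique = n_unique - n_true_unique

    # True match rate: true unique matches among all target records.
    true_rate = n_true_unique / n_targets
    # False match rate: false matches among unique matches
    # (reported as 0.0 here when there are no unique matches).
    false_rate = n_false_unique / n_unique if n_unique else 0.0
    return true_rate, false_rate


if __name__ == "__main__":
    # Four targets: three unique matches, one of which is wrong.
    print(match_rates([1, 1, 2, 1], [True, False, True, True]))
```

Returning 0.0 for the false match rate when no unique matches exist is a convention chosen for this sketch; a caller may prefer to treat that case as undefined.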

“…and a blocked Gibbs sampler is implemented for the Markov chain Monte Carlo sampling procedure (Ishwaran and James, 2001; Si and Reiter, 2013; Hu et al, 2014; Drechsler and Hu, 2018+).…”

Section: Dirichlet Process Mixtures Of Product Of Multinomials (Dpmpm) (mentioning)

confidence: 99%

“…the number of records that are uniquely matched among n target records). There are three widely used file-level summaries of identification disclosure probabilities using the notations and definitions given above (Reiter and Mitra, 2009; Drechsler and Reiter, 2010; Hu and Hoshino, 2018; Hu, 2018+; Drechsler and Hu, 2018+).…”

Section: Identification Disclosure Risks 421 Three Summaries Of Ident... (mentioning)

confidence: 99%

“…Xing (2009); Si and Reiter (2013); Hu et al (2014); Drechsler and Hu (2018+) and set a_α = b_α = 0.25, and set uninformative priors for θ by a j = 1, ..., p. We set K = 40 and track the number of occupied latent classes with 95% interval (28, 36), indicating K = 40 is sufficiently large. We generate m = 20 synthetic datasets by using parameters in iterations that are far away from each other to guarantee independence. We label the m = 20 synthetic datasets generated by the DPMPM synthesizer as Z_DPMPM. Computation of the DP-areal synthesizer is done using Stan programming language (Stan Development Team, 2016).…”

mentioning

confidence: 99%

Bayesian Data Synthesis and Disclosure Risk Quantification: An Application to the Consumer Expenditure Surveys

Hu, Savitsky

2018

Preprint

Self Cite

The release of synthetic data generated from a model estimated on the data helps statistical agencies disseminate respondent-level data with high utility and privacy protection. Motivated by the challenge of disseminating sensitive variables containing geographic information in the Consumer Expenditure Surveys (CE) at the U.S. Bureau of Labor Statistics, we propose two non-parametric Bayesian models as data synthesizers for the county identifier of each data record: a Bayesian latent class model and a Bayesian areal model. Both data synthesizers use Dirichlet Process priors to cluster observations of similar characteristics and allow borrowing information across observations. We develop innovative disclosure risks measures to quantify 1

Bayesian Estimation of Attribute and Identification Disclosure Risks in Synthetic Data

2018

Preprint

Self Cite

The synthetic data approach to data confidentiality has been actively researched on, and for the past decade or so, a good number of high quality work on developing innovative synthesizers, creating appropriate utility measures and risk measures, among others, have been published. Comparing to a large volume of work on synthesizers development and utility measures creation, measuring risks has overall received less attention. This paper focuses on the detailed construction of some Bayesian methods proposed for estimating disclosure risks in synthetic data. In the processes of presenting attribute and identification disclosure risks evaluation methods, we highlight key steps, emphasize Bayesian thinking, illustrate with real application examples, and discuss challenges and future research directions. We hope to give the readers a comprehensive view of the Bayesian estimation procedures, enable synthetic data researchers and producers to use these procedures to evaluate disclosure risks, and encourage more researchers to work in this important growing field.

Risk-Efficient Bayesian Data Synthesis for Privacy Protection

Hu, Savitsky, Williams

2019

Preprint

Self Cite

High-utility and low-risks synthetic data facilitates microdata dissemination by statistical agencies. In a previous work, we induced privacy protection into any Bayesian data synthesis model by employing a pseudo posterior likelihood that exponentiates each contribution by an observation record-indexed weight ∈ [0, 1], defined to be inversely proportional to the marginal identification risk for that record. The marginal probability of identification risk for a record is composed as the probability that the identity of the record may be disclosed, conditioned on assumptions about a putative intruder's behavior. Relatively risky records with high marginal probabilities of identification risk tend to be isolated from other records. The downweighting of their likelihood contribution will tend to shrink the synthetic data value for those high-risk records, which in turn often tends to increase the isolation of other moderate-risk records. The result is that the identification risk actually increases for some moderate-risk records after risk-weighted pseudo posterior estimation synthesis, compared to
