2018
DOI: 10.1111/rssa.12358

General and Specific Utility Measures for Synthetic Data

Abstract: Summary. Data holders can produce synthetic versions of data sets when concerns about potential disclosure restrict the availability of the original records. The paper is concerned with methods to judge whether such synthetic data have a distribution that is comparable with that of the original data: what we term general utility. We consider how general utility compares with specific utility: the similarity of results of analyses from the synthetic data and the original data. We adapt a previous general measur…


Cited by 137 publications (121 citation statements)
References 25 publications
“…While other machine learning models have been shown to outperform CART in many applications, we would have a far weaker bound on the sensitivity and would need to add much more noise. Secondly, as was shown in Snoke et al (2018), CART models exhibit at least satisfactory performance in determining the distributional similarity. Future work may prove desirable bounds on the sensitivity of the pMSE when using stronger classifiers, in which case those models should certainly be adopted.…”
Section: Estimating the pMSE Using Classification and Regression Trees
confidence: 72%
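The passage above describes estimating the pMSE with a CART model: fit a classifier to distinguish original from synthetic records, then average the squared deviation of the predicted probabilities from their expected value. A minimal sketch under stated assumptions — `orig` and `synth` are illustrative names for numeric DataFrames with matching columns, and scikit-learn's `DecisionTreeClassifier` stands in for a generic CART implementation (this is not the cited authors' code):

```python
# Sketch: CART-based pMSE. Assumes `orig` and `synth` are numeric pandas
# DataFrames with identical columns; DecisionTreeClassifier is an illustrative
# stand-in for the CART model discussed in the quoted passage.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def pmse_cart(orig: pd.DataFrame, synth: pd.DataFrame, max_depth: int = 5) -> float:
    X = pd.concat([orig, synth], ignore_index=True)
    # Indicator: 0 = original record, 1 = synthetic record
    t = np.concatenate([np.zeros(len(orig)), np.ones(len(synth))])
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0).fit(X, t)
    p = tree.predict_proba(X)[:, 1]        # propensity score per record
    c = len(synth) / len(X)                # expected score if distributions match
    return float(np.mean((p - c) ** 2))    # pMSE: 0 means indistinguishable files
```

When the two files are the same size, c = 0.5 and the pMSE lies in [0, 0.25]; values near 0 mean the tree cannot tell the files apart.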
“…To provide a more comprehensive measure of quality of the synthetic data relative to the confidential data, we compute the pMSE (propensity score mean-squared error; Woo et al., 2009; Snoke et al., 2018b; Snoke et al., 2018a): the mean-squared error of the predicted probabilities (i.e., propensity scores) for those two databases. Specifically, the pMSE is a metric that assesses the distributional similarity between the synthetic data and the confidential data.…”
Section: Measuring Outcomes
confidence: 99%
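The propensity-score pMSE cited here can be sketched with the logistic-regression model of Woo et al. (2009). The names `conf`/`synth` and the model settings are illustrative assumptions, not code from the cited papers:

```python
# Sketch: propensity-score pMSE with logistic regression (the classical choice
# in Woo et al., 2009). `conf` and `synth` are illustrative names for numeric
# DataFrames with matching columns.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def pmse_logit(conf: pd.DataFrame, synth: pd.DataFrame) -> float:
    X = pd.concat([conf, synth], ignore_index=True)
    t = np.concatenate([np.zeros(len(conf)), np.ones(len(synth))])
    model = LogisticRegression(max_iter=1000).fit(X, t)
    p = model.predict_proba(X)[:, 1]   # propensity score per record
    c = len(synth) / len(X)            # 0.5 for equally sized files
    return float(np.mean((p - c) ** 2))
```

A small pMSE indicates the model cannot discriminate synthetic from confidential records, i.e., high distributional similarity.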
“…Second, we assess whether measures of economic growth vary between both data sets using dynamic panel data models. Finally, to assess the analytical validity from a more general perspective, we compute global validity measures based on the ideas of propensity score matching as proposed by Woo et al (2009) and Snoke et al (2018a).…”
Section: Introduction
confidence: 99%
“…There are two broad approaches for assessing the utility of a synthesized dataset: general utility and specific utility (Snoke et al, 2018). General utility reflects the overall similarities in the statistical properties and multivariate relationships between the synthetic and original datasets.…”
confidence: 99%
“…Visualizing bivariate comparisons between specific variables of interest is also recommended, as two datasets might have similar statistical properties despite different distributions (e.g., Anscombe's quartet; Anscombe, 1973). Confirming general utility is a necessary step for making inferences from the synthetic dataset and especially important for data exploration in the synthetic dataset (Snoke et al, 2018).…”
confidence: 99%