The Census Bureau has announced new methods for disclosure control in public use data products. The new approach, known as differential privacy, represents a radical departure from current practice. In its pure form, differential privacy techniques may make the release of useful microdata impossible and limit the utility of tabular small-area data. Adoption of differential privacy will have far-reaching consequences for research. It is likely that scientists, planners, and the public will lose the free access they have enjoyed for six decades to reliable public Census Bureau data describing US social and economic change.
Conducting temporal analysis of census data often requires applying areal interpolation to integrate data that have been spatially aggregated using incompatible zoning systems. This article introduces a method of areal interpolation, target-density weighting (TDW), that is useful for long-term temporal analysis because it requires only readily available historical data and basic geographic information system operations. Then, through regression analysis of a large sample of U.S. census tract data, a model is produced that relates the error in TDW estimates of tract population to four basic properties of tracts. An analysis of model residuals combined with theorized absolute limits on interpolation error yields formulas with which we can compute upper and lower prediction bounds on the population in a tract of one census at the time of a different census. These prediction intervals enable the interpretation of different interpolated estimates with appropriately varying degrees of uncertainty.
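The core allocation step of target-density weighting can be illustrated with a short sketch. The zone identifiers, counts, densities, and intersection areas below are hypothetical, and in practice the intersection areas would come from a GIS overlay of the two censuses' tract layers; this is only an illustration of the weighting logic, not the article's implementation.

```python
# Target-density weighting (TDW) sketch: allocate each source zone's count
# to target zones in proportion to (target-zone density x intersection area).
# All IDs and numbers are hypothetical illustrative inputs.

source_counts = {"s1": 1200, "s2": 800}      # variable of interest in source zones
target_density = {"t1": 25.0, "t2": 5.0}     # ancillary density in target zones (e.g., persons/km^2)
intersection_area = {                        # area of each source-target intersection (km^2)
    ("s1", "t1"): 10.0, ("s1", "t2"): 30.0,
    ("s2", "t1"): 4.0,  ("s2", "t2"): 16.0,
}

def tdw_interpolate(source_counts, target_density, intersection_area):
    """Return estimated counts for each target zone."""
    estimates = {t: 0.0 for t in target_density}
    for s, y_s in source_counts.items():
        # Weight each intersection by target density times intersection area.
        weights = {
            t: target_density[t] * intersection_area.get((s, t), 0.0)
            for t in target_density
        }
        total = sum(weights.values())
        if total == 0:
            continue  # no weighted overlap for this source zone; nothing to allocate
        for t, w in weights.items():
            estimates[t] += y_s * w / total
    return estimates

print(tdw_interpolate(source_counts, target_density, intersection_area))
```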
Areal interpolation transforms data for a variable of interest from a set of source zones to estimate the same variable's distribution over a set of target zones. One common practice has been to guide interpolation by using ancillary control zones that are related to the variable of interest's spatial distribution. This guidance typically involves using source zone data to estimate the density of the variable of interest within each control zone. This article introduces a novel approach to density estimation, the geographically weighted expectation-maximization (GWEM) algorithm, which combines features of two previously used techniques, the expectation-maximization (EM) algorithm and geographically weighted regression. The EM algorithm provides a framework for incorporating proper constraints on data distributions, and using geographical weighting allows estimated control-zone density ratios to vary spatially. We assess the accuracy of GWEM by applying it with land-use/land-cover ancillary data to population counts from a nationwide sample of 1980 United States census tract pairs. We find that GWEM generally is more accurate in this setting than several previously studied methods. Because target-density weighting (TDW)—using 1970 tract densities to guide interpolation—outperforms GWEM in many cases, we also consider two GWEM-TDW hybrid approaches, and find them to improve estimates substantially.
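A minimal sketch of the EM framework that GWEM builds on is given below, assuming hypothetical source-zone counts and land-cover class areas. It shows the basic iteration for estimating one density per control-zone class; GWEM's geographic weighting, which lets those densities vary across space, is noted only in a comment and not implemented here.

```python
# EM sketch for estimating control-zone (e.g., land-cover class) densities
# used to guide areal interpolation.  Inputs are hypothetical.
# GWEM, as described above, would additionally apply distance-decay weights
# in the M-step so that class densities vary locally; that extension is
# omitted here for brevity.

source_counts = {"s1": 1000, "s2": 500}
# Area of each land-cover class within each source zone (km^2).
class_area = {
    ("s1", "developed"): 8.0, ("s1", "forest"): 12.0,
    ("s2", "developed"): 2.0, ("s2", "forest"): 18.0,
}
classes = ["developed", "forest"]

def em_class_densities(source_counts, class_area, classes, n_iter=50):
    """Iteratively estimate one density per land-cover class."""
    density = {c: 1.0 for c in classes}  # initial guess
    for _ in range(n_iter):
        # E-step: allocate each source count across classes in proportion
        # to (current class density x class area within the source zone).
        alloc = {}
        for s, y_s in source_counts.items():
            weights = {c: density[c] * class_area.get((s, c), 0.0) for c in classes}
            total = sum(weights.values())
            for c in classes:
                alloc[(s, c)] = y_s * weights[c] / total if total > 0 else 0.0
        # M-step: re-estimate each class density from the allocated counts.
        for c in classes:
            area = sum(class_area.get((s, c), 0.0) for s in source_counts)
            count = sum(alloc[(s, c)] for s in source_counts)
            density[c] = count / area if area > 0 else 0.0
    return density

print(em_class_densities(source_counts, class_area, classes))
```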
To measure population changes in areas where census unit boundaries do not align across time, a common approach is to interpolate data from one census’s units to another’s. This article presents a broad assessment of areal interpolation models for estimating counts of 2000 characteristics in 2010 census units throughout the United States. We interpolate from 2000 census block data using 4 types of ancillary data to guide interpolation: 2010 block densities, imperviousness data, road buffers, and water body polygons. We test 8 binary dasymetric (BD) models and 8 target-density weighting (TDW) models, each using a unique combination of the 4 ancillary data types, and derive 2 hybrid models that blend the best-performing BD and TDW models. The most accurate model is a hybrid that generally gives high weight to TDW (allocating 2000 data in proportion to 2010 densities) but gives increasing weight to a BD model (allocating data uniformly within developed land near roads) in proportion to the estimated 2000–2010 rate of change within each block. Although for most 2010 census units, this hybrid model’s estimates differ little from the simplest model’s estimates, there are still many areas where the estimates differ considerably. Estimates from the final model, along with lower and upper bounds for each estimate, are publicly available for over 1,000 population and housing characteristics at 10 geographic levels via the National Historical Geographic Information System (NHGIS – http://nhgis.org).
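The abstract does not give the exact weighting function used to blend the BD and TDW models, so the sketch below only illustrates the general blending logic: a hypothetical weight on the BD estimate that grows with the estimated rate of change in a block, with all numbers illustrative.

```python
# Hypothetical blend of a binary dasymetric (BD) estimate and a
# target-density weighting (TDW) estimate for one block, with the BD weight
# growing with the estimated 2000-2010 rate of change.  The actual NHGIS
# weighting function is not specified in the abstract; this is illustration only.

def hybrid_estimate(tdw_est, bd_est, rate_of_change, scale=1.0):
    """Blend TDW and BD estimates; more change -> more weight on BD."""
    # Map the (nonnegative) rate of change to a weight in [0, 1).
    w_bd = rate_of_change / (rate_of_change + scale)
    return (1.0 - w_bd) * tdw_est + w_bd * bd_est

# A stable block leans on TDW; a fast-changing block leans on BD.
print(hybrid_estimate(tdw_est=120.0, bd_est=90.0, rate_of_change=0.05))
print(hybrid_estimate(tdw_est=120.0, bd_est=90.0, rate_of_change=4.0))
```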
The U.S. Census Bureau plans to use a new disclosure avoidance technique based on differential privacy to protect respondent confidentiality for the 2020 Decennial Census of Population and Housing. Their new technique injects noise based on a number of parameters into published statistics. While the noise injection does protect respondent confidentiality, it achieves the protection at the cost of less accurate data. To better understand the impact that differential privacy has on accuracy, we compare data from the complete-count 1940 Census with multiple differentially private versions of the same data set. We examine the absolute and relative accuracy of population counts in total and by race for multiple geographic levels, and we compare commonly used measures of residential segregation computed from these data sets. We find that accuracy varies by the global privacy-loss budget and the allocation of the privacy-loss budget to geographic levels (e.g., states, counties, enumeration districts) and queries. For measures of segregation, we observe situations where the differentially private data indicate less segregation than the original data and situations where the differentially private data indicate more segregation than the original data. The sensitivity of accuracy to the overall global privacy-loss budget and its allocation highlights the fundamental importance of these policy decisions. Data producers like the U.S. Census Bureau must collaborate with users not only to determine the most useful allocation of the privacy-loss budget across parameters, but also to provide documentation and tools for users to gauge the reliability and validity of statistics from publicly released data products. If they do not, producers may create statistics that are unusable or misleading for the wide variety of use cases that rely on those statistics.
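The relationship between the privacy-loss budget and accuracy can be seen in a minimal Laplace-mechanism sketch. This is not the Census Bureau's TopDown Algorithm (which uses discrete noise and post-processing subject to invariants); it only shows, with hypothetical counts and a hypothetical budget split, how a smaller epsilon means more noise and how small-population areas suffer larger relative error.

```python
# Minimal illustration of epsilon-differentially-private count release via
# the Laplace mechanism.  Counts and budget allocation are hypothetical,
# and this is not the Census Bureau's actual disclosure avoidance system.

import numpy as np

rng = np.random.default_rng(0)

def laplace_count(true_count, epsilon, sensitivity=1.0):
    """Return a noisy count; smaller epsilon -> more noise, more privacy."""
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

county_counts = {"county_A": 15000, "county_B": 320}

total_budget = 1.0                 # global privacy-loss budget (hypothetical)
county_share = 0.5 * total_budget  # portion allocated to the county level

for name, count in county_counts.items():
    noisy = laplace_count(count, county_share)
    print(f"{name}: true={count}, noisy={noisy:.1f}, "
          f"relative error={abs(noisy - count) / count:.3%}")
```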