The Census Bureau today announced the final framework for implementing its privacy algorithm (aka “disclosure avoidance” or “differential privacy”) on the 2020 census data. The specific settings (indicated below) determine the balance between accuracy and privacy in the data; in short, they determine how much statistical “noise” will be injected into the data. This decision comes after several notable criticisms of demonstration data that had been treated with the algorithm.
Quick Info for Statistical Professionals: The latest set of demonstration data was set at an epsilon of 12.2. Today, the bureau’s Data Stewardship Executive Policy Committee chose an epsilon of 19.61. Read the press release below for more details.
Press Release
JUNE 9, 2021 – The U.S. Census Bureau’s Data Stewardship Executive Policy Committee (DSEP) announced it has selected the settings and parameters for the Disclosure Avoidance System (DAS) for the 2020 Census redistricting data (P.L. 94-171). The DAS uses a mathematical algorithm to ensure that the privacy of individuals is sufficiently protected while maintaining high levels of accuracy in the statistics we produce.
The Census Bureau released the first “beta” version of the DAS in October 2019, and released further demonstration data products in May, September, and November 2020, and in April 2021. During this process, independent experts and stakeholders, along with data users, have provided extensive feedback to help shape each subsequent test product and to inform the decisions.
After reviewing feedback from the data user community regarding the April 2021 demonstration data, the committee approved a revised algorithm that makes notable improvements in the accuracy of the population counts for places, Minor Civil Divisions, American Indian and Alaska Native tribal areas, and for race and ethnicity statistics, and ensures the accuracy of data necessary for redistricting and Voting Rights Act enforcement.
The approved DAS production settings reflect a total privacy-loss budget for the redistricting data product (represented by “ε,” the Greek letter “epsilon”) of ε=19.61, which includes ε=17.14 for the persons file and ε=2.47 for the housing unit data. The increase in the privacy-loss budget over the level used for the April 2021 demonstration data, which will lead to less noise infusion than in those data, was primarily allocated to the total population and race by ethnicity queries at the block group level and above.
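The budget arithmetic above can be checked in a few lines. This is only a sketch: the three figures come from the announcement, but the variable and function names are made up for illustration, and treating the total as a plain sum of the two components is an assumption based on the numbers adding up.

```python
# Privacy-loss budget figures quoted in the announcement.
EPS_PERSONS = 17.14     # persons file
EPS_HOUSING = 2.47      # housing unit data
EPS_APRIL_DEMO = 12.2   # April 2021 demonstration data


def total_budget(*parts: float) -> float:
    """Sum component budgets (assumed additive), rounded to two decimals."""
    return round(sum(parts), 2)


total = total_budget(EPS_PERSONS, EPS_HOUSING)
ratio = total / EPS_APRIL_DEMO

print(f"total epsilon = {total}")
print(f"ratio to April 2021 demonstration budget = {ratio:.2f}")
```

Under that assumption the components reproduce the announced total of 19.61, which is about 1.6 times the April 2021 demonstration budget.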
Our Disclosure Avoidance team will use these parameters to prepare the TopDown Algorithm for final system integration testing in anticipation of the DAS application phase of our data processing and related quality assurance checks that will begin later this month. The data will be run and quality checked multiple times prior to release; these are further steps in the process that will culminate in the states receiving the final redistricting numbers by August 16.
“The decisions strike the best balance between the need to release detailed, usable statistics from the 2020 Census with our statutory responsibility to protect the privacy of individuals’ data,” said Ron Jarmin, acting director of the U.S. Census Bureau. “They were made after many years of research and candid feedback from data users and outside experts – whom we thank for their invaluable input.”
The 2020 DAS algorithm injects carefully calibrated statistical “noise” to obscure individual data responses. The 2010 and other recent censuses also injected statistical noise into the data, but in a less precise and more ad hoc manner, primarily using a data-swapping methodology. Recent research has confirmed that today’s superior computational technologies have rendered the methods used in 2010 and earlier censuses ineffective against reidentification attacks. The Census Bureau’s recent blog, Modernizing Privacy Protections for the 2020 Census: Next Steps, discusses the privacy challenges that led to the change.
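To give a feel for how calibrated noise trades accuracy against privacy, here is a minimal sketch of the textbook Laplace mechanism applied to a single count. It is an illustration only and is not the Census Bureau's production code: the DAS may use a different noise distribution, it splits the global budget across many queries and geographic levels, and it post-processes the results. The function names, the sensitivity of 1, and the epsilon values are all assumptions chosen for the example.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count: int, epsilon: float, rng: random.Random,
                sensitivity: float = 1.0) -> float:
    """Textbook Laplace mechanism: noise scale is sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def mean_abs_error(true_count: int, epsilon: float,
                   trials: int = 20000, seed: int = 0) -> float:
    """Average absolute error of the noisy count over repeated draws."""
    rng = random.Random(seed)
    total = sum(abs(noisy_count(true_count, epsilon, rng) - true_count)
                for _ in range(trials))
    return total / trials


# Hypothetical per-query epsilons; larger epsilon means less noise.
for eps in (0.5, 1.0, 2.0):
    print(f"epsilon={eps}: mean absolute error ~ {mean_abs_error(100, eps):.3f}")
```

For this mechanism the expected absolute error is sensitivity / epsilon, so doubling a query's share of the budget roughly halves its typical error. That is the sense in which a larger privacy-loss budget means less noise.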
The chosen global privacy-loss budget of ε=19.61 is substantially higher (about 1.6 times) than the ε=12.2 budget used in the April 2021 demonstration data. In making its decisions, DSEP gave significant consideration to the feedback we received from our data users who analyzed the April 2021 demonstration data. That feedback, and steps taken to address those comments, include the following:
- Stakeholders identified a regression in the accuracy of data for tribal geographies and other off-spine geographies. The DAS team made changes to the ‘optimized spine’ to address these concerns; those changes were integrated into the spine that was approved by DSEP.
- Stakeholders identified several measures of bias in the summary metrics that they indicated were areas of concern. In particular, stakeholders raised concerns about both geographic bias (i.e., the accuracy of population counts differing between larger and smaller geographies) and characteristic bias (counts for racially or ethnically diverse geographies differing in accuracy from those for more racially or ethnically homogeneous areas). The DAS team made changes to the post-processing system parameters to address these concerns; those changes were integrated into the parameters that were approved by DSEP.
- Data users identified a need for more accuracy in race and ethnicity statistics at many levels of geography. The DAS team addressed those concerns by allocating additional privacy-loss budget to the race and ethnicity queries at various levels of geography; those changes were integrated into the global privacy-loss budget and privacy-loss budget allocations that were approved by DSEP.
- Data users identified a need for more accuracy at the place, Minor Civil Division, and tract levels. The DAS team addressed these concerns both through changes to the optimized geographic spine and through allocation of privacy-loss budget; those changes were integrated into the privacy-loss budget allocations and system parameters that were approved by DSEP.
- Data users identified a need for more accurate statistics on occupancy rates at the block group and higher levels of geography. The DAS team addressed those concerns by allocating additional privacy-loss budget to the housing unit data; that change was integrated into the global privacy-loss budget and privacy-loss budget allocations that were approved by DSEP.
These improvements – as well as other adjustments to the system – were then verified against a broad suite of accuracy measures to ensure that they successfully addressed the feedback we received. We are not able to satisfy all stakeholder feedback. For example, some data users recommended nearly perfect accuracy in block-level data, which we are unable to achieve because it would undermine the ability to implement a functional disclosure avoidance system. We are both legally and ethically bound to protect the privacy of the data provided by and on behalf of our respondents.
In September, the Census Bureau anticipates releasing a final set of demonstration data that applies the privacy-loss budget and settings from today’s decisions to the 2010 Census P.L. 94-171 redistricting data. Demonstration data allow data users to compare a DAS-protected version of 2010 Census results with the published 2010 Census results.
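A data user comparing the two versions might start with simple error summaries per geography. The sketch below assumes each file has been reduced to a mapping from a geography identifier to a population count. The identifiers and counts are invented toy values, not real census figures, and the function name is hypothetical.

```python
from statistics import mean


def accuracy_summary(published: dict[str, int],
                     protected: dict[str, int]) -> dict[str, float]:
    """Compare DAS-protected counts with published counts on shared geographies."""
    shared = sorted(published.keys() & protected.keys())
    if not shared:
        raise ValueError("no geographies in common")
    errors = [protected[g] - published[g] for g in shared]
    abs_errors = [abs(e) for e in errors]
    return {
        "mean_error": mean(errors),               # signed; far from 0 suggests bias
        "mean_absolute_error": mean(abs_errors),  # typical size of the noise
        "max_absolute_error": max(abs_errors),    # worst-case geography
    }


# Toy values for illustration only (not real census data).
published = {"tract_A": 4120, "tract_B": 3875, "tract_C": 5010}
protected = {"tract_A": 4117, "tract_B": 3880, "tract_C": 5010}

summary = accuracy_summary(published, protected)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Running the same summary separately for small and large geographies, or for more and less diverse areas, is one way to look for the geographic and characteristic bias that stakeholders raised.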
The Census Bureau will also release the DAS production code base. This is a benefit of this census’s algorithm-based system: unlike the confidential swapping methods used in previous censuses, the 2020 DAS algorithm allows this level of transparency without risking the exposure of protected data.
Details of the settings and technical parameters for the 2020 DAS will be shared in the coming weeks. Background information is available at census.gov.