The Census Bureau is mandated by the U.S. Constitution to complete a count of the population every decade. Few realize, however, that Title 13, Sec. 9 of the U.S. Code also requires the Bureau to “keep personally identifiable information confidential for 72 years.”
With the growth of Big Data, fulfilling this privacy mandate has become much more complicated, thanks to “database reconstruction,” a method of partially reconstructing a private dataset from public aggregate information. Consider the well-known example below of how one data scientist recovered former Governor William Weld’s medical history from supposedly anonymized data released by the Massachusetts Group Insurance Commission (GIC) for academic research:
“At the time GIC released the data, William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting identifiers. In response, then-graduate student Sweeney started hunting for the Governor’s hospital records in the GIC data. She knew that Governor Weld resided in Cambridge, Massachusetts, a city of 54,000 residents and seven ZIP codes. For twenty dollars, she purchased the complete voter rolls from the city of Cambridge, a database containing, among other things, the name, address, ZIP code, birth date, and sex of every voter. By combining this data with the GIC records, Sweeney found Governor Weld with ease. Only six people in Cambridge shared his birth date, only three of them men, and of them, only he lived in his ZIP code. In a theatrical flourish, Dr. Sweeney sent the Governor’s health records (which included diagnoses and prescriptions) to his office.”
Nate Anderson, “‘Anonymized’ data really isn’t—and here’s why not,” Ars Technica (2009).
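To make the mechanics of the story concrete, here is a minimal sketch of that kind of linkage in Python. Everything in it is an assumption made for illustration: the records are invented toy data, and the field names (`zip`, `birth_date`, `sex`, `diagnosis`) and the helper `link_records` are hypothetical, not taken from the GIC release or from Sweeney's actual method. The idea is simply to join a public list that has names with a "de-identified" list that does not, using the fields both lists share.

```python
from collections import defaultdict

# Entirely made-up toy data for illustration; no real people or records.
voter_roll = [
    {"name": "Voter A", "zip": "02138", "birth_date": "1950-01-01", "sex": "M"},
    {"name": "Voter B", "zip": "02139", "birth_date": "1950-01-01", "sex": "M"},
    {"name": "Voter C", "zip": "02138", "birth_date": "1950-01-01", "sex": "F"},
    {"name": "Voter D", "zip": "02140", "birth_date": "1962-07-15", "sex": "F"},
]

# "De-identified" records: names removed, quasi-identifiers left in place.
deidentified_records = [
    {"zip": "02138", "birth_date": "1950-01-01", "sex": "M", "diagnosis": "condition X"},
    {"zip": "02140", "birth_date": "1962-07-15", "sex": "F", "diagnosis": "condition Y"},
    {"zip": "02139", "birth_date": "1971-03-09", "sex": "M", "diagnosis": "condition Z"},
]

# Fields that appear in both datasets (an assumption for this sketch).
QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")


def link_records(public_rows, private_rows, keys=QUASI_IDENTIFIERS):
    """Return {name: diagnosis} for private rows matching exactly one public row."""
    # Index the public list by its combination of quasi-identifiers.
    index = defaultdict(list)
    for row in public_rows:
        index[tuple(row[k] for k in keys)].append(row["name"])

    matches = {}
    for row in private_rows:
        names = index.get(tuple(row[k] for k in keys), [])
        # Only a unique match re-identifies someone with certainty.
        if len(names) == 1:
            matches[names[0]] = row["diagnosis"]
    return matches


reidentified = link_records(voter_roll, deidentified_records)
print(reidentified)
```

In this toy example, two of the three "anonymous" records are tied back to a name, even though three voters share the same birth date. The combination of ZIP code, birth date, and sex narrows each match to a single person, which is the same narrowing described in the quote above.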
The Census Bureau’s old techniques for avoiding disclosure of personal data are no match for these reconstruction methods, which are powered by big data and seemingly unlimited computing power. Differential privacy is a technique that many large tech companies already use to combat this problem, and the Census Bureau is also planning to apply it to its data products.
What does this mean? The answer is complicated, but first here is a primer on just what differential privacy is and the trade-offs involved when using it.