A Conversation About Differential Privacy

The U.S. Census Bureau is modernizing its methods for avoiding the disclosure of personal information collected through the decennial census. One of the new approaches, Differential Privacy, has generated some confusion and concern among census data users. Read on to learn more.

What is Differential Privacy (DP)?

How will DP affect the 2020 U.S. Census?

Should we be taking some sort of action?

What is Differential Privacy (DP)?

Differential Privacy theory is concerned with preserving the confidentiality and utility of databases filled with personal information. DP applications use mathematical formulas to alter data in ways that allow analysis and tabulation of statistics without revealing personal information about individuals.

What constitutes personal information?

Personal information includes age, race, gender, marital status, residence, citizenship, name, and other physical, social, financial, and legal characteristics of a particular person. Characteristics that might be observed or voluntarily shared elsewhere may still be considered personal information.

Is DP intended to prevent “hacking” of personal information?

Not exactly. Hacking typically involves unauthorized access to information that wasn’t meant for publication or sharing. DP is concerned with a different type of privacy. DP protects privacy by preventing bits of information that are published from being linked like puzzle pieces with other bits that, together, can reveal the identity and characteristics of a specific individual.
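The "puzzle pieces" idea can be illustrated with a small sketch. Everything here is hypothetical: the records, the field names, and the `link` helper are invented for illustration and do not describe any real data set or any method used by the Census Bureau. The sketch shows how a published table with no names can be joined to an outside list that does carry names, whenever a combination of shared characteristics is unique.

```python
# All records below are invented for illustration only.
published_rows = [
    {"block": "A", "age": 34, "sex": "F", "race": "Asian"},
    {"block": "A", "age": 71, "sex": "M", "race": "White"},
    {"block": "B", "age": 34, "sex": "F", "race": "White"},
]
outside_list = [
    {"name": "Resident 1", "block": "A", "age": 34, "sex": "F"},
    {"name": "Resident 2", "block": "B", "age": 34, "sex": "F"},
]

def link(published, outside, keys=("block", "age", "sex")):
    """Return outside records that match exactly one published row on the shared keys."""
    matches = []
    for person in outside:
        hits = [row for row in published if all(row[k] == person[k] for k in keys)]
        if len(hits) == 1:  # a unique match ties a name to the published attributes
            matches.append({**person, **hits[0]})
    return matches

linked = link(published_rows, outside_list)
for record in linked:
    print(record["name"], "->", record["race"])
```

Neither source alone connects a name to a race, but the combination does. DP aims to make this kind of unique match unreliable.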

How is “privacy” defined under DP?

In a DP sense, privacy is achieved if statistical findings from analyzing a database would be essentially unchanged by the inclusion or exclusion of a particular individual's data.
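For readers who want the formal statement, the standard definition from the research literature makes this idea precise by bounding how much any one person's data can change the results. The notation is introduced here and does not appear elsewhere in this article: M is the randomized procedure that produces published statistics, D and D' are two databases that differ only in one individual's record, S is any set of possible outputs, and ε (epsilon) is the privacy parameter.

```latex
\Pr[\, M(D) \in S \,] \;\le\; e^{\varepsilon} \cdot \Pr[\, M(D') \in S \,]
\quad \text{for all neighboring } D, D' \text{ and all output sets } S
```

A smaller ε forces the two probabilities closer together, which means stronger privacy and, in practice, more noise.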

How does DP work in practice?

A DP framework typically involves adding and subtracting random numbers, mostly small, to actual values from a data set. Introducing this random “noise” helps to obscure the original data enough to prevent analysis that might identify or link back to specific individuals.
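A minimal sketch of noise injection follows, using the Laplace mechanism, a common textbook DP technique. This is an assumption chosen for illustration and is not necessarily the mechanism the Census Bureau uses. The area names, counts, and the `epsilon` value are hypothetical, and the sketch assumes that adding or removing one person changes a single count by at most 1.

```python
import random

def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_counts(true_counts, epsilon, rng=None):
    """Add Laplace noise with scale 1/epsilon to each count (sensitivity 1 assumed)."""
    rng = rng or random.Random()
    scale = 1.0 / epsilon
    return {area: count + laplace_noise(scale, rng)
            for area, count in true_counts.items()}

# Hypothetical block-level counts, invented for illustration.
true_counts = {"block_A": 12, "block_B": 3, "block_C": 240}
published = noisy_counts(true_counts, epsilon=0.5, rng=random.Random(42))

for area, value in published.items():
    print(area, "true:", true_counts[area], "noisy:", round(value, 1))
```

Most draws are small, but occasional larger ones occur. The same absolute noise matters far more for a count of 3 than for a count of 240.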

Wouldn’t adding random noise reduce the accuracy of the data?

The goal of a DP framework is to introduce just enough noise to protect privacy while retaining enough accuracy for reliable data analysis.

Accuracy gains cause privacy losses, and vice versa. Striking a desirable balance between privacy and accuracy requires policy deliberation and declarations. The mechanics of DP help to translate those decisions from the policy realm to an applied, mathematical realm.
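The tradeoff can be made concrete under one simplifying assumption: that noise is drawn from a Laplace distribution with scale equal to sensitivity divided by epsilon, as in the textbook Laplace mechanism. In that setting the expected absolute error of a published count equals the noise scale. The function name and the epsilon values below are hypothetical and do not represent Census Bureau settings.

```python
def expected_abs_error(epsilon, sensitivity=1.0):
    """Expected absolute error of one count under Laplace noise.

    For Laplace noise with scale b = sensitivity / epsilon, the mean
    absolute deviation equals b.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon

# Illustrative epsilon values only; smaller epsilon means stronger privacy.
for epsilon in (0.1, 0.5, 1.0, 4.0):
    print("epsilon =", epsilon, "expected error =", expected_abs_error(epsilon))
```

Under these assumptions, an epsilon of 0.5 implies an expected error of 2 persons per count, while an epsilon of 4 implies 0.25. Choosing the value is the policy decision; the formula only translates it into noise.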

Isn’t accuracy of paramount importance when it comes to government statistics?

Statistical agencies such as the U.S. Census Bureau operate under layers of federal statutes and regulations that may not explicitly prioritize or define data accuracy with respect to privacy. In the case of the Census Bureau, Title 13 of the United States Code details the agency’s obligation to provide an accurate count of the population but also mandates that the agency preserve the confidentiality of the information it collects.

How will DP affect the 2020 Census?

The U.S. Census Bureau plans to deploy DP protections that will affect most tabulations of 2020 Census results, including total population counts for all geographic areas below the statewide level.

The Census Bureau also plans to reduce the number of published 2020 Census data tables compared to prior censuses.

How can publishing the total population by geographic area possibly expose the identities of residents?

With modern data science and computing technologies, even seemingly harmless population tabulations may now be combined with other public and private data sets to re-identify individual respondents.

The Census Bureau has applied such techniques to its own data products to test for vulnerabilities in its disclosure avoidance systems. Based on this testing, the Census Bureau has concluded that the privacy protection methods used for decennial census data in the past will no longer fulfill its obligations under Title 13.

How much will the published Census data differ from actual values?

This is still unknown, but the answer will vary across different types of data tables. In general, we expect relatively greater distortions in more detailed tabulations, smaller geographic areas, and smaller population groups.

Recently released demonstration data sets based on 2010 Census data can help us gauge the potential effects of DP for different geographic area types and population groups, including cities and counties in Iowa.

Will 2020 Census data be useful for anything beyond reapportionment?

It is quite possible that some former applications of decennial census data may no longer be practical or advisable. Other uses (and users) of the data may experience little impact from DP.

Early results from testing of the Census Bureau’s methodologies alarmed many data users; however, the Census Bureau continues to refine its methods and incorporate feedback from stakeholders and outside experts.

Much depends on the Census Bureau’s ability to address errors introduced in the “post-processing” phase of data tabulation.

What is “post-processing error”?

After introducing random noise to the census counts through the DP framework, additional adjustments are required to meet traditional expectations of data users. For example, noise injection can result in negative population values for smaller geographic areas or population groups. To eliminate the negative values, population must be reallocated from other areas or groups. Similar steps are required to restore key relationships in the data so that values for population subgroups sum to reported totals. These rebalancing efforts may widen discrepancies from the true census counts. More concerning, the adjustments may also introduce unintentional but systematic biases to the data.
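The adjustments described above can be sketched in simplified form. This is a hypothetical illustration, not the Census Bureau's actual post-processing algorithm: it clamps negative values to zero, rescales the groups so they sum to a reported total, and rounds to whole persons. The function name and the numbers are invented.

```python
def post_process(noisy_values, reported_total):
    """Clamp negatives to zero, rescale to the reported total, round to integers."""
    # Step 1: remove negative populations.
    clamped = [max(0.0, v) for v in noisy_values]
    clamped_sum = sum(clamped)
    # Step 2: rescale so subgroup values sum to the reported total.
    if clamped_sum == 0:
        scaled = [reported_total / len(clamped)] * len(clamped)
    else:
        scaled = [v * reported_total / clamped_sum for v in clamped]
    # Step 3: round down, then hand leftover persons to the largest remainders.
    result = [int(v) for v in scaled]
    shortfall = reported_total - sum(result)
    by_remainder = sorted(range(len(scaled)),
                          key=lambda i: scaled[i] - result[i], reverse=True)
    for i in by_remainder[:shortfall]:
        result[i] += 1
    return result

# Hypothetical noisy subgroup counts and an integer reported total.
noisy = [-2.3, 4.1, 10.6, 1.2]
adjusted = post_process(noisy, reported_total=14)
print(adjusted, "sum =", sum(adjusted))
```

Clamping only ever pushes small values upward, so the rescaling step has to take population away from the other groups. Repeated across many small areas, this one-directional correction is one way a systematic bias can arise.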

This sounds bad. They won’t actually implement DP, will they?

The Census Bureau has committed to, and invested heavily in, the development of DP protections for 2020 Census data.

Is the implementation of DP politically motivated?

The Census Bureau’s privacy obligations were formalized with passage of Title 13 of the U.S. Code in 1954. Plans to introduce DP methods into census data programs began more than a decade ago. Given this history, there is no reason to suspect the implementation of DP is motivated by anything other than privacy protections.

Will the implementation of DP have political consequences?

While the motivation for DP was apolitical, its implementation will almost certainly have political ramifications. Of particular concern is how DP will impact the accuracy of counts for minority population groups, rural areas, and smaller cities.

Another likely source of contention involves policy decisions surrounding the “privacy loss budget.” The privacy loss budget determines both the amount and the allocation of noise introduced into the data. Decisions about the privacy loss budget require prioritization of data uses and tradeoffs among data users. As awareness of DP grows, these decisions could become politicized.
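A rough sketch of how a privacy loss budget might be divided follows. It assumes a total epsilon is split across tables in proportion to policy weights, and that each table then receives Laplace noise with scale 1 divided by its share. The table names, weights, total budget, and the `allocate_budget` helper are all hypothetical and do not reflect actual Census Bureau decisions.

```python
def allocate_budget(total_epsilon, weights):
    """Split a total privacy loss budget across tables in proportion to weights."""
    weight_sum = sum(weights.values())
    return {name: total_epsilon * w / weight_sum for name, w in weights.items()}

# Hypothetical priorities: a larger weight means the table is judged more important.
weights = {"total_population": 5, "age_by_sex": 3, "race_by_ethnicity": 2}
shares = allocate_budget(total_epsilon=1.0, weights=weights)

for table, epsilon in shares.items():
    # Assumed Laplace mechanism: noise scale is 1 / epsilon for counting queries.
    print(table, "epsilon =", round(epsilon, 2), "noise scale =", round(1.0 / epsilon, 2))
```

Because the shares must add up to the fixed total, giving one table more accuracy necessarily means giving another table less. This is the tradeoff among data users that the budget decisions have to resolve.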

The timeline for completing the 2020 Census is emerging as yet another politically charged issue. Without an extension of current deadlines, very little time remains for the transparent resolution of problems with post-processing error, privacy loss budget decisions, and other issues.

Should we be taking some sort of action?

Learn what you can about DP.
Assess the ways in which less accurate Census data might affect your organization.
Submit comments to the Census Bureau explaining the consequences of lost accuracy in the data products you typically use.
Start planning now for ways to adapt in a probable new era of less detailed Census data.

Where can I learn more about issues related to DP and the 2020 Census?

A list of readings and video lectures, for both general and technical audiences, is available here.