This resource aims to support practitioners deploying differential privacy. It endeavors to leave the reader with intuition, responsible guidance, and case studies of the privacy budgets used by organizations today.

Overview

Over the last five years, the use of differential privacy as an output disclosure control for sensitive data releases and queries has grown substantially. This is in part due to the elegant and theoretically robust underpinnings of the differential privacy literature, in part due to the prevalence of attacks on traditional disclosure control techniques, and in part due to the adoption of differential privacy by organizations perceived to set the "gold standard", such as the US Census Bureau, which acts as a form of social proof and gives greater confidence to other early adopters.

As a reference, one way to classify the maturity and readiness of a technology in industry is its technology readiness level (TRL). Systems built with differential privacy guarantees can be found between TRL 6 and 9. In other words, some industry applications of differential privacy have only been demonstrated in relevant environments, while others are actual systems proven in operational environments. As such, finding common ground on privacy deployments is an urgent problem for the DP industry.

The purpose of this document is to support the responsible adoption of differential privacy in industry. Differential privacy, as will be introduced in an upcoming section, is simply a measure of information loss about data subjects or entities. However, there are few guidelines or recommendations for choosing thresholds that strike a reasonable balance between privacy and query accuracy. Furthermore, in many scenarios these thresholds are context-specific, so any organization endeavoring to adopt differential privacy in practice will find their selection to be extremely important.

In this document, we describe some dimensions along which applications of differential privacy can be characterized, and we label many real-world case studies based on the setting in which they are deployed and the privacy budgets chosen. While this is not intended to act as an endorsement of any application, we hope the document will act as a baseline from which informed debate, precedent and, eventually, best practices can emerge.

Core to this document is a registry of case studies presented at the end. Much of the work of identifying these initial case studies is due to great prior work from personal blogs, government and NGO guides. Despite this pre-existing work, the motivation of this document lies in expanding the number and classification of these case studies in an open-source fashion, such that the community as a whole can contribute and shape a shared understanding.

If, on the other hand, the reader is more interested in an introduction to differential privacy, there are excellent resources available such as books and papers, online lecture notes and websites. While this document introduces some of the nomenclature of differential privacy, it is not intended to be a standalone resource and will refer to common techniques and mechanisms only with references where the reader can learn more.

Finally, and importantly, this document is not intended to be static. One core purpose behind the document is to periodically add new case studies, keeping up with the ever-evolving practices of industry and government applications and aligning with guidance from regulators, which is expected to become more prevalent in the coming years. If you would like to join the authors of this document and support the registry, please head over to the Contribute page.

Official Guidance and Standardization

Before diving into the main document, it is important to note that two prominent standardization bodies, NIST and ISO/IEC, have been active in providing guidance and standardization in the space of data anonymization, and in particular differential privacy.

ISO/IEC 20889:2018: This standard by the ISO/IEC focuses broadly on de-identification techniques, including synthetic data and randomization techniques. Although the standard is normative in part, differential privacy is introduced as a formal privacy measure in the style of an informative standard. Only \(\epsilon\)-differential privacy is considered, with the Laplace, Gaussian and Exponential mechanisms and the concept of cumulative privacy loss. Interestingly, despite Gaussian noise typically being associated with \((\epsilon, \delta)\)-differential privacy and zero-concentrated differential privacy, as will be introduced in later sections, these more nuanced privacy models are not defined.

NIST SP 800-226 ipd: This guidance paper extends far beyond ISO/IEC 20889:2018, considering multiple privacy models, the conversion between privacy models, basic mechanisms, threat models in terms of the local and central models, and more. It is an excellent resource for understanding the nomenclature, security model and goals of applying differential privacy in practice. Throughout this document we endeavor to align our terminology with the NIST guidance paper, leaving formal definitions to the original source.

While the aforementioned resources are useful, neither explicitly provides guidelines on how to choose a reasonable parameterization of differential privacy models in terms of privacy budgets, nor do they point to public benchmarks to help the community arrive at industry norms over the medium to long term. In the case of ISO/IEC 20889:2018, the definitions are also limited to the most standard cases, which are often an oversimplification for real-world applications. Over the course of this document, and where applicable, we will link to the terminology of the standard to provide a level of consistency for the reader.

Introduction to Differential Privacy

Randomized Response Surveys

Before the age of big data and data science, traditional data collection faced a challenge known as evasive answer bias: people not answering survey questions honestly for fear that their answers may be used against them. Randomized response emerged in the mid-twentieth century to address this.

Randomized response is a technique to protect the privacy of individuals in surveys. It involves adding noise locally, for example by flipping a coin multiple times and assigning the individual's recorded response based on the coin-flip sequence. In doing so, the responses are accurate in expectation, but any given response is uncertain. This uncertainty over the response of an individual was one of the first applications of differential privacy, although it was not called that at the time, and the quantification of privacy was simply the weighting of probabilities determined by the mechanism.

An example of using a conditional coin-flip to achieve plausible deniability with a calibrated bias.

This approach of randomizing the output of the answer to a question through a mechanism (a stochastic intervention such as coin flipping) is still the very backbone of differential privacy today.
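As a concrete illustration, the following minimal sketch simulates one common variant of the protocol (an assumption for illustration: answer truthfully on a first heads, otherwise let a second flip decide the answer) and shows that individual responses are deniable while the population-level estimate can still be recovered.

```python
import numpy as np

rng = np.random.default_rng(0)

def randomized_response(true_answer: bool) -> bool:
    """Two-coin randomized response: answer truthfully on heads,
    otherwise let a second flip decide the reported answer."""
    if rng.random() < 0.5:        # first coin: tell the truth
        return bool(true_answer)
    return rng.random() < 0.5     # second coin: random answer

# Survey where 30% of respondents truly hold the sensitive attribute.
n = 100_000
truth = rng.random(n) < 0.30
reported = np.array([randomized_response(t) for t in truth])

# Debias the aggregate: P(report yes) = 0.5 * p_true + 0.25.
estimate = 2 * (reported.mean() - 0.25)
print(f"reported rate: {reported.mean():.3f}, debiased estimate: {estimate:.3f}")
```

Any single reported answer could have come from the second coin, yet the debiased aggregate converges to the true proportion as the number of respondents grows.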

ε-Differential Privacy

Pure epsilon-differential privacy (\(\epsilon\)-DP) is a mathematical guarantee that allows sharing aggregated statistics about a dataset while protecting the privacy of individuals by adding random noise. In simpler words, it ensures that the outcome of any analysis is nearly the same regardless of whether any individual's data is included in or removed from the dataset.

Formally, the privacy guarantee is quantified using the privacy parameter \(\epsilon\) (epsilon). A randomized algorithm \(M\) is \(\epsilon\)-differentially private if for all neighboring datasets \(D_1\) and \(D_2\) (differing in at most one element), and for all subsets of outputs \(S \subseteq \text{Range}(M)\):


\[ \Pr[M(D_1) \in S] \leq e^{\epsilon} \cdot \Pr[M(D_2) \in S] \]

The mechanism \(M\) adds an amount of noise calibrated by \(\epsilon\), generating outputs with a certain error relative to the true value, which can be explored with the following interactive widget.

Randomized Response was ε-Differential Privacy

Despite randomized response surveys predating the formal definition of differential privacy by over 40 years, the technique translates directly to the binary mechanism in modern differential privacy.

Suppose you wish to set up the spinner originally proposed in to achieve \(\epsilon\)-differential privacy; you can do so by asking participants to tell the truth with probability \(\frac{e^{\frac{\epsilon}{2}}}{1 + e^{\frac{\epsilon}{2}}}\). This is called the binary mechanism in the literature.

This mechanism is incredibly useful for building intuition among a non-technical audience. The most direct question we can pose about a data subject in a dataset is simply "Is Alice in this dataset?". Answering the question with different levels of privacy \(\epsilon\) yields different probabilities of telling the truth, which we display as follows.

An interactive table showing, for each value of \(\epsilon\), the corresponding probability of truth and odds of truth.

While the above odds are just an illustrative example, they bring home what epsilon actually means in terms of the more intuitive original randomized response. As a reference, theorists often advocate for \(\epsilon \approx 1\) for differential privacy guarantees to provide a meaningful privacy assurance.
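For concreteness, the short sketch below tabulates the truth probability and odds implied by the formula above for a few values of \(\epsilon\); the specific \(\epsilon\) values chosen here are illustrative and may differ from the rows shown in the interactive table.

```python
import math

def truth_probability(epsilon: float) -> float:
    """Probability of answering truthfully under the binary mechanism above."""
    return math.exp(epsilon / 2) / (1 + math.exp(epsilon / 2))

for eps in (0.1, 0.5, 1.0, 2.0, 5.0):
    p = truth_probability(eps)
    odds = p / (1 - p)  # odds of telling the truth versus lying
    print(f"epsilon = {eps:>3}: P(truth) = {p:.3f}, odds = {odds:.2f} : 1")
```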

Intuition of the Laplace Mechanism

One of the most ubiquitous mechanisms in ε-differential privacy is the Laplace mechanism. It is used when adding bounded values together, such as counts or summations of private values, provided the extreme values (usually referred to as bounds) of the private values are known and hence the maximum contribution of any data subject is bounded.

Essentially, the sum is calculated, a draw from the Laplace distribution (with scale calibrated to the sensitivity divided by \(\epsilon\)) is made, and the resulting random variable is added to the original result. Assuming you are counting, such that every value lies in \([0, 1]\), the widget below shows how the distribution of noise and the expected error change with varying \(\epsilon\).

Note that the error is additive, so we can make claims about the absolute error, but not the relative error, of the final stochastic result.
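A minimal sketch of a differentially private count using the Laplace mechanism is shown below; it assumes each data subject contributes at most one record, so the sensitivity of the count is 1 and the noise scale is \(1/\epsilon\).

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_count(indicators: np.ndarray, epsilon: float) -> float:
    """Laplace mechanism for a counting query.

    Each data subject contributes at most one record, so the sensitivity
    of the count is 1 and the Laplace noise scale is 1 / epsilon.
    """
    true_count = float(indicators.sum())
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

data = rng.integers(0, 2, size=10_000)  # one 0/1 indicator per data subject
print("true count:", int(data.sum()))
print("dp count (epsilon = 1.0):", round(dp_count(data, epsilon=1.0), 1))
```

The expected absolute error of such a release is \(1/\epsilon\) irrespective of the dataset size, which is exactly why only absolute (and not relative) error claims can be made.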

(ε, δ)-Differential Privacy

(ε, δ)-differential privacy is a mathematical guarantee that extends pure epsilon-differential privacy by allowing a small probability of failure, governed by a second privacy parameter \(\delta\). Just like pure DP described in the previous section, it ensures that the outcome of any analysis is nearly the same regardless of whether any individual's data is present, but it further allows a (cryptographically) small chance that the pure guarantee does not hold.

Formally, the privacy guarantee is now quantified using both \(\epsilon\) (epsilon) and \(\delta\) (delta). A randomized algorithm \(M\) is \((\epsilon, \delta)\)-differentially private if for all neighboring datasets \(D_1\) and \(D_2\) (differing in at most one element), and for all subsets of outputs \(S \subseteq \text{Range}(M)\):


\[ \Pr[M(D_1) \in S] \leq e^{\epsilon} \cdot \Pr[M(D_2) \in S] + \delta \]

The following widget describes the expected error for noise added under \((\epsilon, \delta)\)-DP.

Intuition of (ε, δ)-Differential Privacy
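As one concrete point of reference alongside the widget, a minimal sketch of the Gaussian mechanism is given below. It assumes the classical textbook calibration \(\sigma = \Delta \sqrt{2 \ln(1.25/\delta)} / \epsilon\), which is valid for \(\epsilon \leq 1\); tighter analytic calibrations exist and real deployments may use them instead.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def gaussian_mechanism(true_value: float, sensitivity: float,
                       epsilon: float, delta: float) -> float:
    """Classical Gaussian mechanism for (epsilon, delta)-DP.

    Textbook calibration: sigma = sensitivity * sqrt(2 * ln(1.25 / delta)) / epsilon,
    valid for epsilon <= 1. Tighter analytic calibrations exist.
    """
    sigma = sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon
    return true_value + rng.normal(loc=0.0, scale=sigma)

# A count (sensitivity 1) released with epsilon = 0.5 and delta = 1e-6.
print(gaussian_mechanism(true_value=1234.0, sensitivity=1.0, epsilon=0.5, delta=1e-6))
```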

Zero-Concentrated Differential Privacy

Zero-concentrated differential privacy (also known as zCDP) introduces a parameter \(\rho\) (rho) that measures how the privacy loss concentrates around its expected value, allowing for more accurate control of privacy degradation across repeated analyses. As such, zCDP is beneficial in applications requiring multiple queries or iterative data use, some of which we describe in later sections on interactive and periodical releases.


Formally, a randomized algorithm \(M\) satisfies \(\rho\)-zCDP if, for all neighboring datasets \(D_1\) and \(D_2\) (differing in at most one element) and for all \(\alpha \in (1, \infty)\), the following holds, where \(D_{\alpha}\) denotes the Rényi divergence of order \(\alpha\):

\[ D_{\alpha}(M(D_1) \parallel M(D_2)) \leq \rho \alpha \]
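To see why zCDP is convenient for repeated analyses, the sketch below uses two standard results from the zCDP literature: a Gaussian release with sensitivity \(\Delta\) and noise scale \(\sigma\) satisfies \(\rho\)-zCDP with \(\rho = \Delta^2 / (2\sigma^2)\), and \(\rho\) parameters simply add under sequential composition. The specific numbers are only illustrative.

```python
def gaussian_zcdp_rho(sensitivity: float, sigma: float) -> float:
    """A Gaussian release with this sensitivity and noise scale satisfies
    rho-zCDP with rho = sensitivity^2 / (2 * sigma^2)."""
    return sensitivity ** 2 / (2 * sigma ** 2)

def compose_zcdp(rhos):
    """Under zCDP, the rho parameters of sequential releases simply add."""
    return sum(rhos)

# Ten Gaussian releases of a sensitivity-1 statistic with sigma = 5.
per_release = gaussian_zcdp_rho(sensitivity=1.0, sigma=5.0)
print("rho per release:", per_release)                                   # 0.02
print("total rho after 10 releases:", compose_zcdp([per_release] * 10))  # 0.2
```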

Defining the Trust Model

The Local Model

The local model in differential privacy, as defined in the ISO/IEC standard, is a threat model that provides strong privacy guarantees before data is collected by a central entity. In this model, each user adds noise to their own data locally (for example, on their own phone or laptop) before it is sent to a processing server. This ensures their privacy is protected even if the data is intercepted in transit, or in case they do not place trust in the central curator.

Since the noise is added very early in the pipeline, local differential privacy trades off usability and accuracy for stronger individual privacy guarantees. This means that while each user's data is protected even before it reaches the central server, the aggregated results may be less accurate than under global differential privacy, where noise is added after data aggregation.

In local differential privacy, each data subject applies randomization as a disclosure control locally before sharing their outputs with the central aggregator.
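To make the trade-off concrete, here is a minimal sketch of a local-model deployment, reusing the binary-mechanism truth probability \(e^{\epsilon/2}/(1 + e^{\epsilon/2})\) from earlier: each data subject perturbs their own bit before sharing it, and the aggregator debiases the noisy reports. The attribute rate and \(\epsilon\) used are illustrative assumptions.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

EPSILON = 1.0
P_TRUTH = math.exp(EPSILON / 2) / (1 + math.exp(EPSILON / 2))

def local_randomize(true_bit: bool) -> bool:
    """Each data subject perturbs their own bit before it leaves their device."""
    return bool(true_bit) if rng.random() < P_TRUTH else not true_bit

def debias(reports: np.ndarray) -> float:
    """Aggregator's unbiased estimate of the true proportion of 1s.
    E[reported rate] = P_TRUTH * p + (1 - P_TRUTH) * (1 - p)."""
    return (reports.mean() - (1 - P_TRUTH)) / (2 * P_TRUTH - 1)

truth = rng.random(50_000) < 0.30           # 30% truly hold the attribute
reports = np.array([local_randomize(t) for t in truth])
print(f"reported rate: {reports.mean():.3f}, debiased estimate: {debias(reports):.3f}")
```

The estimate is unbiased but noticeably noisier than a central-model release at the same \(\epsilon\), which is exactly the accuracy cost described above.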

The Central Model

In contrast to the previous section, the central model refers to the setting where the privacy mechanisms are applied centrally, after data collection. In this model, individuals provide raw data and place their trust in the curator, who is expected to add privacy protections in downstream tasks. This is often referred to as the global model or the server model, as defined in the ISO/IEC standard.

In global differential privacy, each data subject shares their private information with the trusted aggregator. Randomization is applied as a disclosure control prior to broader dissemination.

Trusted vs Adversarial Curator

When we define a threat model, we mainly focus on how much trust we place in the curator. A trusted curator is assumed to apply DP correctly, while an adversarial curator may (and we assume it always does) attempt to breach privacy.

As such, these concepts are strongly related to the locality of our DP model, previously defined as local and global DP. In the local DP protocol, we place no trust in the central curator, so we can perfectly well accept a model in which the curator is adversarial, since the privacy guarantees are put in place by each user locally. For global DP, on the other hand, we expect the curator to put these privacy guarantees in place, and as such we place all of our trust in them.

Static vs Interactive Releases

These two concepts refer to how often we publish DP statistics. A static release involves publishing a single release with no further interactions, whilst interactive releases repeat the process, for example by allowing multiple queries on the dataset. Static releases are simpler; interactive releases can offer additional utility but require more careful accounting of the privacy loss of each query due to composition.
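For interactive releases, a common pattern is a simple privacy budget accountant. The sketch below assumes sequential composition of pure \(\epsilon\)-DP queries (where the \(\epsilon\) values of answered queries add up) and is not tied to any particular library; real systems typically use tighter accounting, for example via zCDP.

```python
class BudgetAccountant:
    """Tracks cumulative privacy loss for an interactive release,
    assuming sequential composition of pure epsilon-DP queries."""

    def __init__(self, total_epsilon: float):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def charge(self, query_epsilon: float) -> None:
        """Reserve budget for one query; refuse it if the budget would be exceeded."""
        if self.spent + query_epsilon > self.total_epsilon:
            raise RuntimeError("Privacy budget exhausted; query refused.")
        self.spent += query_epsilon

accountant = BudgetAccountant(total_epsilon=1.0)
accountant.charge(0.25)   # first query
accountant.charge(0.25)   # second query
print("remaining budget:", accountant.total_epsilon - accountant.spent)
```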

Event, Group and Entity Privacy

An important aspect of differential privacy is defining what it is we are endeavoring to protect. Ultimately, we are usually trying to protect the atomic data subjects of a dataset: people, businesses, entities. However, depending on the dataset itself, rows of the data table may refer to different things, and individual subjects may have a causal effect on more than one record.

Event-level privacy, as described in , refers to a setting where we protect the rows of a dataset. Each row might pertain to a single data subject in its entirety, or to a single event such as a credit card transaction.

Group privacy refers to settings where we have multiple data subjects who are linked in some manner, such that we care about hiding the contribution of the group. An example of this might be a household in the setting of a census. Finally, there is entity-level privacy. Similar to group-level privacy, this is when multiple records can be linked to a single entity; an example would be credit card transactions. One data subject may have zero or many transactions associated with them, so in order to protect the privacy of the entity we need to limit the effect of all records associated with each entity.
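A useful rule of thumb connects individual-level and group-level protection under pure DP: if a mechanism \(M\) is \(\epsilon\)-differentially private with respect to individual records, then for two datasets \(D_1\) and \(D_2\) differing in the records of a group of \(k\) data subjects,

\[ \Pr[M(D_1) \in S] \leq e^{k\epsilon} \cdot \Pr[M(D_2) \in S] \]

so protecting a group or entity of size \(k\) at a target level \(\epsilon\) effectively requires running the mechanism at level \(\epsilon / k\), or equivalently bounding each entity's total contribution.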

From a technical perspective, the mechanics of the tooling to deal with groups and entities are the same so their terminology is often used interchangeably.

Multiple Parties and Collusions

Involving multiple parties in DP releases requires additional accounting of the privacy budget. Similarly to how we described an adversarial curator, we now focus on a group of analysts who could adversarially collude against the DP release.

As a more practical example, collusion typically refers to an environment where each of several analysts is allowed a set privacy budget, but they collaborate with each other to leverage composition and derive information about the dataset that is only protected by a larger (worse) epsilon, thus breaking the intended privacy budget allocated to each of them.

Periodical Releases

This concept, often also related to the "continual observation" area of study, involves producing multiple differentially private releases of a dataset that changes periodically. Achieving this can be challenging, as each release must be carefully accounted for in the privacy budget; organizations that allow DP analysis of continually updated datasets, such as some of those present in our table, are mindful to set budgets at both the user and time level.

Differential Privacy Deployment Registry

Public Registry

The following table presents multiple systems with publicly advertised differential privacy parameters. We also generated estimates of their equivalent parameters in other DP variants, along with collecting their respective sources. You can click on each entry to display a modal with more detailed information.

Note: Values marked with (*) were generated by us using this formula to convert from pure DP to zCDP and this formula to convert from zCDP to approximate DP, with ε = 1.
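For readers who want to reproduce similar estimates, the sketch below implements the conversions most commonly cited in the zCDP literature (\(\epsilon\)-DP implies \((\epsilon^2/2)\)-zCDP, and \(\rho\)-zCDP implies \((\rho + 2\sqrt{\rho \ln(1/\delta)}, \delta)\)-DP); these are standard results and may differ from the exact formulas linked above.

```python
import math

def pure_dp_to_zcdp(epsilon: float) -> float:
    """epsilon-DP implies (epsilon^2 / 2)-zCDP."""
    return epsilon ** 2 / 2

def zcdp_to_approx_dp(rho: float, delta: float) -> float:
    """rho-zCDP implies (rho + 2 * sqrt(rho * ln(1 / delta)), delta)-DP."""
    return rho + 2 * math.sqrt(rho * math.log(1 / delta))

rho = pure_dp_to_zcdp(1.0)                       # epsilon = 1
eps_approx = zcdp_to_approx_dp(rho, delta=1e-6)
print(f"epsilon = 1 pure DP -> rho = {rho:.2f} zCDP -> ({eps_approx:.2f}, 1e-6)-DP")
```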

Private Use Cases

The table below lists private applications of differential privacy where parameters are not publicly disclosed. Click on each entry to view a modal with more detailed information.