# Differential privacy

Differential privacy is a system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset. The idea behind differential privacy is that if the effect of making an arbitrary single substitution in the database is small enough, the query result cannot be used to infer much about any single individual, and therefore provides privacy. Another way to describe differential privacy is as a constraint on the algorithms used to publish aggregate information about a statistical database which limits the disclosure of private information of records whose information is in the database. For example, differentially private algorithms are used by some government agencies to publish demographic information or other statistical aggregates while ensuring confidentiality of survey responses, and by companies to collect information about user behavior while controlling what is visible even to internal analysts.

Roughly, an algorithm is differentially private if an observer seeing its output cannot tell whether a particular individual's information was used in the computation. Differential privacy is often discussed in the context of identifying individuals whose information may be in a database. Although it does not directly refer to identification and reidentification attacks, differentially private algorithms provably resist such attacks.[1]

Differential privacy was developed by cryptographers and thus is often associated with cryptography, and draws much of its language from cryptography.

## History

Official statistics organizations are charged with collecting information from individuals or establishments, and publishing aggregate data to serve the public interest. For example, the 1790 United States Census collected information about individuals living in the United States and published tabulations based on sex, age, race, and condition of servitude. Statistical organizations have long collected information under a promise of confidentiality that the information provided will be used for statistical purposes, but that the publications will not produce information that can be traced back to a specific individual or establishment. To accomplish this goal, statistical organizations have long suppressed information in their publications. For example, in a table presenting the sales of each business in a town grouped by business category, a cell that has information from only one company might be suppressed, in order to maintain the confidentiality of that company's specific sales.

The adoption of electronic information processing systems by statistical agencies in the 1950s and 1960s dramatically increased the number of tables that a statistical organization could produce and, in so doing, significantly increased the potential for an improper disclosure of confidential information. For example, if a business that had its sales numbers suppressed also had those numbers appear in the total sales of a region, then it might be possible to determine the suppressed value by subtracting the other sales from that total. There might also be other combinations of additions and subtractions that cause private information to be revealed. The number of combinations that need to be checked increases exponentially with the number of publications, and it is potentially unbounded if data users are able to make queries of the statistical database using an interactive query system.

In 1977, Tore Dalenius formalized the mathematics of cell suppression.[2]

In 1979, Dorothy Denning, Peter J. Denning and Mayer D. Schwartz formalized the concept of a Tracker, an adversary that could learn the confidential contents of a statistical database by creating a series of targeted queries and remembering the results.[3] This and future research showed that privacy properties in a database could only be preserved by considering each new query in light of (possibly all) previous queries. This line of work is sometimes called query privacy, with the final result being that tracking the impact of a query on the privacy of individuals in the database was NP-hard.

In 2003, Kobbi Nissim and Irit Dinur demonstrated that it is impossible to publish arbitrary queries on a private statistical database without revealing some amount of private information, and that the entire information content of the database can be revealed by publishing the results of a surprisingly small number of random queries—far fewer than was implied by previous work.[4] The general phenomenon is known as the Fundamental Law of Information Recovery, and its key insight, namely that in the most general case, privacy cannot be protected without injecting some amount of noise, led to development of differential privacy.

In 2006, Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam D. Smith published an article formalizing the amount of noise that needed to be added and proposing a generalized mechanism for doing so.[1] Their work was a co-recipient of the 2016 TCC Test-of-Time Award[5] and the 2017 Gödel Prize.[6]

Since then, subsequent research has shown that there are many ways to produce very accurate statistics from the database while still ensuring high levels of privacy.[7][8]

## ε-differential privacy

The 2006 Dwork, McSherry, Nissim and Smith article introduced the concept of ε-differential privacy, a mathematical definition for the privacy loss associated with any data release drawn from a statistical database. (Here, the term statistical database means a set of data that are collected under the pledge of confidentiality for the purpose of producing statistics that, by their production, do not compromise the privacy of those individuals who provided the data.)

The intuition for the 2006 definition of ε-differential privacy is that a person's privacy cannot be compromised by a statistical release if their data are not in the database. Therefore, with differential privacy, the goal is to give each individual roughly the same privacy that would result from having their data removed. That is, the statistical functions run on the database should not overly depend on the data of any one individual.

Of course, how much any individual contributes to the result of a database query depends in part on how many people's data are involved in the query. If the database contains data from a single person, that person's data contributes 100%. If the database contains data from a hundred people, each person's data contributes just 1%. The key insight of differential privacy is that as the query is made on the data of fewer and fewer people, more noise needs to be added to the query result to produce the same amount of privacy. Hence the name of the 2006 paper, "Calibrating noise to sensitivity in private data analysis."

The 2006 paper presents both a mathematical definition of differential privacy and a mechanism based on the addition of Laplace noise (i.e. noise coming from the Laplace distribution) that satisfies the definition.

### Definition of ε-differential privacy

Let ε be a positive real number and $\displaystyle{ \mathcal{A} }$ be a randomized algorithm that takes a dataset as input (representing the actions of the trusted party holding the data). Let $\displaystyle{ \textrm{im}\ \mathcal{A} }$ denote the image of $\displaystyle{ \mathcal{A} }$. The algorithm $\displaystyle{ \mathcal{A} }$ is said to provide $\displaystyle{ \epsilon }$-differential privacy if, for all datasets $\displaystyle{ D_1 }$ and $\displaystyle{ D_2 }$ that differ on a single element (i.e., the data of one person), and all subsets $\displaystyle{ S }$ of $\displaystyle{ \textrm{im}\ \mathcal{A} }$:

$\displaystyle{ \Pr[\mathcal{A}(D_1) \in S] \leq \exp\left(\epsilon\right) \cdot \Pr[\mathcal{A}(D_2) \in S], }$

where the probability is taken over the randomness used by the algorithm.[9]
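
The definition can be checked numerically. The following minimal sketch (ours, not from the 2006 paper) estimates both probabilities for a counting query protected with Laplace noise, the mechanism described later in this article, on two neighboring databases and verifies the $\displaystyle{ \exp(\epsilon) }$ bound for one event $\displaystyle{ S }$:

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon = 1.0

def mechanism(db):
    # Counting query protected with Laplace noise of scale 1/epsilon
    # (this is the Laplace mechanism described later in the article).
    return sum(db) + rng.laplace(scale=1.0 / epsilon)

# Neighboring databases: they differ in one person's record.
d1 = [1, 1, 0, 0, 1, 0]
d2 = [1, 1, 0, 0, 0, 0]

# Estimate Pr[A(D) in S] for the event S = {output >= 2.5}.
trials = 100_000
p1 = sum(mechanism(d1) >= 2.5 for _ in range(trials)) / trials
p2 = sum(mechanism(d2) >= 2.5 for _ in range(trials)) / trials
print(p1 <= np.exp(epsilon) * p2)  # True (up to sampling error)
```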

Differential privacy offers strong and robust guarantees that facilitate modular design and analysis of differentially private mechanisms due to its composability, robustness to post-processing, and graceful degradation in the presence of correlated data.

### Composability

(Self-)composability refers to the fact that the joint distribution of the outputs of (possibly adaptively chosen) differentially private mechanisms satisfies differential privacy.

Sequential composition. If we query an ε-differential privacy mechanism $\displaystyle{ t }$ times, and the randomization of the mechanism is independent for each query, then the result would be $\displaystyle{ \epsilon t }$-differentially private. In the more general case, if there are $\displaystyle{ n }$ independent mechanisms: $\displaystyle{ \mathcal{M}_1,\dots,\mathcal{M}_n }$, whose privacy guarantees are $\displaystyle{ \epsilon_1,\dots,\epsilon_n }$ differential privacy, respectively, then any function $\displaystyle{ g }$ of them: $\displaystyle{ g(\mathcal{M}_1,\dots,\mathcal{M}_n) }$ is $\displaystyle{ \left(\sum\limits_{i=1}^{n} \epsilon_i\right) }$-differentially private.[10]

Parallel composition. If the previous mechanisms are computed on disjoint subsets of the private database then the function $\displaystyle{ g }$ would be $\displaystyle{ (\max_i \epsilon_i) }$-differentially private instead.[10]
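
The two composition rules translate directly into privacy-budget accounting. A toy sketch (the function names are ours, not a standard API):

```python
# Sequential composition: mechanisms run on the same data, epsilons add up.
def sequential_budget(epsilons):
    return sum(epsilons)

# Parallel composition: mechanisms run on disjoint subsets, worst case wins.
def parallel_budget(epsilons):
    return max(epsilons)

print(sequential_budget([0.1, 0.2, 0.5]))  # 0.8: total privacy loss
print(parallel_budget([0.1, 0.2, 0.5]))    # 0.5: disjoint data, maximum
```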

### Robustness to post-processing

For any deterministic or randomized function $\displaystyle{ F }$ defined over the image of the mechanism $\displaystyle{ \mathcal{A} }$, if $\displaystyle{ \mathcal{A} }$ satisfies ε-differential privacy, so does $\displaystyle{ F(\mathcal{A}) }$.

Together, composability and robustness to post-processing permit modular construction and analysis of differentially private mechanisms and motivate the concept of the privacy loss budget. If all elements that access sensitive data of a complex mechanism are separately differentially private, so will be their combination, followed by arbitrary post-processing.

### Group privacy

In general, ε-differential privacy is designed to protect the privacy between neighboring databases which differ only in one row. This means that no adversary with arbitrary auxiliary information can tell whether one particular participant submitted their information. The definition also extends to protecting databases that differ in $\displaystyle{ c }$ rows, which amounts to preventing an adversary with arbitrary auxiliary information from telling whether $\displaystyle{ c }$ particular participants submitted their information. This can be achieved because if $\displaystyle{ c }$ items change, the probability dilation is bounded by $\displaystyle{ \exp ( \epsilon c ) }$ instead of $\displaystyle{ \exp ( \epsilon ) }$,[11] i.e., for D1 and D2 differing on $\displaystyle{ c }$ items:

$\displaystyle{ \Pr[\mathcal{A}(D_{1})\in S]\leq \exp(\epsilon c)\cdot\Pr[\mathcal{A}(D_{2})\in S]\,\! }$
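
This bound follows by chaining the single-row guarantee along a path of intermediate databases $\displaystyle{ D_1 = E_0, E_1, \dots, E_c = D_2 }$ in which consecutive databases differ in one item:

$\displaystyle{ \Pr[\mathcal{A}(E_0)\in S] \leq \exp(\epsilon)\cdot\Pr[\mathcal{A}(E_1)\in S] \leq \dots \leq \exp(\epsilon c)\cdot\Pr[\mathcal{A}(E_c)\in S]. }$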

Thus setting ε instead to $\displaystyle{ \epsilon/c }$ achieves the desired result (protection of $\displaystyle{ c }$ items). In other words, instead of each item being ε-differentially private, now every group of $\displaystyle{ c }$ items is ε-differentially private (and each item is $\displaystyle{ (\epsilon/c) }$-differentially private).

## ε-differentially private mechanisms

Since differential privacy is a probabilistic concept, any differentially private mechanism is necessarily randomized. Some of these, like the Laplace mechanism, described below, rely on adding controlled noise to the function that we want to compute. Others, like the exponential mechanism[12] and posterior sampling[13] sample from a problem-dependent family of distributions instead.

### Sensitivity

Let $\displaystyle{ d }$ be a positive integer, $\displaystyle{ \mathcal{D} }$ be a collection of datasets, and $\displaystyle{ f \colon \mathcal{D} \rightarrow \mathbb{R}^d }$ be a function. The sensitivity [1] of a function, denoted $\displaystyle{ \Delta f }$, is defined by

$\displaystyle{ \Delta f=\max \lVert f(D_1)-f(D_2) \rVert_1, }$

where the maximum is over all pairs of datasets $\displaystyle{ D_1 }$ and $\displaystyle{ D_2 }$ in $\displaystyle{ \mathcal{D} }$ differing in at most one element and $\displaystyle{ \lVert \cdot \rVert_1 }$ denotes the $\displaystyle{ \ell_1 }$ norm.

In the example of the medical database below, if we consider $\displaystyle{ f }$ to be the function $\displaystyle{ Q_i }$, then the sensitivity of the function is one, since changing any one of the entries in the database causes the output of the function to change by either zero or one.
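
As a sanity check, the sensitivity of such a counting query can be brute-forced over all pairs of neighboring databases. A small illustrative sketch (ours, not from the article):

```python
from itertools import product

def sensitivity(f, n):
    # Maximize |f(D1) - f(D2)| over all boolean databases of length n
    # that differ in exactly one entry.
    delta = 0
    for d1 in product([0, 1], repeat=n):
        for i in range(n):
            d2 = list(d1)
            d2[i] = 1 - d2[i]  # flip one record -> a neighboring database
            delta = max(delta, abs(f(d1) - f(tuple(d2))))
    return delta

print(sensitivity(sum, 5))  # 1, matching the Q_i example above
```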

There are techniques (described below) with which we can create a differentially private algorithm for functions of low sensitivity.

### The Laplace mechanism

The Laplace mechanism adds Laplace noise, i.e. noise drawn from the Laplace distribution, which can be expressed by the probability density function $\displaystyle{ \text{noise}(y)\propto \exp(-|y|/\lambda)\,\! }$ and has mean zero and standard deviation $\displaystyle{ \sqrt{2} \lambda\,\! }$. In our case we define the output function of $\displaystyle{ \mathcal{A}\,\! }$ as a real-valued function (called the transcript output by $\displaystyle{ \mathcal{A}\,\! }$) $\displaystyle{ \mathcal{T}_{\mathcal{A}}(x)=f(x)+Y\,\! }$, where $\displaystyle{ Y \sim \text{Lap}(\lambda)\,\! }$ and $\displaystyle{ f\,\! }$ is the original real-valued query/function we planned to execute on the database. Clearly $\displaystyle{ \mathcal{T}_{\mathcal{A}}(x)\,\! }$ can be considered to be a continuous random variable, where

$\displaystyle{ \frac{\mathrm{pdf}(\mathcal{T}_{\mathcal{A},D_1}(x)=t)}{\mathrm{pdf}(\mathcal{T}_{\mathcal{A},D_2}(x)=t)}=\frac{\text{noise}(t-f(D_1))}{\text{noise}(t-f(D_2))}\,\! }$

which is at most $\displaystyle{ e^{\frac{|f(D_{1})-f(D_{2})|}{\lambda}}\leq e^{\frac{\Delta(f)}{\lambda}}\,\! }$. We can consider $\displaystyle{ \frac{\Delta(f)}{\lambda}\,\! }$ to be the privacy factor $\displaystyle{ \epsilon\,\! }$. Thus $\displaystyle{ \mathcal{T}\,\! }$ follows a differentially private mechanism (as can be seen from the definition above). If we apply this concept to our diabetes example, then it follows from the above derived fact that in order for $\displaystyle{ \mathcal{A}\,\! }$ to be the $\displaystyle{ \epsilon\,\! }$-differentially private algorithm we need to have $\displaystyle{ \lambda=1/\epsilon\,\! }$. Though we have used Laplace noise here, other forms of noise, such as Gaussian noise, can be employed, but they may require a slight relaxation of the definition of differential privacy.[11]
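
A minimal sketch of the mechanism in Python (the helper name is ours), calibrating the noise scale to $\displaystyle{ \Delta f/\epsilon }$:

```python
import numpy as np

rng = np.random.default_rng()

def laplace_mechanism(f, db, delta_f, epsilon):
    """Return f(db) + Y with Y ~ Lap(delta_f / epsilon), which is
    epsilon-differentially private for a query f of sensitivity delta_f."""
    return f(db) + rng.laplace(scale=delta_f / epsilon)

# The diabetes counting query has sensitivity 1, so lambda = 1/epsilon.
db = [1, 1, 0, 0, 1, 0]
print(laplace_mechanism(sum, db, delta_f=1.0, epsilon=0.5))
```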

According to this definition, differential privacy is a condition on the release mechanism (i.e., the trusted party releasing information about the dataset) and not on the dataset itself. Intuitively, this means that for any two datasets that are similar, a given differentially private algorithm will behave approximately the same on both datasets. The definition gives a strong guarantee that presence or absence of an individual will not affect the final output of the algorithm significantly.

For example, assume we have a database of medical records $\displaystyle{ D_1 }$ where each record is a pair (Name, X), where $\displaystyle{ X }$ is a Boolean denoting whether a person has diabetes or not. For example:

| Name | Has Diabetes (X) |
| --- | --- |
| Ross | 1 |
| Monica | 1 |
| Joey | 0 |
| Phoebe | 0 |
| Chandler | 1 |
| Rachel | 0 |

Now suppose a malicious user (often termed an adversary) wants to find whether Chandler has diabetes or not. Suppose he also knows in which row of the database Chandler resides. Now suppose the adversary is only allowed to use a particular form of query $\displaystyle{ Q_i }$ that returns the partial sum of the first $\displaystyle{ i }$ rows of column $\displaystyle{ X }$ in the database. In order to find Chandler's diabetes status the adversary executes $\displaystyle{ Q_5(D_1) }$ and $\displaystyle{ Q_4(D_1) }$, then computes their difference. In this example, $\displaystyle{ Q_5(D_1) = 3 }$ and $\displaystyle{ Q_4(D_1) = 2 }$, so their difference is 1. This indicates that the "Has Diabetes" field in Chandler's row must be 1. This example highlights how individual information can be compromised even without explicitly querying for the information of a specific individual.
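
The differencing attack just described is mechanical enough to express in a few lines (an illustrative sketch):

```python
# The X column: Ross, Monica, Joey, Phoebe, Chandler, Rachel.
rows = [1, 1, 0, 0, 1, 0]

def Q(i):
    # The only query form the adversary may use: partial sum of the first i rows.
    return sum(rows[:i])

# Chandler occupies row 5, so the difference isolates his record exactly.
print(Q(5) - Q(4))  # 1 -> Chandler's "Has Diabetes" field is 1
```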

Continuing this example, if we construct $\displaystyle{ D_2 }$ by replacing (Chandler, 1) with (Chandler, 0) then this malicious adversary will be able to distinguish $\displaystyle{ D_2 }$ from $\displaystyle{ D_1 }$ by computing $\displaystyle{ Q_5 - Q_4 }$ for each dataset. If the adversary were required to receive the values $\displaystyle{ Q_i }$ via an $\displaystyle{ \epsilon }$-differentially private algorithm, for a sufficiently small $\displaystyle{ \epsilon }$, then he or she would be unable to distinguish between the two datasets.

### Randomized response

A simple example, especially developed in the social sciences,[14] is to ask a person to answer the question "Do you own the attribute A?", according to the following procedure:

1. Toss a coin.
2. If heads, then toss the coin again (ignoring the outcome), and answer the question honestly.
3. If tails, then toss the coin again and answer "Yes" if heads, "No" if tails.

(The seemingly redundant extra toss in the first case is needed in situations where just the act of tossing a coin may be observed by others, even if the actual result stays hidden.) The confidentiality then arises from the refutability of the individual responses.

But, overall, these data with many responses are significant, since a positive response is given by a quarter of the people who do not have attribute A and by three-quarters of the people who actually possess it. Thus, if p is the true proportion of people with A, then we expect to obtain (1/4)(1-p) + (3/4)p = (1/4) + p/2 positive responses. Hence it is possible to estimate p.
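
The estimation step can be simulated directly. An illustrative sketch (the true proportion p_true is an assumption of the simulation):

```python
import numpy as np

rng = np.random.default_rng(1)
p_true, n = 0.3, 100_000

has_a = rng.random(n) < p_true     # who actually has attribute A
heads1 = rng.random(n) < 0.5       # first toss: heads -> answer honestly
heads2 = rng.random(n) < 0.5       # second toss: heads -> "Yes" in the tails branch
answers = np.where(heads1, has_a, heads2)

# Pr["Yes"] = 1/4 + p/2, so invert the expected response rate to estimate p.
p_hat = 2 * (answers.mean() - 0.25)
print(round(float(p_hat), 3))      # approximately 0.3
```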

In particular, if the attribute A is synonymous with illegal behavior, then answering "Yes" is not incriminating, insofar as the person has a probability of a "Yes" response, whatever it may be.

Although this example, inspired by randomized response, might be applicable to microdata (i.e., releasing datasets with each individual response), by definition differential privacy excludes microdata releases and is only applicable to queries (i.e., aggregating individual responses into one result), as releasing microdata would violate the requirements, more specifically the plausible deniability of whether a subject participated.[15][16]

### Stable transformations

A transformation $\displaystyle{ T }$ is $\displaystyle{ c }$-stable if the Hamming distance between $\displaystyle{ T(A) }$ and $\displaystyle{ T(B) }$ is at most $\displaystyle{ c }$-times the Hamming distance between $\displaystyle{ A }$ and $\displaystyle{ B }$ for any two databases $\displaystyle{ A,B }$. Theorem 2 in [10] asserts that if there is a mechanism $\displaystyle{ M }$ that is $\displaystyle{ \epsilon }$-differentially private, then the composite mechanism $\displaystyle{ M\circ T }$ is $\displaystyle{ (\epsilon \times c) }$-differentially private.

This could be generalized to group privacy, as the group size could be thought of as the Hamming distance $\displaystyle{ h }$ between $\displaystyle{ A }$ and $\displaystyle{ B }$ (where $\displaystyle{ A }$ contains the group and $\displaystyle{ B }$ doesn't). In this case $\displaystyle{ M\circ T }$ is $\displaystyle{ (\epsilon \times c \times h) }$-differentially private.
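
A small illustrative sketch (ours) of the theorem with $\displaystyle{ c=1 }$: a per-record clamping transformation is 1-stable, so composing it with an ε-differentially private count preserves the guarantee:

```python
import numpy as np

rng = np.random.default_rng()
epsilon = 0.5

def clamp(db, lo=0, hi=1):
    # Changing one input record changes at most one output record: 1-stable.
    return [min(max(x, lo), hi) for x in db]

def dp_count(db):
    # After clamping, records lie in [0, 1], so the sum has sensitivity 1
    # and Lap(1/epsilon) noise makes the count epsilon-DP.
    return sum(db) + rng.laplace(scale=1.0 / epsilon)

print(dp_count(clamp([1, 3, 0, -2, 1])))  # epsilon-DP overall (c = 1)
```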

## Other notions of differential privacy

Since differential privacy is considered to be too strong or weak for some applications, many versions of it have been proposed.[17] The most widespread relaxation is (ε, δ)-differential privacy,[18] which weakens the definition by allowing an additional small δ density of probability on which the upper bound ε does not hold.
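
Formally, a mechanism $\displaystyle{ \mathcal{A} }$ satisfies (ε, δ)-differential privacy if, for all neighboring datasets $\displaystyle{ D_1 }$ and $\displaystyle{ D_2 }$ and all subsets $\displaystyle{ S }$ of $\displaystyle{ \textrm{im}\ \mathcal{A} }$:

$\displaystyle{ \Pr[\mathcal{A}(D_1) \in S] \leq \exp(\epsilon) \cdot \Pr[\mathcal{A}(D_2) \in S] + \delta. }$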

## Adoption of differential privacy in real-world applications

Several uses of differential privacy in practice are known to date:

• 2008: U.S. Census Bureau, for showing commuting patterns.[19]
• 2014: Google's RAPPOR, for telemetry such as learning statistics about unwanted software hijacking users' settings.[20][21]
• 2015: Google, for sharing historical traffic statistics.[22]
• 2016: Apple announced its intention to use differential privacy in iOS 10 to improve its intelligent personal assistant technology.[23]
• 2017: Microsoft, for telemetry in Windows.[24]
• 2019: Privitar Lens is an API using differential privacy.[25]

## Public purpose considerations

There are several public purpose considerations regarding differential privacy that are important to consider, especially for policymakers and policy-focused audiences interested in the social opportunities and risks of the technology:[27]

• Data Utility & Accuracy. The main concern with differential privacy is the tradeoff between data utility and individual privacy. If the privacy loss parameter is set to favor utility, the privacy benefits are lowered (less “noise” is injected into the system); if the privacy loss parameter is set to favor heavy privacy, the accuracy and utility of the dataset are lowered (more “noise” is injected into the system). It is important for policymakers to consider the tradeoffs posed by differential privacy in order to help set appropriate best practices and standards around the use of this privacy preserving practice, especially considering the diversity in organizational use cases. It is worth noting, though, that decreased accuracy and utility is a common issue among all statistical disclosure limitation methods and is not unique to differential privacy. What is unique, however, is how policymakers, researchers, and implementers can consider mitigating against the risks presented through this tradeoff.
• Data Utility & Accuracy. The main concern with differential privacy is the tradeoff between data utility and individual privacy. If the privacy loss parameter is set to favor utility, the privacy benefits are lowered (less “noise” is injected into the system); if the privacy loss parameter is set to favor heavy privacy, the accuracy and utility of the dataset are lowered (more “noise” is injected into the system). It is important for policymakers to consider the tradeoffs posed by differential privacy in order to help set appropriate best practices and standards around the use of this privacy preserving practice, especially considering the diversity in organizational use cases. It is worth noting, though, that decreased accuracy and utility is a common issue among all statistical disclosure limitation methods and is not unique to differential privacy. What is unique, however, is how policymakers, researchers, and implementers can consider mitigating against the risks presented through this tradeoff.

• 数据的实用性及准确性。差分隐私的主要关注点在于数据效用和个人隐私之间的权衡。如果将隐私损失参数设置为有利于实用性，则隐私好处降低(向系统中注入的“噪音”较少) ; 如果将隐私损失参数设置为有利于重隐私性，则数据集的准确性和实用性降低(向系统中注入更多的“噪音”)。对于决策者来说，重要的是要考虑到差分隐私的权衡，以帮助建立适当的最佳实践和标准来使用这种隐私保护实践，特别是考虑到组织用例的多样性。值得注意的是，在所有的统计披露限制方法中，降低准确性和效用是一个共同的问题，并不是差分隐私唯一的。然而，独特之处在于，决策者、研究人员和实施者可以考虑如何减轻这种权衡带来的风险。
• Data Privacy & Security. Differential privacy provides a quantified measure of privacy loss and an upper bound and allows curators to choose the explicit tradeoff between privacy and accuracy. It is robust to still unknown privacy attacks. However, it encourages greater data sharing, which if done poorly, increases privacy risk. Differential privacy implies that privacy is protected, but this depends very much on the privacy loss parameter chosen and may instead lead to a false sense of security. Finally, though it is robust against unforeseen future privacy attacks, a countermeasure may be devised that we cannot predict.
• Data Privacy & Security. Differential privacy provides a quantified measure of privacy loss and an upper bound and allows curators to choose the explicit tradeoff between privacy and accuracy. It is robust to still unknown privacy attacks. However, it encourages greater data sharing, which if done poorly, increases privacy risk. Differential privacy implies that privacy is protected, but this depends very much on the privacy loss parameter chosen and may instead lead to a false sense of security. Finally, though it is robust against unforeseen future privacy attacks, a countermeasure may be devised that we cannot predict.

• 资料私隐及保安。差分隐私图书馆提供了一个量化的隐私损失度量和上限，并允许馆长在隐私和准确性之间做出明确的权衡。它对仍然未知的隐私攻击是健壮的。然而，它鼓励更大的数据共享，如果做得不好，会增加隐私风险。差分隐私意味着隐私是受到保护的，但这在很大程度上取决于选择的隐私损失参数，并可能会导致错误的安全感。最后，尽管它对未来不可预见的隐私攻击是健壮的，但可以设计出一种我们无法预测的对策。

## See also

• Quasi-identifier
• Exponential mechanism (differential privacy) – a technique for designing differentially private algorithms
• k-anonymity
• Differentially private analysis of graphs
• Protected health information

## References

• A reading list on differential privacy
• Abowd, John. 2017. “How Will Statistical Agencies Operate When All Data Are Private?”. Journal of Privacy and Confidentiality 7 (3).
• "Differential Privacy: A Primer for a Non-technical Audience", Kobbi Nissim, Thomas Steinke, Alexandra Wood, Micah Altman, Aaron Bembenek, Mark Bun, Marco Gaboardi, David R. O’Brien, and Salil Vadhan, Harvard Privacy Tools Project, February 14, 2018
• Dinur, Irit and Kobbi Nissim. 2003. Revealing information while preserving privacy. In Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems (PODS '03). ACM, New York, NY, USA, 202–210.
• Dwork, Cynthia, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating Noise to Sensitivity in Private Data Analysis. In Halevi, S. & Rabin, T. (Eds.), Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4–7, 2006. Proceedings, Springer Berlin Heidelberg, 265–284.
• Dwork, Cynthia. 2006. Differential Privacy. 33rd International Colloquium on Automata, Languages and Programming, part II (ICALP 2006), Springer Verlag, 4052, 1–12.
• Dwork, Cynthia and Aaron Roth. 2014. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science. Vol. 9, Nos. 3–4, 211–407.
• Machanavajjhala, Ashwin, Daniel Kifer, John M. Abowd, Johannes Gehrke, and Lars Vilhuber. 2008. Privacy: Theory Meets Practice on the Map. International Conference on Data Engineering (ICDE) 2008: 277–286.
• Dwork, Cynthia and Moni Naor. 2010. On the Difficulties of Disclosure Prevention in Statistical Databases or The Case for Differential Privacy. Journal of Privacy and Confidentiality: Vol. 2: Iss. 1, Article 8. Available at: http://repository.cmu.edu/jpc/vol2/iss1/8.
• Kifer, Daniel and Ashwin Machanavajjhala. 2011. No free lunch in data privacy. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (SIGMOD '11). ACM, New York, NY, USA, 193–204.
• Erlingsson, Úlfar, Vasyl Pihur and Aleksandra Korolova. 2014. RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (CCS '14). ACM, New York, NY, USA, 1054–1067.
• Abowd, John M. and Ian M. Schmutte. 2017. Revisiting the economics of privacy: Population statistics and confidentiality protection as public goods. Labor Dynamics Institute, Cornell University, at https://digitalcommons.ilr.cornell.edu/ldi/37/
• Abowd, John M. and Ian M. Schmutte. Forthcoming. An Economic Analysis of Privacy Protection and Statistical Accuracy as Social Choices. American Economic Review.
• Apple, Inc. 2016. Apple previews iOS 10, the biggest iOS release ever. Press Release (June 13). https://www.apple.com/newsroom/2016/06/apple-previews-ios-10-biggest-ios-release-ever.html
• Ding, Bolin, Janardhan Kulkarni, and Sergey Yekhanin. 2017. Collecting Telemetry Data Privately. NIPS 2017.
• http://www.win-vector.com/blog/2015/10/a-simpler-explanation-of-differential-privacy/
• Ryffel, Theo, Andrew Trask, et al. "A generic framework for privacy preserving deep learning"
