差分隐私

来自集智百科 - 复杂系统|人工智能|复杂科学|复杂网络|自组织
Litinunispazio97讨论 | 贡献2021年11月13日 (六) 09:36的版本
跳到导航 跳到搜索

此词条暂由彩云小译翻译,翻译字数共4096,未经人工整理,Litinunispazio97审校中,带来阅读不便,请见谅。

Differential privacy is a system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset. The idea behind differential privacy is that if the effect of making an arbitrary single substitution in the database is small enough, the query result cannot be used to infer much about any single individual, and therefore provides privacy. Another way to describe differential privacy is as a constraint on the algorithms used to publish aggregate information about a statistical database which limits the disclosure of private information of records whose information is in the database. For example, differentially private algorithms are used by some government agencies to publish demographic information or other statistical aggregates while ensuring confidentiality of survey responses, and by companies to collect information about user behavior while controlling what is visible even to internal analysts.

差分隐私是一个用于公开分享数据集信息的系统,它在描述数据集中的群体特征的同时保护了数据集中的个人信息。差分隐私的理念是,如果在数据库中进行任意单次更迭的影响足够小,那么查询结果就不能用于推断任何单一个体的大量信息,因此个体的隐私得以保证。另一种对于差分隐私的描述表示,这是针对发布统计数据库Statistical Database的汇集信息的算法的约束条件,这种算法限制了数据库中信息的记录的个人信息的披露。例如,一些政府机构使用差分隐私算法公布人口信息或其他统计数据,同时确保调查结果的保密性Confidentiality;而公司则使用该算法收集用户行为信息,同时控制哪些信息对于内部分析人员是可见的。

Roughly, an algorithm is differentially private if an observer seeing its output cannot tell if a particular individual's information was used in the computation. Differential privacy is often discussed in the context of identifying individuals whose information may be in a database. Although it does not directly refer to identification and reidentification attacks, differentially private algorithms probably resist such attacks.[1]

简单来说,如果观察者发现某算法的输出不能推断出一个特定个体的信息是否被用于其计算,则该算法是差分隐私的。通常,差分隐私算法会在识别其信息可能存在于数据库中的个体时被讨论。虽然差分隐私算法不直接涉及身份识别和身份重识别Reidentification攻击,但它或许能够防御这些攻击。[1]

Differential privacy was developed by cryptographers and thus is often associated with cryptography, and draws much of its language from cryptography.

差分隐私是由密码学家开发的,因此经常与密码学Cryptography相关,且其大量内容来自于密码学。

History 历史

Official statistics organizations are charged with collecting information from individuals or establishments, and publishing aggregate data to serve the public interest. For example, the 1790 United States Census collected information about individuals living in the United States and published tabulations based on sex, age, race, and condition of servitude. Statistical organizations have long collected information under a promise of confidentiality that the information provided will be used for statistical purposes, but that the publications will not produce information that can be traced back to a specific individual or establishment. To accomplish this goal, statistical organizations have long suppressed information in their publications. For example, in a table presenting the sales of each business in a town grouped by business category, a cell that has information from only one company might be suppressed, in order to maintain the confidentiality of that company's specific sales.

官方统计机构负责收集个人或机构的信息并公布汇集数据,以服务公众利益。例如,1790年的美国人口普查1790 United States Census收集了生活在美国的个人的信息,并公布了基于性别、年龄、种族和奴役情况的表格。统计机构长期保证收集信息的保密性Confidentiality,即所提供的信息将用于统计目的,但出版物不会产生可追溯到具体个人或机构的信息。为了实现这一目标,统计机构长期以来一直在其出版物中压缩信息。例如,在按产业类别表示某个城镇中企业的销售情况的表格中,只有一家公司信息的单元格可能会被隐藏,从而对该公司的具体销售情况保密。

The adoption of electronic information processing systems by statistical agencies in the 1950s and 1960s dramatically increased the number of tables that a statistical organization could produce and, in so doing, significantly increased the potential for an improper disclosure of confidential information. For example, if a business that had its sales numbers suppressed also had those numbers appear in the total sales of a region, then it might be possible to determine the suppressed value by subtracting the other sales from that total. But there might also be combinations of additions and subtractions that might cause the private information to be revealed. The number of combinations that needed to be checked increases exponentially with the number of publications, and it is potentially unbounded if data users are able to make queries of the statistical database using an interactive query system.

1950年代和1960年代,统计机构采用了电子信息处理系统,这大量增加了统计机构可产出的表格数量,从而显著提高了不当暴露机密信息的潜在可能性。例如,如果一个企业的销售数字被隐藏,这些数字也被统计在该地区的总销售额中,那么可能从总销售额中减去其他销售额即可确定被隐藏的数据。或者也有可能通过增法和减法的结合,导致隐私信息被泄露。随着出版物数量的增加,需要检查的组合数量呈指数增长,甚至没有上限——如果数据的用户能够使用交互式查询系统对统计数据库进行查询。

In 1977, Tore Dalenius formalized the mathematics of cell suppression.[2]

1977年,托雷 · 达伦纽斯规范了单元抑制Cell Suppression的数学模型。[2]

In 1979, Dorothy Denning, Peter J. Denning and Mayer D. Schwartz formalized the concept of a Tracker, an adversary that could learn the confidential contents of a statistical database by creating a series of targeted queries and remembering the results.[3] This and future research showed that privacy properties in a database could only be preserved by considering each new query in light of (possibly all) previous queries. This line of work is sometimes called query privacy, with the final result being that tracking the impact of a query on the privacy of individuals in the database was NP-hard.

1979年,多萝西 · 丹宁(Dorothy Denning)、彼得 · j · 丹宁(Peter j. Denning)和迈耶 · d · 施瓦茨(Mayer d. Schwartz)正式提出了“跟踪器”(Tracker)的概念,这个对手可以通过创建一系列有针对性的查询并储存查询结果,来分析获取统计数据库的机密内容。[3]这项研究和包括之后的研究表明,只有根据(可能是所有)以前的查询来考虑每个新的查询,才能保证数据库的隐私属性。这种工作有时被称为查询隐私,最终的结果是,追溯某查询对于数据库中个人隐私造成的影响是 NP-hard问题。

In 2003, Kobbi Nissim and Irit Dinur demonstrated that it is impossible to publish arbitrary queries on a private statistical database without revealing some amount of private information, and that the entire information content of the database can be revealed by publishing the results of a surprisingly small number of random queries—far fewer than was implied by previous work.[4] The general phenomenon is known as the Fundamental Law of Information Recovery, and its key insight, namely that in the most general case, privacy cannot be protected without injecting some amount of noise, led to development of differential privacy.

2003年,Kobbi Nissim和Irit Dinur证明,在私人统计数据库上发布任意查询而不泄露一些私人信息是不可能的,而且通过公开少得惊人的随机查询的结果就可能会显示数据库的全部信息内容——远远少于以往研究的推断。[4]这种现象被称为信息恢复基本定律Fundamental Law of Information Recovery,其核心观点是,在最普遍的情况下,隐私无法在不加入一定量的噪音的情况下得到保护,这导致了差分隐私的发展。

In 2006, Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam D. Smith published an article formalizing the amount of noise that needed to be added and proposing a generalized mechanism for doing so.[1] Their work was a co-recipient of the 2016 TCC Test-of-Time Award[5] and the 2017 Gödel Prize.[6]

In 2006, Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam D. Smith published an article formalizing the amount of noise that needed to be added and proposing a generalized mechanism for doing so. Their work was a co-recipient of the 2016 TCC Test-of-Time Award and the 2017 Gödel Prize.

2006年,辛西娅 · 德沃克、弗兰克 · 麦克谢里、科比 · 尼西姆和亚当 · d · 史密斯发表了一篇文章,正式确定了需要增加的噪音量,并提出了一种通用的增加噪音的机制。[1] 他们的工作共同获得了2016年移动通信协会时间检验奖[5]和2017年哥德尔奖[6]

Since then, subsequent research has shown that there are many ways to produce very accurate statistics from the database while still ensuring high levels of privacy.[7][8]

自此,后续的研究展示了许多方法,可以在保证高度隐私的同时,从数据库中生成非常准确的统计数据。[7][8]

ε-differential privacy ε-差分隐私

The 2006 Dwork, McSherry, Nissim and Smith article introduced the concept of ε-differential privacy, a mathematical definition for the privacy loss associated with any data release drawn from a statistical database. (Here, the term statistical database means a set of data that are collected under the pledge of confidentiality for the purpose of producing statistics that, by their production, do not compromise the privacy of those individuals who provided the data.)


2006年Dwork、McSherry、Nissim和Smith的文章引入了ε-差分隐私的概念,这是一个数学定义,用来定义从统计数据库中发布的任何数据所导致的隐私损失。(此处的统计数据库一词是指遵从保密性承诺收集的一组数据,目的是统计数据,同时不会因其数据的产生损害提供数据的个人的隐私。)


The intuition for the 2006 definition of ε-differential privacy is that a person's privacy cannot be compromised by a statistical release if their data are not in the database. Therefore, with differential privacy, the goal is to give each individual roughly the same privacy that would result from having their data removed. That is, the statistical functions run on the database should not overly depend on the data of any one individual.

2006年,人们对ε-差分隐私的定义的直觉认知是,如果一个人的数据不在数据库中,那么他的隐私就不会因为统计数据的公布而受到损害。因此,差分隐私的目标是给每个人大体相同的隐私性,这将导致他们的数据删除。也就是说,在数据库上运行的统计功能不应过分依赖于任何个人的数据。

Of course, how much any individual contributes to the result of a database query depends in part on how many people's data are involved in the query. If the database contains data from a single person, that person's data contributes 100%. If the database contains data from a hundred people, each person's data contributes just 1%. The key insight of differential privacy is that as the query is made on the data of fewer and fewer people, more noise needs to be added to the query result to produce the same amount of privacy. Hence the name of the 2006 paper, "Calibrating noise to sensitivity in private data analysis."

当然,任何个体对数据库查询结果的贡献率部分取决于该查询的数据来自的人员的数量。如果数据库只包含来自单个人的数据,则该人的数据贡献率为100% 。如果数据库包含来自100人的数据,则每个人的数据贡献率仅为1% 。差分隐私的主要观点是,由于查询的数据来自的人员数量越来越少,所以查询结果需要被添加更多的噪音以达到同样程度的隐私性。因此,这篇2006年的论文被命名为《隐私数据分析中噪声灵敏度的校准》。

The 2006 paper presents both a mathematical definition of differential privacy and a mechanism based on the addition of Laplace noise (i.e. noise coming from the Laplace distribution) that satisfies the definition.


这篇2006年的论文给出了差分隐私的数学定义,以及加入拉普拉斯噪音(i. e. 满足拉普拉斯分布Laplace Distribution的噪音)后能够满足该定义的机制。

Definition of ε-differential privacy ε-差分隐私的定义

Let ε be a positive real number and [math]\displaystyle{ \mathcal{A} }[/math] be a randomized algorithm that takes a dataset as input (representing the actions of the trusted party holding the data). Let [math]\displaystyle{ \textrm{im}\ \mathcal{A} }[/math] denote the image of [math]\displaystyle{ \mathcal{A} }[/math]. The algorithm [math]\displaystyle{ \mathcal{A} }[/math] is said to provide [math]\displaystyle{ \epsilon }[/math]-differential privacy if, for all datasets [math]\displaystyle{ D_1 }[/math] and [math]\displaystyle{ D_2 }[/math] that differ on a single element (i.e., the data of one person), and all subsets [math]\displaystyle{ S }[/math] of [math]\displaystyle{ \textrm{im}\ \mathcal{A} }[/math]:

[math]\displaystyle{ \Pr[\mathcal{A}(D_1) \in S] \leq \exp\left(\epsilon\right) \cdot \Pr[\mathcal{A}(D_2) \in S], }[/math]

where the probability is taken over the randomness used by the algorithm.[9]

设 ε 是一个正实数Real Number,而 [math]\displaystyle{ \mathcal{A} }[/math] 是一个以数据集作为输入(表示持有数据的受信任方的操作)的随机算法Randomized Algorithm。让 [math]\displaystyle{ \textrm{im}\ \mathcal{A} }[/math]表示数学[math]\displaystyle{ \mathcal{A} }[/math]的映像。如果对于所有在单个元素上不同(例如一个人的数据)的数据集 [math]\displaystyle{ D_1 }[/math][math]\displaystyle{ D_2 }[/math] ,以及所有[math]\displaystyle{ \textrm{im}\ \mathcal{A} }[/math]的子集[math]\displaystyle{ S }[/math]满足:

[math]\displaystyle{ \Pr[\mathcal{A}(D_1) \in S] \leq \exp\left(\epsilon\right) \cdot \Pr[\mathcal{A}(D_2) \in S], }[/math]

其中概率取代了算法所使用的随机性,那么算法[math]\displaystyle{ \mathcal{A} }[/math]被认为实现了 [math]\displaystyle{ \epsilon }[/math]差分隐私。[9]


Differential privacy offers strong and robust guarantees that facilitate modular design and analysis of differentially private mechanisms due to its composability, robustness to post-processing, and graceful degradation in the presence of correlated data.

差异化隐私提供了稳健性和鲁棒性保证,其可组合性、对后处理的鲁棒性以及在相关数据存在的情况下的功能损耗,促进了差分隐私机制的模块化设计和分析。

Composability 可组合性

(Self-)composability refers to the fact that the joint distribution of the outputs of (possibly adaptively chosen) differentially private mechanisms satisfies differential privacy.

(Self -)可组合性是指(可能是自适应选择的)差分隐私机制的输出的联合分布满足差分隐私的条件。

Sequential composition. If we query an ε-differential privacy mechanism [math]\displaystyle{ t }[/math] times, and the randomization of the mechanism is independent for each query, then the result would be [math]\displaystyle{ \epsilon t }[/math]-differentially private. In the more general case, if there are [math]\displaystyle{ n }[/math] independent mechanisms: [math]\displaystyle{ \mathcal{M}_1,\dots,\mathcal{M}_n }[/math], whose privacy guarantees are [math]\displaystyle{ \epsilon_1,\dots,\epsilon_n }[/math] differential privacy, respectively, then any function [math]\displaystyle{ g }[/math] of them: [math]\displaystyle{ g(\mathcal{M}_1,\dots,\mathcal{M}_n) }[/math] is [math]\displaystyle{ \left(\sum\limits_{i=1}^{n} \epsilon_i\right) }[/math]-differentially private.[10]

顺序合成。如果我们对一个ε-差分隐私机制进行[math]\displaystyle{ t }[/math]次查询,并且该机制的随机化独立于每个查询,那么结果将是[math]\displaystyle{ \epsilon t }[/math]- 差分隐私的。在更普遍的情况下,如果存在[math]\displaystyle{ n }[/math]个独立的机制:[math]\displaystyle{ \mathcal{M}_1,\dots,\mathcal{M}_n }[/math]——其隐私保证分别是[math]\displaystyle{ \epsilon_1,\dots,\epsilon_n }[/math]差分隐私,那么它们的任何函数[math]\displaystyle{ g }[/math][math]\displaystyle{ g(\mathcal{M}_1,\dots,\mathcal{M}_n) }[/math][math]\displaystyle{ \left(\sum\limits_{i=1}^{n} \epsilon_i\right) }[/math]-差分隐私的。

Parallel composition. If the previous mechanisms are computed on disjoint subsets of the private database then the function [math]\displaystyle{ g }[/math] would be [math]\displaystyle{ (\max_i \epsilon_i) }[/math]-differentially private instead.[10]

并行合成。如果前面的机制是在私有数据库的不相交子集上计算的,那么函数[math]\displaystyle{ g }[/math]将是[math]\displaystyle{ (\max_i \epsilon_i) }[/math]-差分隐私的。[10]

Robustness to post-processing 后处理的鲁棒性

For any deterministic or randomized function [math]\displaystyle{ F }[/math] defined over the image of the mechanism [math]\displaystyle{ \mathcal{A} }[/math], if [math]\displaystyle{ \mathcal{A} }[/math] satisfies ε-differential privacy, so does [math]\displaystyle{ F(\mathcal{A}) }[/math].

对于任意在机制[math]\displaystyle{ \mathcal{A} }[/math]的映射上定义的确定性函数或随机函数[math]\displaystyle{ F }[/math],如果[math]\displaystyle{ \mathcal{A} }[/math]满足ε-差分隐私性,则[math]\displaystyle{ F(\mathcal{A}) }[/math]也满足ε-差分隐私性。

Together, composability and robustness to post-processing permit modular construction and analysis of differentially private mechanisms and motivate the concept of the privacy loss budget. If all elements that access sensitive data of a complex mechanisms are separately differentially private, so will be their combination, followed by arbitrary post-processing.

可组合性和后处理的鲁棒性保证了差分隐私机制的模块化构建和分析,并促成了隐私损失预算的概念。如果访问复杂机制的敏感数据的所有部分都是分别差分隐私的,那么不论接下来是任何后处理,它们的组合依然是差分隐私的。

Group privacy 群体隐私

In general, ε-differential privacy is designed to protect the privacy between neighboring databases which differ only in one row. This means that no adversary with arbitrary auxiliary information can know if one particular participant submitted his information. However this is also extendable if we want to protect databases differing in [math]\displaystyle{ c }[/math] rows, which amounts to adversary with arbitrary auxiliary information can know if [math]\displaystyle{ c }[/math] particular participants submitted their information. This can be achieved because if [math]\displaystyle{ c }[/math] items change, the probability dilation is bounded by [math]\displaystyle{ \exp ( \epsilon c ) }[/math] instead of [math]\displaystyle{ \exp ( \epsilon ) }[/math],[11] i.e., for D1 and D2 differing on [math]\displaystyle{ c }[/math] items:

一般来说,设计ε- 差分隐私的目的是保护只有一行不同的相邻数据库之间的隐私。这意味着任何拥有任意辅助信息的对手都不能知道一个特定的参与者是否提交了他的信息。然而,如果我们想要保护不同于 [math]\displaystyle{ c }[/math]行的数据库,这也是可扩展的,这相当于拥有任意辅助信息的对手可以知道 [math]\displaystyle{ c }[/math]特定的参与者是否提交了他们的信息。这是可以实现的,因为如果 [math]\displaystyle{ c }[/math]项发生变化,概率膨胀的界限是[math]\displaystyle{ \exp ( \epsilon c ) }[/math],而不是[math]\displaystyle{ \exp ( \epsilon ) }[/math][11],例如对于在[math]\displaystyle{ c }[/math]项上不同的 D1和 D2来说:

[math]\displaystyle{ \Pr[\mathcal{A}(D_{1})\in S]\leq \exp(\epsilon c)\cdot\Pr[\mathcal{A}(D_{2})\in S]\,\! }[/math]

Thus setting ε instead to [math]\displaystyle{ \epsilon/c }[/math] achieves the desired result (protection of [math]\displaystyle{ c }[/math] items). In other words, instead of having each item ε-differentially private protected, now every group of [math]\displaystyle{ c }[/math] items is ε-differentially private protected (and each item is [math]\displaystyle{ (\epsilon/c) }[/math]-differentially private protected).

因此,将 ε 设置为[math]\displaystyle{ \epsilon/c }[/math]可以达到预期的结果([math]\displaystyle{ c }[/math]项的保护)。换句话说,并非每一项都实现ε- 差分隐私保护,而是每组的[math]\displaystyle{ c }[/math]项都被ε- 差分隐私保护(并且每一项都被[math]\displaystyle{ (\epsilon/c) }[/math]-差分隐私保护)。

ε-differentially private mechanisms ε-差分隐私机制

Since differential privacy is a probabilistic concept, any differentially private mechanism is necessarily randomized. Some of these, like the Laplace mechanism, described below, rely on adding controlled noise to the function that we want to compute. Others, like the exponential mechanism[12] and posterior sampling[13] sample from a problem-dependent family of distributions instead.

因为差分隐私是一个概率概念,所以任何差分隐私机制都必然是随机的。其中一些,如接下来描述的拉普拉斯机制,依赖于在我们要计算的函数中添加可控噪声。其他的机制,如指数机制[12]和后验抽样[13],则是从一个与问题相关的分布系列中取样。

Sensitivity 敏感性

Let [math]\displaystyle{ d }[/math] be a positive integer, [math]\displaystyle{ \mathcal{D} }[/math] be a collection of datasets, and [math]\displaystyle{ f \colon \mathcal{D} \rightarrow \mathbb{R}^d }[/math] be a function. The sensitivity [1] of a function, denoted [math]\displaystyle{ \Delta f }[/math], is defined by

[math]\displaystyle{ \Delta f=\max \lVert f(D_1)-f(D_2) \rVert_1, }[/math]

where the maximum is over all pairs of datasets [math]\displaystyle{ D_1 }[/math] and [math]\displaystyle{ D_2 }[/math] in [math]\displaystyle{ \mathcal{D} }[/math] differing in at most one element and [math]\displaystyle{ \lVert \cdot \rVert_1 }[/math] denotes the [math]\displaystyle{ \ell_1 }[/math] norm.

[math]\displaystyle{ d }[/math]为一正整数,[math]\displaystyle{ \mathcal{D} }[/math]为一数据集,[math]\displaystyle{ f \colon \mathcal{D} \rightarrow \mathbb{R}^d }[/math]是函数。函数的灵敏度[1]记为[math]\displaystyle{ \Delta f }[/math],定义为:

[math]\displaystyle{ \Delta f=\max \lVert f(D_1)-f(D_2) \rVert_1, }[/math]

其中最大值是[math]\displaystyle{ \mathcal{D} }[/math]中所有最多只有一个元素不同的一对数据集[math]\displaystyle{ D_1 }[/math][math]\displaystyle{ D_2 }[/math],而 [math]\displaystyle{ \lVert \cdot \rVert_1 }[/math]定义了[math]\displaystyle{ \ell_1 }[/math]范数。

In the example of the medical database below, if we consider [math]\displaystyle{ f }[/math] to be the function [math]\displaystyle{ Q_i }[/math], then the sensitivity of the function is one, since changing any one of the entries in the database causes the output of the function to change by either zero or one.

在下面的医学数据库的例子中,如果我们认为[math]\displaystyle{ f }[/math]是函数[math]\displaystyle{ Q_i }[/math],那么函数的灵敏度是1,因为改变数据库中的任何一个条目都会导致函数的输出变为0或1。

There are techniques (which are described below) using which we can create a differentially private algorithm for functions with low sensitivity.

我们也可以通过一些技术(如下所述) 使用低灵敏度函数设计差分隐私算法。

The Laplace mechanism 拉普拉斯机制

The Laplace mechanism adds Laplace noise (i.e. noise from the Laplace distribution, which can be expressed by probability density function [math]\displaystyle{ \text{noise}(y)\propto \exp(-|y|/\lambda)\,\! }[/math], which has mean zero and standard deviation [math]\displaystyle{ \sqrt{2} \lambda\,\! }[/math]). Now in our case we define the output function of [math]\displaystyle{ \mathcal{A}\,\! }[/math] as a real valued function (called as the transcript output by [math]\displaystyle{ \mathcal{A}\,\! }[/math]) as [math]\displaystyle{ \mathcal{T}_{\mathcal{A}}(x)=f(x)+Y\,\! }[/math] where [math]\displaystyle{ Y \sim \text{Lap}(\lambda)\,\!\,\! }[/math] and [math]\displaystyle{ f\,\! }[/math] is the original real valued query/function we planned to execute on the database. Now clearly [math]\displaystyle{ \mathcal{T}_{\mathcal{A}}(x)\,\! }[/math] can be considered to be a continuous random variable, where

[math]\displaystyle{ \frac{\mathrm{pdf}(\mathcal{T}_{\mathcal{A},D_1}(x)=t)}{\mathrm{pdf}(\mathcal{T}_{\mathcal{A},D_2}(x)=t)}=\frac{\text{noise}(t-f(D_1))}{\text{noise}(t-f(D_2))}\,\! }[/math]

拉普拉斯机制增加了拉普拉斯的噪音(例如满足拉普拉斯分布的噪声,其可以用概率密度函数[math]\displaystyle{ \text{noise}(y)\propto \exp(-|y|/\lambda)\,\! }[/math]来表示,这个函数的平均值是0,标准差为[math]\displaystyle{ \sqrt{2} \lambda\,\! }[/math])。在我们的例子中,我们定义了输出函数[math]\displaystyle{ \mathcal{A}\,\! }[/math]作为一个实值函数(由 [math]\displaystyle{ \mathcal{A}\,\! }[/math]调用为文本输出)[math]\displaystyle{ \mathcal{T}_{\mathcal{A}}(x)=f(x)+Y\,\! }[/math],其中[math]\displaystyle{ Y \sim \text{Lap}(\lambda)\,\!\,\! }[/math],且[math]\displaystyle{ f\,\! }[/math]是我们计划在数据库上执行的原始实值的查询(或函数)。现在可以确定,[math]\displaystyle{ \mathcal{T}_{\mathcal{A}}(x)\,\! }[/math]可以被认为是一个连续的随机变量,当

[math]\displaystyle{ \frac{\mathrm{pdf}(\mathcal{T}_{\mathcal{A},D_1}(x)=t)}{\mathrm{pdf}(\mathcal{T}_{\mathcal{A},D_2}(x)=t)}=\frac{\text{noise}(t-f(D_1))}{\text{noise}(t-f(D_2))}\,\! }[/math]

which is at most [math]\displaystyle{ e^{\frac{|f(D_{1})-f(D_{2})|}{\lambda}}\leq e^{\frac{\Delta(f)}{\lambda}}\,\! }[/math]. We can consider [math]\displaystyle{ \frac{\Delta(f)}{\lambda}\,\! }[/math] to be the privacy factor [math]\displaystyle{ \epsilon\,\! }[/math]. Thus [math]\displaystyle{ \mathcal{T}\,\! }[/math] follows a differentially private mechanism (as can be seen from the definition above). If we try to use this concept in our diabetes example then it follows from the above derived fact that in order to have [math]\displaystyle{ \mathcal{A}\,\! }[/math] as the [math]\displaystyle{ \epsilon\,\! }[/math]-differential private algorithm we need to have [math]\displaystyle{ \lambda=1/\epsilon\,\! }[/math]. Though we have used Laplace noise here, other forms of noise, such as the Gaussian Noise, can be employed, but they may require a slight relaxation of the definition of differential privacy.[11]

其最多为 [math]\displaystyle{ e^{\frac{|f(D_{1})-f(D_{2})|}{\lambda}}\leq e^{\frac{\Delta(f)}{\lambda}}\,\! }[/math]。我们可以设[math]\displaystyle{ \frac{\Delta(f)}{\lambda}\,\! }[/math]为隐私因子[math]\displaystyle{ \epsilon\,\! }[/math]。因此,[math]\displaystyle{ \mathcal{T}\,\! }[/math]满足差分隐私机制的规则(从上面的定义可以看出)。如果尝试在我们的糖尿病例子中使用这个概念,那么从上述事实可以推出,为了使用[math]\displaystyle{ \mathcal{A}\,\! }[/math]作为[math]\displaystyle{ \epsilon\,\! }[/math]-差分隐私算法,我们需要满足[math]\displaystyle{ \lambda=1/\epsilon\,\! }[/math]。虽然我们在这里使用了拉普拉斯噪音,但是其他形式的噪音,如高斯噪音也是可使用的——但是可能需要稍微放宽差分隐私的定义。[11]

According to this definition, differential privacy is a condition on the release mechanism (i.e., the trusted party releasing information about the dataset) and not on the dataset itself. Intuitively, this means that for any two datasets that are similar, a given differentially private algorithm will behave approximately the same on both datasets. The definition gives a strong guarantee that presence or absence of an individual will not affect the final output of the algorithm significantly.

根据这个定义,差分隐私是发布机制的一个条件(例如,由可信方发布数据集的信息),而不是数据集本身。直观地说,这意味着对于任何两个相似的数据集,一个给定的差分隐私算法在两个数据集上的表现将大致相同。该定义强有力地保证了个体的存在与否不会对算法的最终输出产生重大影响。

For example, assume we have a database of medical records [math]\displaystyle{ D_1 }[/math] where each record is a pair (Name, X), where [math]\displaystyle{ X }[/math] is a Boolean denoting whether a person has diabetes or not. For example:

例如,假设我们有一个关于医疗记录的数据库[math]\displaystyle{ D_1 }[/math],其中每个记录是一对(Name, X),其中[math]\displaystyle{ X }[/math]是一个布尔值Boolean Algebra,表示一个人是否患有糖尿病。例如:

Name Has Diabetes (X)
Ross 1
Monica 1
Joey 0
Phoebe 0
Chandler 1
Rachel 0

Now suppose a malicious user (often termed an adversary) wants to find whether Chandler has diabetes or not. Suppose he also knows in which row of the database Chandler resides. Now suppose the adversary is only allowed to use a particular form of query [math]\displaystyle{ Q_i }[/math] that returns the partial sum of the first [math]\displaystyle{ i }[/math] rows of column [math]\displaystyle{ X }[/math] in the database. In order to find Chandler's diabetes status the adversary executes [math]\displaystyle{ Q_5(D_1) }[/math] and [math]\displaystyle{ Q_4(D_1) }[/math], then computes their difference. In this example, [math]\displaystyle{ Q_5(D_1) = 3 }[/math] and [math]\displaystyle{ Q_4(D_1) = 2 }[/math], so their difference is 1. This indicates that the "Has Diabetes" field in Chandler's row must be 1. This example highlights how individual information can be compromised even without explicitly querying for the information of a specific individual.

现在假设一个恶意用户(通常称为对手)想要查看 Chandler 是否患有糖尿病。假设他也知道 Chandler 在数据库的哪一行中。现在假设对手只被允许使用特定形式的查询[math]\displaystyle{ Q_i }[/math],该查询返回数据库中列[math]\displaystyle{ X }[/math]的第一个[math]\displaystyle{ i }[/math]行的部分和。为了查找Chandler的糖尿病状态,对手执行 [math]\displaystyle{ Q_5(D_1) }[/math][math]\displaystyle{ Q_4(D_1) }[/math] ,然后计算它们的差值。在这个例子中,[math]\displaystyle{ Q_5(D_1) = 3 }[/math][math]\displaystyle{ Q_4(D_1) = 2 }[/math],所以它们的差值是1。这表示 Chandler 行中的“Has Diabetes”字段必然为1。这个例子强调了即使不显式查询特定个人的信息,个人信息也可能被泄露。

Continuing this example, if we construct [math]\displaystyle{ D_2 }[/math] by replacing (Chandler, 1) with (Chandler, 0) then this malicious adversary will be able to distinguish [math]\displaystyle{ D_2 }[/math] from [math]\displaystyle{ D_1 }[/math] by computing [math]\displaystyle{ Q_5 - Q_4 }[/math] for each dataset. If the adversary were required to receive the values [math]\displaystyle{ Q_i }[/math] via an [math]\displaystyle{ \epsilon }[/math]-differentially private algorithm, for a sufficiently small [math]\displaystyle{ \epsilon }[/math], then he or she would be unable to distinguish between the two datasets.

继续讨论这个例子,如果我们用(Chandler, 0)替换(Chandler, 1)来构造[math]\displaystyle{ D_2 }[/math],那么这个恶意的对手将能够通过计算每个数据集的[math]\displaystyle{ Q_5 - Q_4 }[/math]来区分[math]\displaystyle{ D_2 }[/math][math]\displaystyle{ D_1 }[/math]。如果对手被要求通过一个 [math]\displaystyle{ \epsilon }[/math] 差分隐私算法对于一个足够小的[math]\displaystyle{ \epsilon }[/math]接收[math]\displaystyle{ Q_i }[/math]值,那么他或她将无法区分两个数据集。

Randomized response 随机应答

A simple example, especially developed in the social sciences,[14] is to ask a person to answer the question "Do you own the attribute A?", according to the following procedure:

ju一个简单的例子,尤其是在社会科学Social Sciences领域[14],这就是指让一个人遵从下列程序回答“你拥有属性A吗?”:

  1. Toss a coin.
  2. If heads, then toss the coin again (ignoring the outcome), and answer the question honestly.
  3. If tails, then toss the coin again and answer "Yes" if heads, "No" if tails.
  1. 抛硬币Toss a Coin
  2. 如果是正面,再掷一次硬币(忽略结果),并诚实地回答问题。
  3. 如果是反面,再掷一次硬币,如果是正面,回答“是”; 如果是反面,回答“否”。

(The seemingly redundant extra toss in the first case is needed in situations where just the act of tossing a coin may be observed by others, even if the actual result stays hidden.) The confidentiality then arises from the refutability of the individual responses.

(在第一种情况下,看似多余的额外投掷是必要的,因为在这种情况下,即使实际结果仍然隐藏着,仅仅是抛硬币的动作就可能被其他人观察到。)这种保密性来自于个人反应的可驳性Refutability

But, overall, these data with many responses are significant, since positive responses are given to a quarter by people who do not have the attribute A and three-quarters by people who actually possess it. Thus, if p is the true proportion of people with A, then we expect to obtain (1/4)(1-p) + (3/4)p = (1/4) + p/2 positive responses. Hence it is possible to estimate p.

但是,总的来说,这些有很多回答的数据是有意义的,因为有四分之一的人给出了肯定的回答,而有四分之三的人给出了真正拥有A属性的人的答案。因此,如果pA型人群的真实比例,那么我们期望得到(1/4)(1-p) + (3/4)p = (1/4) + p/2的积极反应。因此,我们可以估计p

In particular, if the attribute A is synonymous with illegal behavior, then answering "Yes" is not incriminating, insofar as the person has a probability of a "Yes" response, whatever it may be.

特别是,如果属性A是非法行为的同义词,在这个人有可能回答“是”的情况下,无论结果可能是什么,回答“是”并不意味着定罪。

Although this example, inspired by randomized response, might be applicable to microdata (i.e., releasing datasets with each individual response), by definition differential privacy excludes microdata releases and is only applicable to queries (i.e., aggregating individual responses into one result) as this would violate the requirements, more specifically the plausible deniability that a subject participated or not.[15][16]

虽然这个例子受到了随机应答Randomized Response的启发,可能适用于微数据Microdata(例如,发布每个响应的数据集),但根据定义,差分隐私排除了微数据的发布,并且只适用于查询(例如,将单个响应聚合成一个结果),因为这将违反其要求,更具体地说,是对一个主体是否参与的合理否认。[15][16] --~~

Stable transformations 稳定转换

A transformation [math]\displaystyle{ T }[/math] is [math]\displaystyle{ c }[/math]-stable if the Hamming distance between [math]\displaystyle{ T(A) }[/math] and [math]\displaystyle{ T(B) }[/math] is at most [math]\displaystyle{ c }[/math]-times the Hamming distance between [math]\displaystyle{ A }[/math] and [math]\displaystyle{ B }[/math] for any two databases [math]\displaystyle{ A,B }[/math]. Theorem 2 in [10] asserts that if there is a mechanism [math]\displaystyle{ M }[/math] that is [math]\displaystyle{ \epsilon }[/math]-differentially private, then the composite mechanism [math]\displaystyle{ M\circ T }[/math] is [math]\displaystyle{ (\epsilon \times c) }[/math]-differentially private.

对于任意两个数据库[math]\displaystyle{ A,B }[/math],如果[math]\displaystyle{ T(A) }[/math][math]\displaystyle{ T(B) }[/math])之间的汉明距离最多是[math]\displaystyle{ A }[/math][math]\displaystyle{ B }[/math]之间的汉明距离Hamming Distance[math]\displaystyle{ c }[/math]倍,则变换[math]\displaystyle{ T }[/math][math]\displaystyle{ c }[/math]-稳定的。文章[10]中的定理2指出,如果存在一个机制[math]\displaystyle{ M }[/math][math]\displaystyle{ \epsilon }[/math]-差分隐私的,那么复合机制[math]\displaystyle{ M\circ T }[/math]也是[math]\displaystyle{ (\epsilon \times c) }[/math]-差分隐私的。

This could be generalized to group privacy, as the group size could be thought of as the Hamming distance [math]\displaystyle{ h }[/math] between [math]\displaystyle{ A }[/math] and [math]\displaystyle{ B }[/math] (where [math]\displaystyle{ A }[/math] contains the group and [math]\displaystyle{ B }[/math] doesn't). In this case [math]\displaystyle{ M\circ T }[/math] is [math]\displaystyle{ (\epsilon \times c \times h) }[/math]-differentially private.

这可以推广到群组体隐私,因为群体大小可以被视为[math]\displaystyle{ A }[/math][math]\displaystyle{ B }[/math]之间的汉明距离[math]\displaystyle{ h }[/math](其中[math]\displaystyle{ A }[/math]包含群组,而[math]\displaystyle{ B }[/math]没有)。在这种情况下,[math]\displaystyle{ M\circ T }[/math][math]\displaystyle{ (\epsilon \times c \times h) }[/math]-差分隐私的。

现实世界中差分隐私算法的应用

Several uses of differential privacy in practice are known to date:

  • 2008: U.S. Census Bureau, for showing commuting patterns.[17]
  • 2014: Google's RAPPOR, for telemetry such as learning statistics about unwanted software hijacking users' settings. [18][19]
  • 2015: Google, for sharing historical traffic statistics.[20]
  • 2016: Apple announced its intention to use differential privacy in iOS 10 to improve its Intelligent personal assistant technology.[21]
  • 2017: Microsoft, for telemetry in Windows.[22]
  • 2019: Privitar Lens is an API using differential privacy.[23]
  • 2020: LinkedIn, for advertiser queries.[24]

在实践中,差分隐私的几个用途已经为人所知:

  • 2008:美国人口普查局,显示通勤模式。[17]
  • 2014:谷歌的 RAPPOR,用于遥测,例如了解不受欢迎的软件劫持用户设置的统计数据。[18][19]
  • 2015:Google,分享历史流量统计数据。[20]
  • 2016:苹果公司宣布打算在iOS 10中使用差分隐私来改进其智能个人助理技术。[21]
  • 2017:微软,Windows遥测系统。[22]
  • 2019:priveritar Lens是一个使用差分隐私技术的 API。[23]
  • 2020:LinkedIn,供广告商查询。[24]

公共目标的考虑

There are several public purpose considerations regarding differential privacy that are important to consider, especially for policymakers and policy-focused audiences interested in the social opportunities and risks of the technology:[25]

关于差分隐私技术,有几个重要的公共目标的问题需要考虑,尤其是对于那些对此技术的社会机遇和风险感兴趣的决策者和政策聚焦的受众而言:[25]

  • Data Utility & Accuracy. The main concern with differential privacy is the tradeoff between data utility and individual privacy. If the privacy loss parameter is set to favor utility, the privacy benefits are lowered (less “noise” is injected into the system); if the privacy loss parameter is set to favor heavy privacy, the accuracy and utility of the dataset are lowered (more “noise” is injected into the system). It is important for policymakers to consider the tradeoffs posed by differential privacy in order to help set appropriate best practices and standards around the use of this privacy preserving practice, especially considering the diversity in organizational use cases. It is worth noting, though, that decreased accuracy and utility is a common issue among all statistical disclosure limitation methods and is not unique to differential privacy. What is unique, however, is how policymakers, researchers, and implementers can consider mitigating against the risks presented through this tradeoff.
  • 数据的效用及准确性。差分隐私的主要关注点在于数据效用和个人隐私之间的权衡。如果将隐私损失参数设置为有利于实用性,则隐私性降低(向系统中加入的“噪音”较少); 如果将隐私损失参数设置为有利于隐私性,则数据集的准确性和效用降低(向系统中注入更多的“噪音”)。对于决策者来说,重要的是要考虑差分隐私的权衡,以帮助建立适当的最佳实践标准来使用这种隐私保护机制,特别是考虑到组织用例的多样性。但是值得注意的是,准确性和效用的降低是所有的限制统计数据泄露的方法的共同问题,而并非只属于差分隐私。然而,独特之处在于,决策者、研究人员和实施者可以考虑如何减轻这种权衡带来的风险。
  • Data Privacy & Security. Differential privacy provides a quantified measure of privacy loss and an upper bound and allows curators to choose the explicit tradeoff between privacy and accuracy. It is robust to still unknown privacy attacks. However, it encourages greater data sharing, which if done poorly, increases privacy risk. Differential privacy implies that privacy is protected, but this depends very much on the privacy loss parameter chosen and may instead lead to a false sense of security. Finally, though it is robust against unforeseen future privacy attacks, a countermeasure may be devised that we cannot predict.
  • 数据隐私及安全。差分隐私提供了一个量化的隐私损失度量和上限,并允许管理者在隐私和准确性之间做出明确的权衡。它对仍然未知的隐私攻击具有鲁棒性。然而,它鼓励更大的数据共享——如果做得不好,会增加隐私风险。差分隐私意味着隐私是受到保护的,但这在很大程度上取决于选择的隐私损失参数,并可能会导致对安全性的错误认识。最后,尽管它对未来不可预见的隐私攻击具有鲁棒性,但可能会有我们无法预测的对策被设计出来。

另见

参考文献

引用错误:Closing tag missing for <references>


Further reading 进一步阅读