第194行: |
第194行: |
| | | |
| | | |
− | [[File:Binary entropy plot.svg|thumbnail|right|200px|The entropy of a [[Bernoulli trial]] as a function of success probability, often called the {{em|[[binary entropy function]]}}, {{math|''H''<sub>b</sub>(''p'')}}. The entropy is maximized at 1 bit per trial when the two possible outcomes are equally probable, as in an unbiased coin toss.]] | + | [[File:Binary entropy plot.svg|thumbnail|right|200px|The entropy of a [[Bernoulli trial]] as a function of success probability, often called the {{em|[[binary entropy function]]}}, {{math|''H''<sub>b</sub>(''p'')}}. The entropy is maximized at 1 bit per trial when the two possible outcomes are equally probable, as in an unbiased coin toss. 伯努利试验的熵,作为成功概率的函数,通常被称为二值熵函数,{{math|''H''<sub>b</sub>(''p'')}}。当两个可能的结果等概率出现时(例如投掷一枚无偏硬币),每次试验的熵取得最大值1比特。]] |
| | | |
| | | |
第226行: |
第221行: |
| | | |
| 对于只有两种可能取值的随机变量,其信息熵的特殊情况是二值熵函数(通常取以2为底的对数,因此以香农(Sh)为单位): | | 对于只有两种可能取值的随机变量,其信息熵的特殊情况是二值熵函数(通常取以2为底的对数,因此以香农(Sh)为单位): |
| | | |
| :<math>H_{\mathrm{b}}(p) = - p \log_2 p - (1-p)\log_2 (1-p).</math> | | :<math>H_{\mathrm{b}}(p) = - p \log_2 p - (1-p)\log_2 (1-p).</math> |
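As an illustrative sketch (not part of the cited sources), the following Python snippet evaluates the binary entropy function {{math|''H''<sub>b</sub>(''p'')}} numerically and shows that it peaks at 1 shannon (bit) when ''p'' = 0.5; the function name <code>binary_entropy</code> and the sample values of ''p'' are arbitrary choices made for the demonstration.

<syntaxhighlight lang="python">
import math

def binary_entropy(p: float) -> float:
    """Binary entropy H_b(p) in shannons (bits), with the convention H_b(0) = H_b(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Entropy is maximal (1 bit per trial) for a fair coin and falls to 0 for a certain outcome.
for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"H_b({p}) = {binary_entropy(p):.4f} Sh")
</syntaxhighlight>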
| | | |
| | | |
− | ===Joint entropy=== | + | ===联合熵=== |
| | | |
| The {{em|[[joint entropy]]}} of two discrete random variables {{math|''X''}} and {{math|''Y''}} is merely the entropy of their pairing: {{math|(''X'', ''Y'')}}. This implies that if {{math|''X''}} and {{math|''Y''}} are [[statistical independence|independent]], then their joint entropy is the sum of their individual entropies. | | The {{em|[[joint entropy]]}} of two discrete random variables {{math|''X''}} and {{math|''Y''}} is merely the entropy of their pairing: {{math|(''X'', ''Y'')}}. This implies that if {{math|''X''}} and {{math|''Y''}} are [[statistical independence|independent]], then their joint entropy is the sum of their individual entropies. |
第240行: |
第231行: |
| | | |
− | 两个离散的随机变量{{math|''X''}}和{{math|''Y''}}的联合熵大致是它们的组合熵: {{math|(''X'', ''Y'')}}。若{{math|''X''}}和{{math|''Y''}}是独立的,那么它们的联合熵就是其各自熵的总和。 | + | 两个离散随机变量{{math|''X''}}和{{math|''Y''}}的联合熵就是它们配对{{math|(''X'', ''Y'')}}的熵。这意味着,若{{math|''X''}}和{{math|''Y''}}是独立的,那么它们的联合熵就是各自熵的总和。 |
| | | |
| | | |
第252行: |
第240行: |
| 例如:如果{{math|(''X'', ''Y'')}}代表一枚棋子的位置({{math|''X''}}表示行,{{math|''Y''}}表示列),那么棋子所在行与所在列的联合熵就是棋子位置的熵。 | | 例如:如果{{math|(''X'', ''Y'')}}代表一枚棋子的位置({{math|''X''}}表示行,{{math|''Y''}}表示列),那么棋子所在行与所在列的联合熵就是棋子位置的熵。 |
| | | |
| | | |
| :<math>H(X, Y) = \mathbb{E}_{X,Y} [-\log p(x,y)] = - \sum_{x, y} p(x, y) \log p(x, y) \,</math> | | :<math>H(X, Y) = \mathbb{E}_{X,Y} [-\log p(x,y)] = - \sum_{x, y} p(x, y) \log p(x, y) \,</math> |
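A minimal numerical sketch of this definition, using an arbitrary pair of independent marginals chosen only for illustration: it computes the joint entropy from a joint probability table and confirms that, under independence, it equals {{math|''H''(''X'') + ''H''(''Y'')}}.

<syntaxhighlight lang="python">
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a probability array (0 log 0 taken as 0)."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

# Joint pmf built from independent marginals, so H(X, Y) should equal H(X) + H(Y).
p_x = np.array([0.5, 0.25, 0.25])
p_y = np.array([0.5, 0.5])
p_xy = np.outer(p_x, p_y)                       # p(x, y) = p(x) p(y) under independence

print(entropy_bits(p_xy))                       # H(X, Y) = 2.5 bits
print(entropy_bits(p_x) + entropy_bits(p_y))    # H(X) + H(Y) = 2.5 bits as well
</syntaxhighlight>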
| | | |
| Despite similar notation, joint entropy should not be confused with {{em|[[cross entropy]]}}. | | Despite similar notation, joint entropy should not be confused with {{em|[[cross entropy]]}}. |
第273行: |
第250行: |
| | | |
| | | |
− | | + | ===条件熵(含糊度)=== |
| | | |
| The {{em|[[conditional entropy]]}} or ''conditional uncertainty'' of {{math|''X''}} given random variable {{math|''Y''}} (also called the ''equivocation'' of {{math|''X''}} about {{math|''Y''}}) is the average conditional entropy over {{math|''Y''}}:<ref name=Ash>{{cite book | title = Information Theory | author = Robert B. Ash | publisher = Dover Publications, Inc. | origyear = 1965| year = 1990 | isbn = 0-486-66521-6 | url = https://books.google.com/books?id=ngZhvUfF0UIC&pg=PA16&dq=intitle:information+intitle:theory+inauthor:ash+conditional+uncertainty}}</ref> | | The {{em|[[conditional entropy]]}} or ''conditional uncertainty'' of {{math|''X''}} given random variable {{math|''Y''}} (also called the ''equivocation'' of {{math|''X''}} about {{math|''Y''}}) is the average conditional entropy over {{math|''Y''}}:<ref name=Ash>{{cite book | title = Information Theory | author = Robert B. Ash | publisher = Dover Publications, Inc. | origyear = 1965| year = 1990 | isbn = 0-486-66521-6 | url = https://books.google.com/books?id=ngZhvUfF0UIC&pg=PA16&dq=intitle:information+intitle:theory+inauthor:ash+conditional+uncertainty}}</ref> |
| | | |
− | 在给定随机变量{{math|''Y''}}下{{math|''X''}}的条件熵(或条件不确定性,也可称为{{math|''X''}}关于{{math|''Y''}}的含糊度))是{{math|''Y''}}上的条件平均熵: <ref name=Ash>{{cite book | title = Information Theory | author = Robert B. Ash | publisher = Dover Publications, Inc. | origyear = 1965| year = 1990 | isbn = 0-486-66521-6 | url = https://books.google.com/books?id=ngZhvUfF0UIC&pg=PA16&dq=intitle:information+intitle:theory+inauthor:ash+conditional+uncertainty}}</ref>
| |
| | | |
− | | + | 在给定随机变量{{math|''Y''}}的条件下,{{math|''X''}}的条件熵(或条件不确定性,也称为{{math|''X''}}关于{{math|''Y''}}的含糊度)是对{{math|''Y''}}取平均的条件熵: <ref name=Ash>{{cite book | title = Information Theory | author = Robert B. Ash | publisher = Dover Publications, Inc. | origyear = 1965| year = 1990 | isbn = 0-486-66521-6 | url = https://books.google.com/books?id=ngZhvUfF0UIC&pg=PA16&dq=intitle:information+intitle:theory+inauthor:ash+conditional+uncertainty}}</ref> |
| | | |
| :<math> H(X|Y) = \mathbb E_Y [H(X|y)] = -\sum_{y \in Y} p(y) \sum_{x \in X} p(x|y) \log p(x|y) = -\sum_{x,y} p(x,y) \log p(x|y).</math> | | :<math> H(X|Y) = \mathbb E_Y [H(X|y)] = -\sum_{y \in Y} p(y) \sum_{x \in X} p(x|y) \log p(x|y) = -\sum_{x,y} p(x,y) \log p(x|y).</math> |
| | | |
| Because entropy can be conditioned on a random variable or on that random variable being a certain value, care should be taken not to confuse these two definitions of conditional entropy, the former of which is in more common use. A basic property of this form of conditional entropy is that: | | Because entropy can be conditioned on a random variable or on that random variable being a certain value, care should be taken not to confuse these two definitions of conditional entropy, the former of which is in more common use. A basic property of this form of conditional entropy is that: |
第307行: |
第265行: |
| | | |
| 由于熵既能以一个随机变量为条件,也能以该随机变量的某个取值为条件,所以应注意不要混淆条件熵的这两种定义(前者更为常用)。这种形式的条件熵的一个基本性质是: | | 由于熵既能以一个随机变量为条件,也能以该随机变量的某个取值为条件,所以应注意不要混淆条件熵的这两种定义(前者更为常用)。这种形式的条件熵的一个基本性质是: |
| | | |
| : <math> H(X|Y) = H(X,Y) - H(Y) .\,</math> | | : <math> H(X|Y) = H(X,Y) - H(Y) .\,</math> |
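A small sketch of both relations, using an arbitrary illustrative joint distribution: it evaluates the conditional entropy directly from the definition and again via the identity above, and the two values coincide.

<syntaxhighlight lang="python">
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits (0 log 0 taken as 0)."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

# An arbitrary joint pmf p(x, y); rows index x, columns index y.
p_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])
p_y = p_xy.sum(axis=0)                         # marginal p(y)

# Definition: H(X|Y) = -sum_{x,y} p(x,y) log2 p(x|y), with p(x|y) = p(x,y) / p(y).
h_direct = float(-np.sum(p_xy * np.log2(p_xy / p_y)))

# Chain-rule identity: H(X|Y) = H(X,Y) - H(Y).
h_chain = entropy_bits(p_xy) - entropy_bits(p_y)

print(h_direct, h_chain)                       # the two numbers agree
</syntaxhighlight>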
| | | |
| | | |
− | | + | ===互信息(转移信息)=== |
| | | |
| ''[[Mutual information]]'' measures the amount of information that can be obtained about one random variable by observing another. It is important in communication where it can be used to maximize the amount of information shared between sent and received signals. The mutual information of {{math|''X''}} relative to {{math|''Y''}} is given by: | | ''[[Mutual information]]'' measures the amount of information that can be obtained about one random variable by observing another. It is important in communication where it can be used to maximize the amount of information shared between sent and received signals. The mutual information of {{math|''X''}} relative to {{math|''Y''}} is given by: |
第332行: |
第275行: |
| | | |
− | 互信息度量的是通过观察另一个随机变量可以获得的信息量。在通信中可以用它来最大化发送和接收信号之间共享的信息量,这一点至关重要。{{math|''X''}}相对于{{math|''X''}}的互信息由以下公式给出:
| + | 互信息度量的是通过观察一个随机变量所能获得的关于另一个随机变量的信息量。它在通信中十分重要,可以用来最大化发送信号与接收信号之间共享的信息量。{{math|''X''}}相对于{{math|''Y''}}的互信息由以下公式给出: |
| | | |
| :<math>I(X;Y) = \mathbb{E}_{X,Y} [SI(x,y)] = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\, p(y)}</math> | | :<math>I(X;Y) = \mathbb{E}_{X,Y} [SI(x,y)] = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\, p(y)}</math> |
第349行: |
第285行: |
| | | |
| 其中{{math|SI}}(Specific Mutual Information,特定互信息)是逐点的互信息。 | | 其中{{math|SI}}(Specific Mutual Information,特定互信息)是逐点的互信息。 |
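A numerical sketch of this definition on an arbitrary illustrative joint distribution: it evaluates the double sum directly, i.e. the expectation of the pointwise term {{math|SI(''x'', ''y'')}}.

<syntaxhighlight lang="python">
import numpy as np

# An illustrative joint pmf p(x, y); rows index x, columns index y.
p_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])
p_x = p_xy.sum(axis=1)                   # marginal p(x)
p_y = p_xy.sum(axis=0)                   # marginal p(y)

# I(X;Y) = sum_{x,y} p(x,y) log2[ p(x,y) / (p(x) p(y)) ], the average of the
# pointwise (specific) mutual information SI(x, y).
si = np.log2(p_xy / np.outer(p_x, p_y))
mi_bits = float(np.sum(p_xy * si))
print(mi_bits)                           # > 0 here, since X and Y are dependent
</syntaxhighlight>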
| | | |
| | | |
第358行: |
第291行: |
| A basic property of the mutual information is that | | A basic property of the mutual information is that |
| | | |
− | 互信息的一个基本属性为:
| + | 互信息的一个基本性质是: |
| | | |
| : <math>I(X;Y) = H(X) - H(X|Y).\,</math> | | : <math>I(X;Y) = H(X) - H(X|Y).\,</math> |
| | | |
| | | |
| That is, knowing ''Y'', we can save an average of {{math|''I''(''X''; ''Y'')}} bits in encoding ''X'' compared to not knowing ''Y''. | | That is, knowing ''Y'', we can save an average of {{math|''I''(''X''; ''Y'')}} bits in encoding ''X'' compared to not knowing ''Y''. |
第370行: |
第300行: |
| | | |
− | 也就是说,在编码的过程中知道''Y''比不知道''Y''平均节省{{math|''I''(''X''; ''Y'')}}比特。
| + | 也就是说,在编码''X''的过程中,知道''Y''比不知道''Y''平均节省{{math|''I''(''X''; ''Y'')}}比特。 |
| | | |
| | | |
第383行: |
第310行: |
| | | |
| : <math>I(X;Y) = I(Y;X) = H(X) + H(Y) - H(X,Y).\,</math> | | : <math>I(X;Y) = I(Y;X) = H(X) + H(Y) - H(X,Y).\,</math> |
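These identities are easy to check numerically; the sketch below, on an arbitrary illustrative joint distribution, verifies that the directly computed mutual information matches {{math|''H''(''X'') + ''H''(''Y'') − ''H''(''X'',''Y'')}} as well as the two conditional-entropy forms.

<syntaxhighlight lang="python">
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits (0 log 0 taken as 0)."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

p_xy = np.array([[0.30, 0.10],                     # illustrative joint pmf p(x, y)
                 [0.20, 0.40]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

mi = float(np.sum(p_xy * np.log2(p_xy / np.outer(p_x, p_y))))
h_x, h_y, h_xy = entropy_bits(p_x), entropy_bits(p_y), entropy_bits(p_xy)

print(np.isclose(mi, h_x + h_y - h_xy))            # I(X;Y) = H(X) + H(Y) - H(X,Y)
print(np.isclose(mi, h_x - (h_xy - h_y)))          # I(X;Y) = H(X) - H(X|Y)
print(np.isclose(mi, h_y - (h_xy - h_x)))          # I(Y;X) = H(Y) - H(Y|X) (symmetry)
</syntaxhighlight>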
| | | |
| | | |
第396行: |
第316行: |
| Mutual information can be expressed as the average Kullback–Leibler divergence (information gain) between the posterior probability distribution of X given the value of Y and the prior distribution on X: | | Mutual information can be expressed as the average Kullback–Leibler divergence (information gain) between the posterior probability distribution of X given the value of Y and the prior distribution on X: |
| | | |
− | 互信息可以表示为在给定''Y''值和''X''的后验分布的情况下,''X''的后验概率之间的平均 Kullback-Leibler 散度(信息增益) : | + | 互信息可以表示为在给定''Y''值的情况下''X''的后验概率分布与''X''的先验分布之间的平均 Kullback-Leibler 散度(信息增益): |
| | | |
| : <math>I(X;Y) = \mathbb E_{p(y)} [D_{\mathrm{KL}}( p(X|Y=y) \| p(X) )].</math> | | : <math>I(X;Y) = \mathbb E_{p(y)} [D_{\mathrm{KL}}( p(X|Y=y) \| p(X) )].</math> |
| | | |
| | | |
| In other words, this is a measure of how much, on the average, the probability distribution on ''X'' will change if we are given the value of ''Y''. This is often recalculated as the divergence from the product of the marginal distributions to the actual joint distribution: | | In other words, this is a measure of how much, on the average, the probability distribution on ''X'' will change if we are given the value of ''Y''. This is often recalculated as the divergence from the product of the marginal distributions to the actual joint distribution: |
第408行: |
第325行: |
| | | |
− | 换句话说这是一种度量方法:若我们给出''Y''的值,得出''X''上的概率分布将会平均变化多少。这通常用于计算边际分布的乘积与实际联合分布的差异:
| + | 换句话说,这度量的是:如果我们得知''Y''的值,''X''上的概率分布平均会改变多少。它通常被重新表述为从边缘分布的乘积到实际联合分布的散度: |
| | | |
| : <math>I(X; Y) = D_{\mathrm{KL}}(p(X,Y) \| p(X)p(Y)).</math> | | : <math>I(X; Y) = D_{\mathrm{KL}}(p(X,Y) \| p(X)p(Y)).</math> |
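Both Kullback–Leibler characterizations can be checked with a short sketch (again on an arbitrary illustrative joint distribution): averaging the divergence between the posterior of ''X'' given each value of ''Y'' and the prior ''p''(''X''), and taking the divergence from the product of the marginals to the joint, give the same number.

<syntaxhighlight lang="python">
import numpy as np

def kl_bits(p, q):
    """Kullback-Leibler divergence D_KL(p || q) in bits; assumes q > 0 wherever p > 0."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / q[nz])))

p_xy = np.array([[0.30, 0.10],                     # illustrative joint pmf p(x, y)
                 [0.20, 0.40]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

# Form 1: average KL divergence between the posterior p(X|Y=y) and the prior p(X).
mi_posterior = sum(p_y[j] * kl_bits(p_xy[:, j] / p_y[j], p_x) for j in range(len(p_y)))

# Form 2: KL divergence from the product of the marginals to the actual joint distribution.
mi_product = kl_bits(p_xy, np.outer(p_x, p_y))

print(mi_posterior, mi_product)                    # identical, and equal to I(X;Y)
</syntaxhighlight>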
| | | |
| | | |
第424行: |
第334行: |
| Mutual information is closely related to the log-likelihood ratio test in the context of contingency tables and the multinomial distribution and to Pearson's χ<sup>2</sup> test: mutual information can be considered a statistic for assessing independence between a pair of variables, and has a well-specified asymptotic distribution. | | Mutual information is closely related to the log-likelihood ratio test in the context of contingency tables and the multinomial distribution and to Pearson's χ<sup>2</sup> test: mutual information can be considered a statistic for assessing independence between a pair of variables, and has a well-specified asymptotic distribution. |
| | | |
− | 互信息与列联表中的似然比检验和多项分布以及皮尔森卡方检验密切相关: 互信息可以视为评估一对变量之间独立性的统计量,并且具有明确指定的渐近分布。
| + | 在列联表和多项分布的背景下,互信息与对数似然比检验以及皮尔森χ<sup>2</sup>检验密切相关:互信息可以视为评估一对变量之间独立性的统计量,并且具有明确的渐近分布。 |
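The connection to the log-likelihood-ratio (G) statistic can be illustrated numerically: for a contingency table of counts with total ''N'', the G statistic equals 2''N'' times the mutual information (in nats) of the empirical joint distribution. The 2×2 table below is an arbitrary example with no zero cells, chosen only for the demonstration.

<syntaxhighlight lang="python">
import numpy as np

# Observed counts in a contingency table (rows: one variable, columns: the other).
counts = np.array([[30.0, 10.0],
                   [20.0, 40.0]])
n = counts.sum()

# Mutual information (in nats) of the empirical joint distribution of the table.
p_xy = counts / n
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
mi_nats = float(np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y))))

# Log-likelihood-ratio (G) statistic computed the usual way from expected counts.
expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / n
g_stat = 2.0 * float(np.sum(counts * np.log(counts / expected)))

print(np.isclose(g_stat, 2.0 * n * mi_nats))       # True: G = 2 N I(X;Y) when I is in nats
</syntaxhighlight>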
| | | |
| | | |
− | ===Kullback–Leibler divergence (information gain)=== | + | ===Kullback-Leibler 散度(信息增益)=== |
| | | |
| The ''[[Kullback–Leibler divergence]]'' (or ''information divergence'', ''information gain'', or ''relative entropy'') is a way of comparing two distributions: a "true" [[probability distribution]] ''p(X)'', and an arbitrary probability distribution ''q(X)''. If we compress data in a manner that assumes ''q(X)'' is the distribution underlying some data, when, in reality, ''p(X)'' is the correct distribution, the Kullback–Leibler divergence is the number of average additional bits per datum necessary for compression. It is thus defined | | The ''[[Kullback–Leibler divergence]]'' (or ''information divergence'', ''information gain'', or ''relative entropy'') is a way of comparing two distributions: a "true" [[probability distribution]] ''p(X)'', and an arbitrary probability distribution ''q(X)''. If we compress data in a manner that assumes ''q(X)'' is the distribution underlying some data, when, in reality, ''p(X)'' is the correct distribution, the Kullback–Leibler divergence is the number of average additional bits per datum necessary for compression. It is thus defined |
第441行: |
第344行: |
| | | |
| Kullback-Leibler 散度(或称信息散度、信息增益、相对熵)是比较两个分布的一种方法:一个是“真实的”概率分布''p(X)'',另一个是任意的概率分布''q(X)''。如果我们在压缩数据时假设''q(X)''是数据背后的分布,而实际上''p(X)''才是正确的分布,那么 Kullback-Leibler 散度就是压缩时每个数据平均所需的额外比特数。因此其定义为: | | Kullback-Leibler 散度(或称信息散度、信息增益、相对熵)是比较两个分布的一种方法:一个是“真实的”概率分布''p(X)'',另一个是任意的概率分布''q(X)''。如果我们在压缩数据时假设''q(X)''是数据背后的分布,而实际上''p(X)''才是正确的分布,那么 Kullback-Leibler 散度就是压缩时每个数据平均所需的额外比特数。因此其定义为: |
| | | |
| :<math>D_{\mathrm{KL}}(p(X) \| q(X)) = \sum_{x \in X} -p(x) \log {q(x)} \, - \, \sum_{x \in X} -p(x) \log {p(x)} = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}.</math> | | :<math>D_{\mathrm{KL}}(p(X) \| q(X)) = \sum_{x \in X} -p(x) \log {q(x)} \, - \, \sum_{x \in X} -p(x) \log {p(x)} = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}.</math> |
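As a sketch of the "extra bits" reading of this formula (with an arbitrary true distribution ''p'' and assumed distribution ''q'' chosen for illustration): coding for ''q'' costs the cross-entropy per symbol, coding for the true ''p'' costs its entropy, and the difference is exactly the Kullback–Leibler divergence.

<syntaxhighlight lang="python">
import numpy as np

p = np.array([0.5, 0.25, 0.25])    # the "true" distribution p(X)
q = np.array([1/3, 1/3, 1/3])      # the assumed (incorrect) distribution q(X)

cross_entropy = float(-np.sum(p * np.log2(q)))   # average code length when coding for q
entropy_p     = float(-np.sum(p * np.log2(p)))   # optimal average code length for p
kl            = float(np.sum(p * np.log2(p / q)))

print(cross_entropy - entropy_p)   # extra bits per datum caused by assuming q ...
print(kl)                          # ... which is exactly D_KL(p || q)
</syntaxhighlight>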
| | | |
| | | |
第460行: |
第352行: |
| Although it is sometimes used as a 'distance metric', KL divergence is not a true metric since it is not symmetric and does not satisfy the triangle inequality (making it a semi-quasimetric). | | Although it is sometimes used as a 'distance metric', KL divergence is not a true metric since it is not symmetric and does not satisfy the triangle inequality (making it a semi-quasimetric). |
| | | |
− | 尽管有时会将KL散度用作距离量度但它并不是一个真正的指标,因为它是不对称的,同时也不满足三角不等式(KL散度为一个半准度量)。 | + | 尽管KL散度有时被用作一种“距离度量”,但它并不是真正的度量,因为它不对称,也不满足三角不等式(因此它是一种半拟度量)。 |
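The lack of symmetry is easy to see on a concrete pair of distributions (chosen arbitrarily for illustration): the divergence from a heavily biased coin to a fair one differs from the divergence in the opposite direction.

<syntaxhighlight lang="python">
import numpy as np

def kl_bits(p, q):
    """D_KL(p || q) in bits; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / q[nz])))

p = np.array([0.9, 0.1])           # a heavily biased coin
q = np.array([0.5, 0.5])           # a fair coin

print(kl_bits(p, q))               # ~0.531 bits
print(kl_bits(q, p))               # ~0.737 bits: the two directions differ, so KL is not symmetric
</syntaxhighlight>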
| | | |
| | | |
第470行: |
第359行: |
| Another interpretation of the KL divergence is the "unnecessary surprise" introduced by a prior from the truth: suppose a number X is about to be drawn randomly from a discrete set with probability distribution p(x). If Alice knows the true distribution p(x), while Bob believes (has a prior) that the distribution is q(x), then Bob will be more surprised than Alice, on average, upon seeing the value of X. The KL divergence is the (objective) expected value of Bob's (subjective) surprisal minus Alice's surprisal, measured in bits if the log is in base 2. In this way, the extent to which Bob's prior is "wrong" can be quantified in terms of how "unnecessarily surprised" it is expected to make him. | | Another interpretation of the KL divergence is the "unnecessary surprise" introduced by a prior from the truth: suppose a number X is about to be drawn randomly from a discrete set with probability distribution p(x). If Alice knows the true distribution p(x), while Bob believes (has a prior) that the distribution is q(x), then Bob will be more surprised than Alice, on average, upon seeing the value of X. The KL divergence is the (objective) expected value of Bob's (subjective) surprisal minus Alice's surprisal, measured in bits if the log is in base 2. In this way, the extent to which Bob's prior is "wrong" can be quantified in terms of how "unnecessarily surprised" it is expected to make him. |
| | | |
− | KL散度的另一种解释是先验者从事实中引入的“不必要的惊喜”。假设将从概率分布为“ p(x)”的离散集合中随机抽取数字“ X”,如果Alice知道真实的分布“p(x)”,而Bob认为“q(x)”具有先验概率分布,那么Bob将比Alice更多次看到“X”的值,Bob也将具有更多的信息内容。KL散度就是Bob意外惊喜的期望值减去Alice意外惊喜的期望值(如果对数以2为底,则以比特为单位),这样Bob的先验是“错误的”程度可以根据他变得“不必要的惊讶”的期望来进行量化。
| + | KL散度的另一种解释是由偏离事实的先验引入的“不必要的惊讶”:假设即将从一个概率分布为''p(x)''的离散集合中随机抽取一个数''X''。如果Alice知道真实分布''p(x)'',而Bob(根据其先验)认为分布是''q(x)'',那么在看到''X''的取值时,平均而言Bob会比Alice更惊讶。KL散度就是Bob的(主观)惊讶程度减去Alice的惊讶程度的(客观)期望值(若对数以2为底,则以比特为单位)。这样,Bob的先验“错误”的程度,就可以用预期它使他产生多少“不必要的惊讶”来量化。 |
| | | |
| | | |
− | 其他度量 | + | ===其他度量=== |
| | | |
| Other important information theoretic quantities include [[Rényi entropy]] (a generalization of entropy), [[differential entropy]] (a generalization of quantities of information to continuous distributions), and the [[conditional mutual information]]. | | Other important information theoretic quantities include [[Rényi entropy]] (a generalization of entropy), [[differential entropy]] (a generalization of quantities of information to continuous distributions), and the [[conditional mutual information]]. |