Relative entropy


In mathematical statistics, the Kullback–Leibler divergence (also called relative entropy) is a measure of how one probability distribution is different from a second, reference probability distribution.[1][2] Applications include characterizing the relative (Shannon) entropy in information systems, randomness in continuous time-series, and information gain when comparing statistical models of inference. In contrast to variation of information, it is a distribution-wise asymmetric measure and thus does not qualify as a statistical metric of spread - it also does not satisfy the triangle inequality. In the simple case, a Kullback–Leibler divergence of 0 indicates that the two distributions in question are identical. In simplified terms, it is a measure of surprise, with diverse applications such as applied statistics, fluid mechanics, neuroscience and machine learning.



Introduction and context

Consider two probability distributions [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math]. Usually, [math]\displaystyle{ P }[/math] represents the data, the observations, or a precisely measured probability distribution. The distribution [math]\displaystyle{ Q }[/math], by contrast, represents a theory, a model, a description, or an approximation of [math]\displaystyle{ P }[/math]. The Kullback–Leibler divergence is then interpreted as the average difference in the number of bits required for encoding samples of [math]\displaystyle{ P }[/math] using a code optimized for [math]\displaystyle{ Q }[/math] rather than one optimized for [math]\displaystyle{ P }[/math].



Etymology


The Kullback–Leibler divergence was introduced by Solomon Kullback and Richard Leibler in 1951 as the directed divergence between two distributions; Kullback preferred the term discrimination information.[3] The divergence is discussed in Kullback's 1959 book, Information Theory and Statistics.[2]



Definition

For discrete probability distributions [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math] defined on the same probability space, [math]\displaystyle{ \mathcal{X} }[/math], the Kullback–Leibler divergence from [math]\displaystyle{ Q }[/math] to [math]\displaystyle{ P }[/math] is defined[4] to be



[math]\displaystyle{ D_\text{KL}(P \parallel Q) = \sum_{x\in\mathcal{X}} P(x) \log\left(\frac{P(x)}{Q(x)}\right). }[/math]



which is equivalent to



[math]\displaystyle{ D_\text{KL}(P \parallel Q) = -\sum_{x\in\mathcal{X}} P(x) \log\left(\frac{Q(x)}{P(x)}\right) }[/math]



In other words, it is the expectation of the logarithmic difference between the probabilities [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math], where the expectation is taken using the probabilities [math]\displaystyle{ P }[/math]. The Kullback–Leibler divergence is defined only if for all [math]\displaystyle{ x }[/math], [math]\displaystyle{ Q(x) = 0 }[/math] implies [math]\displaystyle{ P(x) = 0 }[/math] (absolute continuity). Whenever [math]\displaystyle{ P(x) }[/math] is zero the contribution of the corresponding term is interpreted as zero because



[math]\displaystyle{ \lim_{x \to 0^{+}} x \log(x) = 0. }[/math]

The logarithms in these formulae are taken to base 2 if information is measured in units of bits, or to base [math]\displaystyle{ e }[/math] if information is measured in nats. Most formulas involving the Kullback–Leibler divergence hold regardless of the base of the logarithm.



For distributions [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math] of a continuous random variable, the Kullback–Leibler divergence is defined to be the integral:[5]:p. 55



[math]\displaystyle{ D_\text{KL}(P \parallel Q) = \int_{-\infty}^\infty p(x) \log\left(\frac{p(x)}{q(x)}\right)\, dx }[/math]


where [math]\displaystyle{ p }[/math] and [math]\displaystyle{ q }[/math] denote the probability densities of [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math].
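For discrete distributions the definition above can be evaluated directly. The following is a minimal numerical sketch (the probability vectors and the helper name kl_divergence are illustrative choices, not part of the original text):

```python
import numpy as np

def kl_divergence(p, q, base=None):
    """D_KL(P || Q) for discrete distributions given as arrays of probabilities.

    Terms with p == 0 contribute zero; if q == 0 anywhere that p > 0,
    absolute continuity fails and the divergence is infinite.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    d = np.sum(p[mask] * np.log(p[mask] / q[mask]))  # natural log: result in nats
    if base is not None:
        d /= np.log(base)                            # e.g. base=2 converts to bits
    return d

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))          # in nats
print(kl_divergence(p, q, base=2))  # in bits
```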



More generally, if [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math] are probability measures over a set [math]\displaystyle{ {\mathcal{X}} }[/math], and [math]\displaystyle{ P }[/math] is absolutely continuous with respect to [math]\displaystyle{ Q }[/math], then the Kullback–Leibler divergence from [math]\displaystyle{ Q }[/math] to [math]\displaystyle{ P }[/math] is defined as



[math]\displaystyle{ D_\text{KL}(P \parallel Q) = \int_{\mathcal{X}} \log\left(\frac{dP}{dQ}\right)\, dP, }[/math]



where [math]\displaystyle{ \frac{dP}{dQ} }[/math] is the Radon–Nikodym derivative of [math]\displaystyle{ P }[/math] with respect to [math]\displaystyle{ Q }[/math], and provided the expression on the right-hand side exists. Equivalently (by the chain rule), this can be written as


[math]\displaystyle{ D_\text{KL}(P \parallel Q) = \int_{\mathcal{X}} \log\left(\frac{dP}{dQ}\right) \frac{dP}{dQ}\, dQ, }[/math]



which is the entropy of [math]\displaystyle{ Q }[/math] relative to [math]\displaystyle{ P }[/math]. Continuing in this case, if [math]\displaystyle{ \mu }[/math] is any measure on [math]\displaystyle{ \mathcal{X} }[/math] for which [math]\displaystyle{ p = \frac{dP}{d\mu} }[/math] and [math]\displaystyle{ q = \frac{dQ}{d\mu} }[/math] exist (meaning that [math]\displaystyle{ p }[/math] and [math]\displaystyle{ q }[/math] are absolutely continuous with respect to [math]\displaystyle{ \mu }[/math]), then the Kullback–Leibler divergence from [math]\displaystyle{ Q }[/math] to [math]\displaystyle{ P }[/math] is given as



[math]\displaystyle{ D_\text{KL}(P \parallel Q) = \int_{\mathcal{X}} p \log\left(\frac{p}{q}\right)\, d\mu. }[/math]


Various conventions exist for referring to [math]\displaystyle{ D_\text{KL}(P \parallel Q) }[/math] in words. Often it is referred to as the divergence between [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math], but this fails to convey the fundamental asymmetry in the relation. Sometimes, as in this article, it may be described as the divergence of [math]\displaystyle{ P }[/math] from [math]\displaystyle{ Q }[/math] or as the divergence from [math]\displaystyle{ Q }[/math] to [math]\displaystyle{ P }[/math]. This reflects the asymmetry in Bayesian inference, which starts from a prior [math]\displaystyle{ Q }[/math] and updates to the posterior [math]\displaystyle{ P }[/math]. Another common way to refer to [math]\displaystyle{ D_\text{KL}(P \parallel Q) }[/math] is as the relative entropy of [math]\displaystyle{ P }[/math] with respect to [math]\displaystyle{ Q }[/math].



Basic example


Kullback[2] gives the following example (Table 2.1, Example 2.1). Let [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math] be the distributions shown in the table and figure. [math]\displaystyle{ P }[/math] is the distribution on the left side of the figure, a binomial distribution with [math]\displaystyle{ N = 2 }[/math] and [math]\displaystyle{ p = 0.4 }[/math]. [math]\displaystyle{ Q }[/math] is the distribution on the right side of the figure, a discrete uniform distribution with the three possible outcomes [math]\displaystyle{ x = 0 }[/math], [math]\displaystyle{ 1 }[/math], or [math]\displaystyle{ 2 }[/math] (i.e. [math]\displaystyle{ \mathcal{X}=\{0,1,2\} }[/math]), each with probability [math]\displaystyle{ p = 1/3 }[/math].



Two distributions to illustrate Kullback–Leibler divergence

x                    0                                     1                                      2
Distribution P(x)    [math]\displaystyle{ 9/25 }[/math]    [math]\displaystyle{ 12/25 }[/math]    [math]\displaystyle{ 4/25 }[/math]
Distribution Q(x)    [math]\displaystyle{ 1/3 }[/math]     [math]\displaystyle{ 1/3 }[/math]      [math]\displaystyle{ 1/3 }[/math]

The KL divergences [math]\displaystyle{ D_\text{KL}(P \parallel Q) }[/math] and [math]\displaystyle{ D_\text{KL}(Q \parallel P) }[/math] are calculated as follows. This example uses the natural log with base e, designated [math]\displaystyle{ \operatorname{ln} }[/math] to get results in nats (see units of information).


[math]\displaystyle{ \begin{align}
D_\text{KL}(P \parallel Q) &= \sum_{x\in\mathcal{X}} P(x) \ln\left(\frac{P(x)}{Q(x)}\right) \\
&= \frac{9}{25} \ln\left(\frac{9/25}{1/3}\right)
 + \frac{12}{25} \ln\left(\frac{12/25}{1/3}\right)
 + \frac{4}{25} \ln\left(\frac{4/25}{1/3}\right) \\
&= \frac{1}{25} \left(32 \ln(2) + 55 \ln(3) - 50 \ln(5) \right) \approx 0.0852996
\end{align} }[/math]



[math]\displaystyle{ \begin{align}
D_\text{KL}(Q \parallel P) &= \sum_{x\in\mathcal{X}} Q(x) \ln\left(\frac{Q(x)}{P(x)}\right) \\
&= \frac{1}{3} \ln\left(\frac{1/3}{9/25}\right) + \frac{1}{3} \ln\left(\frac{1/3}{12/25}\right) + \frac{1}{3} \ln\left(\frac{1/3}{4/25}\right) \\
&= \frac{1}{3} \left(-4 \ln(2) - 6 \ln(3) + 6 \ln(5) \right) \approx 0.097455
\end{align} }[/math]
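The two values can be reproduced numerically; a small check (the code is ours, not part of Kullback's presentation):

```python
import numpy as np

P = np.array([9/25, 12/25, 4/25])   # binomial with N = 2 and p = 0.4
Q = np.array([1/3, 1/3, 1/3])       # uniform on {0, 1, 2}

d_pq = np.sum(P * np.log(P / Q))    # natural log, so the results are in nats
d_qp = np.sum(Q * np.log(Q / P))

print(d_pq)  # ≈ 0.0852996
print(d_qp)  # ≈ 0.097455
```

The difference between the two values illustrates the asymmetry of the divergence.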

Interpretations

The Kullback–Leibler divergence from [math]\displaystyle{ Q }[/math] to [math]\displaystyle{ P }[/math] is often denoted [math]\displaystyle{ D_\text{KL}(P \parallel Q) }[/math].


In the context of machine learning, [math]\displaystyle{ D_\text{KL}(P \parallel Q) }[/math] is often called the information gain achieved if [math]\displaystyle{ Q }[/math] is used instead of [math]\displaystyle{ P }[/math]. By analogy with information theory, it is also called the relative entropy of [math]\displaystyle{ P }[/math] with respect to [math]\displaystyle{ Q }[/math]. In the context of coding theory, [math]\displaystyle{ D_\text{KL}(P \parallel Q) }[/math] can be constructed by measuring the expected number of extra bits required to code samples from [math]\displaystyle{ P }[/math] using a code optimized for [math]\displaystyle{ Q }[/math] rather than the code optimized for [math]\displaystyle{ P }[/math].


Expressed in the language of Bayesian inference, [math]\displaystyle{ D_\text{KL}(P \parallel Q) }[/math] is a measure of the information gained by revising one's beliefs from the prior probability distribution [math]\displaystyle{ Q }[/math] to the posterior probability distribution [math]\displaystyle{ P }[/math]. In other words, it is the amount of information lost when [math]\displaystyle{ Q }[/math] is used to approximate [math]\displaystyle{ P }[/math].[6] In applications, [math]\displaystyle{ P }[/math] typically represents the "true" distribution of data, observations, or a precisely calculated theoretical distribution, while [math]\displaystyle{ Q }[/math] typically represents a theory, model, description, or approximation of [math]\displaystyle{ P }[/math]. In order to find a distribution [math]\displaystyle{ Q }[/math] that is closest to [math]\displaystyle{ P }[/math], we can minimize KL divergence and compute an information projection.
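The information projection mentioned above can be illustrated with a toy computation: minimizing [math]\displaystyle{ D_\text{KL}(P \parallel Q_\theta) }[/math] over a one-parameter family of candidate models. The binomial family and the grid search below are illustrative choices only, a sketch rather than a general recipe:

```python
import numpy as np
from scipy.stats import binom

P = np.array([0.2, 0.5, 0.3])       # the "true" distribution on {0, 1, 2}

def kl(p, q):
    return np.sum(p * np.log(p / q))

# Search over Binomial(2, theta) models Q_theta for the one closest to P in KL.
thetas = np.linspace(0.01, 0.99, 981)
divergences = [kl(P, binom.pmf([0, 1, 2], 2, t)) for t in thetas]
best = thetas[int(np.argmin(divergences))]
print(best, min(divergences))       # theta of the closest binomial, and its divergence
```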



The Kullback–Leibler divergence is a special case of a broader class of statistical divergences called f-divergences as well as the class of Bregman divergences. It is the only such divergence over probabilities that is a member of both classes. Although it is often intuited as a way of measuring the distance between probability distributions, the Kullback–Leibler divergence is not a true metric. It does not obey the Triangle Inequality, and in general [math]\displaystyle{ D_\text{KL}(P \parallel Q) }[/math] does not equal [math]\displaystyle{ D_\text{KL}(Q \parallel P) }[/math]. However, its infinitesimal form, specifically its Hessian, gives a metric tensor known as the Fisher information metric.



Arthur Hobson proved that the Kullback–Leibler divergence is the only measure of difference between probability distributions that satisfies some desired properties, which are the canonical extension to those appearing in a commonly used characterization of entropy.[7] Consequently, mutual information is the only measure of mutual dependence that obeys certain related conditions, since it can be defined in terms of Kullback–Leibler divergence.


Motivation

[Figure: Illustration of the Kullback–Leibler (KL) divergence for two normal distributions. The typical asymmetry for the Kullback–Leibler divergence is clearly visible.]

In information theory, the Kraft–McMillan theorem establishes that any directly decodable coding scheme for coding a message to identify one value [math]\displaystyle{ x_i }[/math] out of a set of possibilities [math]\displaystyle{ X }[/math] can be seen as representing an implicit probability distribution [math]\displaystyle{ q(x_i)=2^{-\ell_i} }[/math] over [math]\displaystyle{ X }[/math], where [math]\displaystyle{ \ell_i }[/math] is the length of the code for [math]\displaystyle{ x_i }[/math] in bits. Therefore, the Kullback–Leibler divergence can be interpreted as the expected extra message-length per datum that must be communicated if a code that is optimal for a given (wrong) distribution [math]\displaystyle{ Q }[/math] is used, compared to using a code based on the true distribution [math]\displaystyle{ P }[/math].


[math]\displaystyle{ \begin{align} D_\text{KL}(P\parallel Q) &= -\sum_{x\in\mathcal{X}} p(x) \log q(x) + \sum_{x\in\mathcal{X}} p(x) \log p(x) \\ &= \Eta(P, Q) - \Eta(P) \end{align} }[/math]



where [math]\displaystyle{ \Eta(P,Q) }[/math] is the cross entropy of [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math], and [math]\displaystyle{ \Eta(P) }[/math] is the entropy of [math]\displaystyle{ P }[/math] (which is the same as the cross-entropy of P with itself).



The KL divergence [math]\displaystyle{ KL(P \parallel Q) }[/math] can be thought of as something like a measurement of how far the distribution Q is from the distribution P. The cross-entropy [math]\displaystyle{ H(P,Q) }[/math] is itself such a measurement, but it has the defect that [math]\displaystyle{ H(P,P)=:H(P) }[/math] isn't zero, so we subtract [math]\displaystyle{ H(P) }[/math] to make [math]\displaystyle{ KL(P \parallel Q) }[/math] agree more closely with our notion of distance. (Unfortunately it still isn't symmetric.)
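A quick numerical check of the identity [math]\displaystyle{ \Eta(P, Q) = \Eta(P) + D_\text{KL}(P \parallel Q) }[/math] (the distributions below are arbitrary):

```python
import numpy as np

P = np.array([0.5, 0.25, 0.25])
Q = np.array([0.4, 0.4, 0.2])

H_P  = -np.sum(P * np.log2(P))        # entropy of P, in bits
H_PQ = -np.sum(P * np.log2(Q))        # cross entropy H(P, Q), in bits
D_KL = np.sum(P * np.log2(P / Q))     # Kullback–Leibler divergence, in bits

print(np.isclose(H_PQ, H_P + D_KL))   # True
```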


There is a relation between the Kullback–Leibler divergence and the "rate function" in the theory of large deviations.[8][9]

Properties


[math]\displaystyle{ D_\text{KL}(P\parallel Q) \geq 0, }[/math]
a result known as Gibbs' inequality, with [math]\displaystyle{ D_\text{KL}(P\parallel Q) }[/math] zero if and only if [math]\displaystyle{ P = Q }[/math] almost everywhere. The entropy [math]\displaystyle{ \Eta(P) }[/math] thus sets a minimum value for the cross-entropy [math]\displaystyle{ \Eta(P, Q) }[/math], the expected number of bits required when using a code based on [math]\displaystyle{ Q }[/math] rather than [math]\displaystyle{ P }[/math]; and the Kullback–Leibler divergence therefore represents the expected number of extra bits that must be transmitted to identify a value [math]\displaystyle{ x }[/math] drawn from [math]\displaystyle{ X }[/math], if a code is used corresponding to the probability distribution [math]\displaystyle{ Q }[/math], rather than the "true" distribution [math]\displaystyle{ P }[/math].


  • The Kullback–Leibler divergence remains well-defined for continuous distributions, and furthermore is invariant under parameter transformations. For example, if a transformation is made from variable [math]\displaystyle{ x }[/math] to variable [math]\displaystyle{ y(x) }[/math], then, since [math]\displaystyle{ P(x) dx = P(y) dy }[/math] and [math]\displaystyle{ Q(x) dx = Q(y) dy }[/math] the Kullback–Leibler divergence may be rewritten:


[math]\displaystyle{ \begin{align}
D_\text{KL}(P \parallel Q) &= \int_{x_a}^{x_b} P(x)\log\left(\frac{P(x)}{Q(x)}\right)\, dx \\[6pt]
&= \int_{y_a}^{y_b} P(y)\log\left(\frac{P(y)\, \frac{dy}{dx}}{Q(y)\, \frac{dy}{dx}}\right)\, dy
= \int_{y_a}^{y_b} P(y)\log\left(\frac{P(y)}{Q(y)}\right)\, dy
\end{align} }[/math]


where [math]\displaystyle{ y_a = y(x_a) }[/math] and [math]\displaystyle{ y_b = y(x_b) }[/math]. Although it was assumed that the transformation was continuous, this need not be the case. This also shows that the Kullback–Leibler divergence produces a dimensionally consistent quantity, since if [math]\displaystyle{ x }[/math] is a dimensioned variable, [math]\displaystyle{ P(x) }[/math] and [math]\displaystyle{ Q(x) }[/math] are also dimensioned, since e.g. [math]\displaystyle{ P(x) dx }[/math] is dimensionless. The argument of the logarithmic term is and remains dimensionless, as it must. It can therefore be seen as in some ways a more fundamental quantity than some other properties in information theory[10] (such as self-information or Shannon entropy), which can become undefined or negative for non-discrete probabilities.



  • The Kullback–Leibler divergence is additive for independent distributions in much the same way as Shannon entropy. If [math]\displaystyle{ P_1, P_2 }[/math] are independent distributions, with the joint distribution [math]\displaystyle{ P(x, y) = P_1(x)P_2(y) }[/math], and [math]\displaystyle{ Q, Q_1, Q_2 }[/math] likewise, then
[math]\displaystyle{ D_\text{KL}(P \parallel Q) = D_\text{KL}(P_1 \parallel Q_1) + D_\text{KL}(P_2 \parallel Q_2). }[/math]



  • The Kullback–Leibler divergence [math]\displaystyle{ D_\text{KL}(P \parallel Q) }[/math] is convex in the pair of probability mass functions [math]\displaystyle{ (p,q) }[/math], i.e. if [math]\displaystyle{ (p_1,q_1) }[/math] and [math]\displaystyle{ (p_2,q_2) }[/math] are two pairs of probability mass functions, then


  • [math]\displaystyle{ D_\text{KL}(\lambda p_1 + (1 - \lambda) p_2 \parallel \lambda q_1 + (1 - \lambda) q_2) \le \lambda D_\text{KL}(p_1 \parallel q_1) + (1 - \lambda)D_\text{KL}(p_2 \parallel q_2) \text{ for } 0 \le \lambda \le 1. }[/math]
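Both the additivity and the convexity stated above are easy to verify numerically; a minimal sketch with arbitrarily chosen distributions:

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

p1, q1 = np.array([0.3, 0.7]), np.array([0.5, 0.5])
p2, q2 = np.array([0.2, 0.5, 0.3]), np.array([0.4, 0.4, 0.2])

# Additivity: the divergence of the independent joint distributions is the sum of the parts.
P_joint, Q_joint = np.outer(p1, p2).ravel(), np.outer(q1, q2).ravel()
print(np.isclose(kl(P_joint, Q_joint), kl(p1, q1) + kl(p2, q2)))   # True

# Convexity in the pair (p, q): the divergence of the mixtures is at most the mixture of divergences.
p3, q3 = np.array([0.1, 0.6, 0.3]), np.array([0.3, 0.3, 0.4])
lam = 0.3
lhs = kl(lam * p2 + (1 - lam) * p3, lam * q2 + (1 - lam) * q3)
rhs = lam * kl(p2, q2) + (1 - lam) * kl(p3, q3)
print(lhs <= rhs)                                                  # True
```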



Examples



Multivariate normal distributions


Suppose that we have two multivariate normal distributions, with means [math]\displaystyle{ \mu_0, \mu_1 }[/math] and with (non-singular) covariance matrices [math]\displaystyle{ \Sigma_0, \Sigma_1. }[/math] If the two distributions have the same dimension, [math]\displaystyle{ k }[/math], then the Kullback–Leibler divergence between the distributions is as follows:[11]:p. 13


[math]\displaystyle{
 D_\text{KL}\left(\mathcal{N}_0 \parallel \mathcal{N}_1\right) =
 \frac{1}{2}\left(
   \operatorname{tr}\left(\Sigma_1^{-1}\Sigma_0\right) +
   \left(\mu_1 - \mu_0\right)^\mathsf{T} \Sigma_1^{-1}\left(\mu_1 - \mu_0\right) - k +
   \ln\left(\frac{\det\Sigma_1}{\det\Sigma_0}\right)
 \right).
}[/math]


The logarithm in the last term must be taken to base e since all terms apart from the last are base-e logarithms of expressions that are either factors of the density function or otherwise arise naturally. The equation therefore gives a result measured in nats. Dividing the entire expression above by [math]\displaystyle{ \ln(2) }[/math] yields the divergence in bits.
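The closed-form expression can be implemented directly; a sketch (the example means and covariance matrices are arbitrary):

```python
import numpy as np

def kl_mvn(mu0, Sigma0, mu1, Sigma1):
    """D_KL( N(mu0, Sigma0) || N(mu1, Sigma1) ) in nats, for non-singular covariances."""
    k = mu0.shape[0]
    Sigma1_inv = np.linalg.inv(Sigma1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(Sigma1_inv @ Sigma0)
                  + diff @ Sigma1_inv @ diff
                  - k
                  + np.log(np.linalg.det(Sigma1) / np.linalg.det(Sigma0)))

mu0, Sigma0 = np.array([0.0, 0.0]), np.array([[1.0, 0.2], [0.2, 0.5]])
mu1, Sigma1 = np.array([1.0, -0.5]), np.array([[1.5, 0.0], [0.0, 1.0]])
print(kl_mvn(mu0, Sigma0, mu1, Sigma1))               # nats
print(kl_mvn(mu0, Sigma0, mu1, Sigma1) / np.log(2))   # bits, as noted above
```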

A special case, and a common quantity in variational inference, is the KL-divergence between a diagonal multivariate normal, and a standard normal distribution (with zero mean and unit variance):

[math]\displaystyle{
 D_\text{KL}\left(
   \mathcal{N}\left(\left(\mu_1, \ldots, \mu_k\right)^\mathsf{T}, \operatorname{diag} \left(\sigma_1^2, \ldots, \sigma_k^2\right)\right) \parallel
   \mathcal{N}\left(\mathbf{0}, \mathbf{I}\right)
 \right) =
 {1 \over 2} \sum_{i=1}^k \left(\sigma_i^2 + \mu_i^2 - 1 - \ln\left(\sigma_i^2\right)\right).
}[/math]
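This special case reduces to an elementwise sum, which is how it typically appears in variational-inference code; a sketch (it can be cross-checked against the general formula above with [math]\displaystyle{ \Sigma_0 = \operatorname{diag}(\sigma^2) }[/math], [math]\displaystyle{ \mu_1 = 0 }[/math], [math]\displaystyle{ \Sigma_1 = I }[/math]):

```python
import numpy as np

def kl_diag_vs_standard(mu, sigma2):
    """D_KL( N(mu, diag(sigma2)) || N(0, I) ), with sigma2 the vector of variances."""
    return 0.5 * np.sum(sigma2 + mu**2 - 1.0 - np.log(sigma2))

mu = np.array([0.3, -1.2, 0.0])
sigma2 = np.array([0.5, 1.5, 2.0])
print(kl_diag_vs_standard(mu, sigma2))   # nats
```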



Relation to metrics


One might be tempted to call the Kullback–Leibler divergence a "distance metric" on the space of probability distributions, but this would not be correct as it is not symmetric – that is, [math]\displaystyle{ D_\text{KL}(P\parallel Q) \neq D_\text{KL}(Q\parallel P) }[/math] – nor does it satisfy the triangle inequality. Even so, being a premetric, it generates a topology on the space of probability distributions. More concretely, if [math]\displaystyle{ \{P_1,P_2,\ldots\} }[/math] is a sequence of distributions such that


[math]\displaystyle{ \lim_{n \to \infty} D_\text{KL}(P_n\parallel Q) = 0 }[/math]



then it is said that



[math]\displaystyle{ P_n \xrightarrow{D} Q . }[/math]



Pinsker's inequality entails that



[math]\displaystyle{ P_n \xrightarrow{D} P \Rightarrow P_n \xrightarrow{TV} P, }[/math]


where the latter stands for the usual convergence in total variation.
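Pinsker's inequality bounds the total variation distance by the divergence, [math]\displaystyle{ \delta(P, Q) \le \sqrt{\tfrac{1}{2} D_\text{KL}(P \parallel Q)} }[/math] (in nats), which is what drives the implication above. A quick numerical check with arbitrary distributions:

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))            # nats

def total_variation(p, q):
    return 0.5 * np.sum(np.abs(p - q))

P = np.array([0.6, 0.3, 0.1])
Q = np.array([0.3, 0.4, 0.3])
print(total_variation(P, Q) <= np.sqrt(0.5 * kl(P, Q)))   # True
```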



Fisher information metric

The Kullback–Leibler divergence is directly related to the Fisher information metric. This can be made explicit as follows. Assume that the probability distributions [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math] are both parameterized by some (possibly multi-dimensional) parameter [math]\displaystyle{ \theta }[/math]. Consider then two close-by values of [math]\displaystyle{ P = P(\theta) }[/math] and [math]\displaystyle{ Q = P(\theta_0) }[/math] so that the parameter [math]\displaystyle{ \theta }[/math] differs by only a small amount from the parameter value [math]\displaystyle{ \theta_0 }[/math]. Specifically, up to first order one has (using the Einstein summation convention)

[math]\displaystyle{ P(\theta) = P(\theta_0) + \Delta\theta_j P_j(\theta_0) + \cdots }[/math]

with [math]\displaystyle{ \Delta\theta_j = (\theta - \theta_0)_j }[/math] a small change of [math]\displaystyle{ \theta }[/math] in the [math]\displaystyle{ j }[/math] direction, and [math]\displaystyle{ P_j\left(\theta_0\right) = \frac{\partial P}{\partial \theta_j}(\theta_0) }[/math] the corresponding rate of change in the probability distribution. Since the Kullback–Leibler divergence has an absolute minimum 0 for [math]\displaystyle{ P = Q }[/math], i.e. [math]\displaystyle{ \theta = \theta_0 }[/math], it changes only to second order in the small parameters [math]\displaystyle{ \Delta\theta_j }[/math]. More formally, as for any minimum, the first derivatives of the divergence vanish


[math]\displaystyle{ \left.\frac{\partial}{\partial\theta_j}\right|_{\theta = \theta_0} D_\text{KL}(P(\theta) \parallel P(\theta_0)) = 0, }[/math]



and by the Taylor expansion one has up to second order


[math]\displaystyle{ D_\text{KL}(P(\theta) \parallel P(\theta_0)) = \frac{1}{2} \Delta\theta_j\Delta\theta_k g_{jk}(\theta_0) + \cdots }[/math]



where the Hessian matrix of the divergence

[math]\displaystyle{ g_{jk}(\theta_0) = \left.\frac{\partial^2}{\partial\theta_j\, \partial\theta_k} \right|_{\theta = \theta_0} D_\text{KL}(P(\theta) \parallel P(\theta_0)) }[/math]

must be positive semidefinite. Letting [math]\displaystyle{ \theta_0 }[/math] vary (and dropping the subindex 0) the Hessian [math]\displaystyle{ g_{jk}(\theta) }[/math] defines a (possibly degenerate) Riemannian metric on the θ parameter space, called the Fisher information metric.
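For a concrete one-parameter family this can be checked numerically: for a Bernoulli distribution the Fisher information is [math]\displaystyle{ 1/(\theta(1-\theta)) }[/math], and the second derivative of [math]\displaystyle{ D_\text{KL}(P(\theta) \parallel P(\theta_0)) }[/math] at [math]\displaystyle{ \theta = \theta_0 }[/math] reproduces it. A finite-difference sketch (the family and step size are our choices):

```python
import numpy as np

def kl_bernoulli(theta, theta0):
    p = np.array([theta, 1.0 - theta])
    q = np.array([theta0, 1.0 - theta0])
    return np.sum(p * np.log(p / q))

theta0, h = 0.3, 1e-4
# Second derivative of D_KL(P(theta) || P(theta0)) with respect to theta, at theta = theta0.
hessian = (kl_bernoulli(theta0 + h, theta0)
           - 2.0 * kl_bernoulli(theta0, theta0)
           + kl_bernoulli(theta0 - h, theta0)) / h**2

print(hessian)                        # ≈ 4.76
print(1.0 / (theta0 * (1 - theta0)))  # Fisher information: ≈ 4.76
```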



Fisher information metric theorem

When [math]\displaystyle{ p_{(x, \rho)} }[/math] satisfies the following regularity conditions:



[math]\displaystyle{ \tfrac{\partial \log(p)}{\partial \rho}, \tfrac{\partial^2 \log(p)}{\partial \rho^2}, \tfrac{\partial^3 \log(p)}{\partial \rho^3} }[/math] exist,
[math]\displaystyle{ \begin{align}
\left|\frac{\partial p}{\partial \rho}\right| &< F(x): \int_{x=0}^\infty F(x)\,dx < \infty, \\
\left|\frac{\partial^2 p}{\partial \rho^2}\right| &< G(x): \int_{x=0}^\infty G(x)\,dx < \infty \\
\left|\frac{\partial^3 \log(p)}{\partial \rho^3}\right| &< H(x): \int_{x=0}^\infty p(x, 0)H(x)\,dx < \xi < \infty
\end{align} }[/math]

where ξ is independent of ρ


[math]\displaystyle{
 \left.\int_{x=0}^\infty \frac{\partial p(x, \rho)}{\partial \rho}\right|_{\rho=0}\, dx =
 \left.\int_{x=0}^\infty \frac{\partial^2 p(x, \rho)}{\partial \rho^2}\right|_{\rho=0}\, dx = 0
}[/math]



then:

[math]\displaystyle{ \mathcal{D}(p(x, 0) \parallel p(x, \rho)) = \frac{c\rho^2}{2} + \mathcal{O}\left(\rho^3\right) \text{ as } \rho \to 0. }[/math]



Variation of information


Another information-theoretic metric is Variation of information, which is roughly a symmetrization of conditional entropy. It is a metric on the set of partitions of a discrete probability space.



Relation to other quantities of information theory

Many of the other quantities of information theory can be interpreted as applications of the Kullback–Leibler divergence to specific cases.



Self-information


The self-information, also known as the information content of a signal, random variable, or event, is defined as the negative logarithm of the probability of the given outcome occurring.



When applied to a discrete random variable, the self-information can be represented as[citation needed]



[math]\displaystyle{ \operatorname{I}(m) = D_\text{KL}\left(\delta_\text{im} \parallel \{p_i\}\right), }[/math]



is the Kullback–Leibler divergence of the probability distribution [math]\displaystyle{ P(i) }[/math] from a Kronecker delta representing certainty that [math]\displaystyle{ i = m }[/math] — i.e. the number of extra bits that must be transmitted to identify [math]\displaystyle{ i }[/math] if only the probability distribution [math]\displaystyle{ P(i) }[/math] is available to the receiver, not the fact that [math]\displaystyle{ i = m }[/math].
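
A minimal sketch (illustrative probabilities only) confirming that this divergence from a Kronecker delta reduces to the usual negative log-probability of the outcome:

import math

p = [0.5, 0.25, 0.25]                       # the distribution {p_i} available to the receiver
m = 1                                       # the outcome that actually occurred
delta = [1.0 if i == m else 0.0 for i in range(len(p))]

# D_KL(delta_im || {p_i}) in bits; terms with delta_i = 0 contribute nothing.
kl = sum(d * math.log2(d / q) for d, q in zip(delta, p) if d > 0)
print(kl, -math.log2(p[m]))                 # both print 2.0 bits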



Mutual information

The mutual information,[citation needed]



[math]\displaystyle{ \begin{align}
\operatorname{I}(X; Y)
  &= D_\text{KL}(P(X, Y) \parallel P(X)P(Y)) \\
  &= \operatorname{E}_X \{D_\text{KL}(P(Y \mid X) \parallel P(Y))\} \\
  &= \operatorname{E}_Y \{D_\text{KL}(P(X \mid Y) \parallel P(X))\}
\end{align}</math>



is the Kullback–Leibler divergence of the product [math]\displaystyle{ P(X)P(Y) }[/math] of the two marginal probability distributions from the joint probability distribution [math]\displaystyle{ P(X,Y) }[/math] — i.e. the expected number of extra bits that must be transmitted to identify [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] if they are coded using only their marginal distributions instead of the joint distribution. Equivalently, if the joint probability [math]\displaystyle{ P(X,Y) }[/math] is known, it is the expected number of extra bits that must on average be sent to identify [math]\displaystyle{ Y }[/math] if the value of [math]\displaystyle{ X }[/math] is not already known to the receiver.
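
A brief sketch (the joint table is an arbitrary illustrative choice) computing the mutual information directly as this divergence of the joint distribution from the product of its marginals:

import numpy as np

joint = np.array([[0.30, 0.10],
                  [0.10, 0.50]])                 # P(X, Y); rows index X, columns index Y
px = joint.sum(axis=1, keepdims=True)            # P(X)
py = joint.sum(axis=0, keepdims=True)            # P(Y)

# I(X; Y) = D_KL(P(X, Y) || P(X) P(Y)), in bits.
mi = np.sum(joint * np.log2(joint / (px * py)))
print(mi)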



Shannon entropy

The Shannon entropy,[citation needed]



[math]\displaystyle{ \begin{align}
\Eta(X) &= \operatorname{E}\left[\operatorname{I}_X(x)\right] \\
        &= \log(N) - D_\text{KL}\left(p_X(x) \parallel P_U(X)\right)
\end{align}</math>



is the number of bits which would have to be transmitted to identify [math]\displaystyle{ X }[/math] from [math]\displaystyle{ N }[/math] equally likely possibilities, less the Kullback–Leibler divergence of the uniform distribution on the random variates of [math]\displaystyle{ X }[/math], [math]\displaystyle{ P_U(X) }[/math], from the true distribution [math]\displaystyle{ P(X) }[/math] — i.e. less the expected number of bits saved, which would have had to be sent if the value of [math]\displaystyle{ X }[/math] were coded according to the uniform distribution [math]\displaystyle{ P_U(X) }[/math] rather than the true distribution [math]\displaystyle{ P(X) }[/math].
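
A small sketch (with an arbitrary four-point distribution) checking the identity Η(X) = log(N) − D_KL(p_X(x) ∥ P_U(X)):

import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])       # p_X(x)
N = len(p)
u = np.full(N, 1.0 / N)                       # the uniform distribution P_U(X)

entropy = -np.sum(p * np.log2(p))
kl = np.sum(p * np.log2(p / u))               # D_KL(p_X(x) || P_U(X))
print(entropy, np.log2(N) - kl)               # both print 1.75 bits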



Conditional entropy

The conditional entropy,[citation needed]



[math]\displaystyle{ \begin{align}
\Eta(X \mid Y) &= \log(N) - D_\text{KL}(P(X, Y) \parallel P_U(X) P(Y)) \\
               &= \log(N) - D_\text{KL}(P(X, Y) \parallel P(X) P(Y)) - D_\text{KL}(P(X) \parallel P_U(X)) \\
               &= \Eta(X) - \operatorname{I}(X; Y) \\
               &= \log(N) - \operatorname{E}_Y \left[D_\text{KL}\left(P\left(X \mid Y\right) \parallel P_U(X)\right)\right]
\end{align}</math>



is the number of bits which would have to be transmitted to identify [math]\displaystyle{ X }[/math] from [math]\displaystyle{ N }[/math] equally likely possibilities, less the Kullback–Leibler divergence of the product distribution [math]\displaystyle{ P_U(X) P(Y) }[/math] from the true joint distribution [math]\displaystyle{ P(X,Y) }[/math] — i.e. less the expected number of bits saved which would have had to be sent if the value of [math]\displaystyle{ X }[/math] were coded according to the uniform distribution [math]\displaystyle{ P_U(X) }[/math] rather than the conditional distribution [math]\displaystyle{ P(X|Y) }[/math] of [math]\displaystyle{ X }[/math] given [math]\displaystyle{ Y }[/math].
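
A short numerical sketch (arbitrary illustrative joint table) checking that Η(X ∣ Y) = Η(X) − I(X; Y):

import numpy as np

joint = np.array([[0.30, 0.10],
                  [0.10, 0.50]])            # P(X, Y)
px = joint.sum(axis=1)
py = joint.sum(axis=0)

def entropy(dist):
    dist = dist[dist > 0]
    return -np.sum(dist * np.log2(dist))

h_x_given_y = entropy(joint.ravel()) - entropy(py)          # H(X | Y) = H(X, Y) - H(Y)
mi = np.sum(joint * np.log2(joint / np.outer(px, py)))      # I(X; Y)
print(h_x_given_y, entropy(px) - mi)                        # the two values agree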



Cross entropy


When we have a set of possible events coming from a distribution p, we can encode them (with a lossless data compression) using entropy encoding. This compresses the data by replacing each fixed-length input symbol with a corresponding unique, variable-length, prefix-free code (e.g. the events (A, B, C) with probabilities p = (1/2, 1/4, 1/4) can be encoded as the bits (0, 10, 11)). If we know the distribution p in advance, we can devise an encoding that is optimal (e.g. using Huffman coding), meaning the messages we encode will have the shortest length on average (assuming the encoded events are sampled from p), equal to the Shannon entropy of p (denoted [math]\displaystyle{ \Eta(p) }[/math]). However, if we use a different probability distribution (q) when creating the entropy encoding scheme, then a larger number of bits will be used (on average) to identify an event from a set of possibilities. This new (larger) number is measured by the cross entropy between p and q.



The cross entropy between two probability distributions (p and q) measures the average number of bits needed to identify an event from a set of possibilities, if a coding scheme is used based on a given probability distribution q, rather than the "true" distribution p. The cross entropy for two distributions p and q over the same probability space is thus defined as follows:[citation needed]



[math]\displaystyle{ \Eta(p, q) = \operatorname{E}_p[-\log(q)] = \Eta(p) + D_\text{KL}(p \parallel q). }[/math]



Under this scenario, the KL divergences can be interpreted as the extra number of bits, on average, that are needed (beyond [math]\displaystyle{ \Eta(p) }[/math]) for encoding the events because of using q for constructing the encoding scheme instead of p.
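
For the three-event example above, a minimal sketch (the alternative distribution q is an arbitrary illustrative choice) makes the relation Η(p, q) = Η(p) + D_KL(p ∥ q) concrete:

import math

p = {"A": 0.5, "B": 0.25, "C": 0.25}     # true distribution of the events
q = {"A": 0.25, "B": 0.5, "C": 0.25}     # distribution the code was built for

entropy_p = -sum(p[s] * math.log2(p[s]) for s in p)          # 1.5 bits per event (optimal code)
cross_entropy = -sum(p[s] * math.log2(q[s]) for s in p)      # 1.75 bits per event (code built for q)
kl_pq = sum(p[s] * math.log2(p[s] / q[s]) for s in p)        # 0.25 extra bits per event

print(entropy_p, cross_entropy, entropy_p + kl_pq)   # 1.5, 1.75, 1.75: H(p, q) = H(p) + D_KL(p || q)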



Bayesian updating

In Bayesian statistics the Kullback–Leibler divergence can be used as a measure of the information gain in moving from a prior distribution to a posterior distribution: [math]\displaystyle{ p(x) \to p(x\mid I) }[/math]. If some new fact [math]\displaystyle{ Y = y }[/math] is discovered, it can be used to update the posterior distribution for [math]\displaystyle{ X }[/math] from [math]\displaystyle{ p(x\mid I) }[/math] to a new posterior distribution [math]\displaystyle{ p(x\mid y,I) }[/math] using Bayes' theorem:



[math]\displaystyle{ p(x \mid y, I) = \frac{p(y \mid x, I) p(x \mid I)}{p(y \mid I)} }[/math]



This distribution has a new entropy:



[math]\displaystyle{ \Eta\big(p(x \mid y, I)\big) = -\sum_x p(x \mid y, I) \log p(x \mid y, I), }[/math]



which may be less than or greater than the original entropy [math]\displaystyle{ \Eta(p(x\mid I)) }[/math]. However, from the standpoint of the new probability distribution one can estimate that to have used the original code based on [math]\displaystyle{ p(x\mid I) }[/math] instead of a new code based on [math]\displaystyle{ p(x\mid y, I) }[/math] would have added an expected number of bits:



[math]\displaystyle{ D_\text{KL}\big(p(x \mid y, I) \parallel p(x \mid I) \big) = \sum_x p(x \mid y, I) \log\left(\frac{p(x \mid y, I)}{p(x \mid I)}\right) }[/math]



to the message length. This therefore represents the amount of useful information, or information gain, about [math]\displaystyle{ X }[/math], that we can estimate has been learned by discovering [math]\displaystyle{ Y = y }[/math].



If a further piece of data, [math]\displaystyle{ Y_2 = y_2 }[/math], subsequently comes in, the probability distribution for [math]\displaystyle{ x }[/math] can be updated further, to give a new best guess [math]\displaystyle{ p(x \mid y_1, y_2, I) }[/math]. If one reinvestigates the information gain for using [math]\displaystyle{ p(x \mid y_1,I) }[/math] rather than [math]\displaystyle{ p(x \mid I) }[/math], it turns out that it may be either greater or less than previously estimated:



[math]\displaystyle{ \sum_x p(x \mid y_1, y_2, I) \log\left(\frac{p(x \mid y_1, y_2, I)}{p(x \mid I)}\right) }[/math] may be ≤ or > than [math]\displaystyle{ \displaystyle \sum_x p(x \mid y_1, I) \log\left(\frac{p(x \mid y_1, I)}{p(x \mid I)}\right) }[/math]



and so the combined information gain does not obey the triangle inequality:



[math]\displaystyle{ D_\text{KL} \big( p(x \mid y_1, y_2, I) \parallel p(x \mid I) \big) }[/math] may be <, = or > than [math]\displaystyle{ D_\text{KL}\big( p(x \mid y_1, y_2, I) \parallel p(x \mid y_1, I)\big) + D_\text{KL}\big(p(x \mid y_1, I) \parallel p(x \mid I)\big) }[/math]


All one can say is that on average, averaging using [math]\displaystyle{ p(y_2 \mid y_1, x, I) }[/math], the two sides will average out.
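
An illustrative sketch (the prior and the likelihood of the observed y are assumed, arbitrary numbers) computing the information gain D_KL(p(x ∣ y, I) ∥ p(x ∣ I)) for a three-valued X:

import numpy as np

prior = np.array([0.5, 0.3, 0.2])          # p(x | I)
likelihood = np.array([0.9, 0.4, 0.1])     # p(y | x, I) for the observed datum y

posterior = likelihood * prior
posterior /= posterior.sum()               # Bayes' theorem: p(x | y, I)

info_gain = np.sum(posterior * np.log2(posterior / prior))   # expected extra bits, as above
print(posterior, info_gain)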


Bayesian experimental design

A common goal in Bayesian experimental design is to maximise the expected Kullback–Leibler divergence between the prior and the posterior.[12] When posteriors are approximated to be Gaussian distributions, a design maximising the expected Kullback–Leibler divergence is called Bayes d-optimal.



Discrimination information


The Kullback–Leibler divergence [math]\displaystyle{ D_\text{KL}\bigl(p(x \mid H_1) \parallel p(x \mid H_0)\bigr) }[/math] can also be interpreted as the expected discrimination information for [math]\displaystyle{ H_1 }[/math] over [math]\displaystyle{ H_0 }[/math]: the mean information per sample for discriminating in favor of a hypothesis [math]\displaystyle{ H_1 }[/math] against a hypothesis [math]\displaystyle{ H_0 }[/math], when hypothesis [math]\displaystyle{ H_1 }[/math] is true.[13] Another name for this quantity, given to it by I. J. Good, is the expected weight of evidence for [math]\displaystyle{ H_1 }[/math] over [math]\displaystyle{ H_0 }[/math] to be expected from each sample.



The expected weight of evidence for [math]\displaystyle{ H_1 }[/math] over [math]\displaystyle{ H_0 }[/math] is not the same as the information gain expected per sample about the probability distribution [math]\displaystyle{ p(H) }[/math] of the hypotheses,


[math]\displaystyle{ D_\text{KL}(p(x \mid H_1) \parallel p(x \mid H_0)) \neq IG = D_\text{KL}(p(H \mid x) \parallel p(H \mid I)). }[/math]



Either of the two quantities can be used as a utility function in Bayesian experimental design, to choose an optimal next question to investigate: but they will in general lead to rather different experimental strategies.



On the entropy scale of information gain there is very little difference between near certainty and absolute certainty—coding according to a near certainty requires hardly any more bits than coding according to an absolute certainty. On the other hand, on the logit scale implied by weight of evidence, the difference between the two is enormous – infinite perhaps; this might reflect the difference between being almost sure (on a probabilistic level) that, say, the Riemann hypothesis is correct, compared to being certain that it is correct because one has a mathematical proof. These two different scales of loss function for uncertainty are both useful, according to how well each reflects the particular circumstances of the problem in question.
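
A minimal sketch (the two sampling distributions are arbitrary illustrative numbers) computing the expected discrimination information, i.e. the mean weight of evidence per sample for H_1 over H_0 when H_1 is true:

import numpy as np

p_x_given_h1 = np.array([0.7, 0.2, 0.1])    # p(x | H_1)
p_x_given_h0 = np.array([0.3, 0.3, 0.4])    # p(x | H_0)

discrimination = np.sum(p_x_given_h1 * np.log2(p_x_given_h1 / p_x_given_h0))
print(discrimination, "bits per sample")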



Principle of minimum discrimination information

The idea of Kullback–Leibler divergence as discrimination information led Kullback to propose the Principle of Minimum Discrimination Information (MDI): given new facts, a new distribution [math]\displaystyle{ f }[/math] should be chosen which is as hard to discriminate from the original distribution [math]\displaystyle{ f_0 }[/math] as possible; so that the new data produces as small an information gain [math]\displaystyle{ D_\text{KL}(f \parallel f_0) }[/math] as possible.

For example, if one had a prior distribution [math]\displaystyle{ p(x,a) }[/math] over [math]\displaystyle{ x }[/math] and [math]\displaystyle{ a }[/math], and subsequently learnt the true distribution of [math]\displaystyle{ a }[/math] was [math]\displaystyle{ u(a) }[/math], then the Kullback–Leibler divergence between the new joint distribution for [math]\displaystyle{ x }[/math] and [math]\displaystyle{ a }[/math], [math]\displaystyle{ q(x\mid a)u(a) }[/math], and the earlier prior distribution would be:

[math]\displaystyle{ D_\text{KL}(q(x \mid a)u(a) \parallel p(x, a)) = \operatorname{E}_{u(a)}\left\{D_\text{KL}(q(x \mid a) \parallel p(x \mid a))\right\} + D_\text{KL}(u(a) \parallel p(a)), }[/math]

i.e. the sum of the Kullback–Leibler divergence of [math]\displaystyle{ p(a) }[/math], the prior distribution for [math]\displaystyle{ a }[/math], from the updated distribution [math]\displaystyle{ u(a) }[/math], plus the expected value (using the probability distribution [math]\displaystyle{ u(a) }[/math]) of the Kullback–Leibler divergence of the prior conditional distribution [math]\displaystyle{ p(x\mid a) }[/math] from the new conditional distribution [math]\displaystyle{ q(x\mid a) }[/math]. (Note that the latter expected value is often called the conditional Kullback–Leibler divergence (or conditional relative entropy) and denoted by [math]\displaystyle{ D_\text{KL}(q(x\mid a) \parallel p(x\mid a)) }[/math][14].) This is minimized if [math]\displaystyle{ q(x\mid a)=p(x\mid a) }[/math] over the whole support of [math]\displaystyle{ u(a) }[/math]; and we note that this result incorporates Bayes' theorem, if the new distribution [math]\displaystyle{ u(a) }[/math] is in fact a δ function representing certainty that [math]\displaystyle{ a }[/math] has one particular value.

MDI can be seen as an extension of Laplace's Principle of Insufficient Reason, and the Principle of Maximum Entropy of E.T. Jaynes. In particular, it is the natural extension of the principle of maximum entropy from discrete to continuous distributions, for which Shannon entropy ceases to be so useful (see differential entropy), but the Kullback–Leibler divergence continues to be just as relevant.

In the engineering literature, MDI is sometimes called the Principle of Minimum Cross-Entropy (MCE), or Minxent for short. Minimising the Kullback–Leibler divergence from [math]\displaystyle{ m }[/math] to [math]\displaystyle{ p }[/math] with respect to [math]\displaystyle{ m }[/math] is equivalent to minimizing the cross-entropy of [math]\displaystyle{ p }[/math] and [math]\displaystyle{ m }[/math], since

[math]\displaystyle{ \Eta(p, m) = \Eta(p) + D_\text{KL}(p \parallel m), }[/math]

which is appropriate if one is trying to choose an adequate approximation to [math]\displaystyle{ p }[/math]. However, this is just as often not the task one is trying to achieve. Instead, just as often it is [math]\displaystyle{ m }[/math] that is some fixed prior reference measure, and [math]\displaystyle{ p }[/math] that one is attempting to optimise by minimising [math]\displaystyle{ D_\text{KL}(p \parallel m) }[/math] subject to some constraint. This has led to some ambiguity in the literature, with some authors attempting to resolve the inconsistency by redefining cross-entropy to be [math]\displaystyle{ D_\text{KL}(p \parallel m) }[/math], rather than [math]\displaystyle{ \Eta(p,m) }[/math].
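
The chain-rule decomposition given earlier in this section can be checked numerically; the following minimal sketch (illustrative numbers on a 2×2 grid) evaluates both sides of D_KL(q(x∣a)u(a) ∥ p(x,a)) = E_{u(a)}{D_KL(q(x∣a) ∥ p(x∣a))} + D_KL(u(a) ∥ p(a)):

import numpy as np

p_joint = np.array([[0.20, 0.15],
                    [0.30, 0.35]])                 # prior p(x, a); rows index x, columns index a
p_a = p_joint.sum(axis=0)                          # p(a)
p_x_given_a = p_joint / p_a                        # p(x | a)

u_a = np.array([0.6, 0.4])                         # newly learnt marginal u(a)
q_x_given_a = np.array([[0.5, 0.1],
                        [0.5, 0.9]])               # new conditional q(x | a)
q_joint = q_x_given_a * u_a                        # new joint q(x | a) u(a)

def kl(p, q):
    return np.sum(p * np.log(p / q))

lhs = kl(q_joint, p_joint)
rhs = sum(u_a[j] * kl(q_x_given_a[:, j], p_x_given_a[:, j]) for j in range(2)) + kl(u_a, p_a)
print(lhs, rhs)                                    # the two sides agree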


Relationship to available work


File:ArgonKLdivergence.png
Pressure versus volume plot of available work from a mole of argon gas relative to ambient, calculated as [math]\displaystyle{ T_o }[/math] times the Kullback–Leibler divergence.

Surprisals[15] add where probabilities multiply. The surprisal for an event of probability [math]\displaystyle{ p }[/math] is defined as [math]\displaystyle{ s = k \ln(1 / p) }[/math]. If [math]\displaystyle{ k }[/math] is [math]\displaystyle{ \left\{1, 1/\ln 2, 1.38 \times 10^{-23}\right\} }[/math] then surprisal is in [math]\displaystyle{ \{ }[/math]nats, bits, or [math]\displaystyle{ J/K\} }[/math] so that, for instance, there are [math]\displaystyle{ N }[/math] bits of surprisal for landing all "heads" on a toss of [math]\displaystyle{ N }[/math] coins.


Best-guess states (e.g. for atoms in a gas) are inferred by maximizing the average surprisal [math]\displaystyle{ S }[/math] (entropy) for a given set of control parameters (like pressure [math]\displaystyle{ P }[/math] or volume [math]\displaystyle{ V }[/math]). This constrained entropy maximization, both classically[16] and quantum mechanically,[17] minimizes Gibbs availability in entropy units[18] [math]\displaystyle{ A \equiv -k\ln(Z) }[/math] where [math]\displaystyle{ Z }[/math] is a constrained multiplicity or partition function.



When temperature [math]\displaystyle{ T }[/math] is fixed, free energy ([math]\displaystyle{ T \times A }[/math]) is also minimized. Thus if [math]\displaystyle{ T, V }[/math] and number of molecules [math]\displaystyle{ N }[/math] are constant, the Helmholtz free energy [math]\displaystyle{ F \equiv U - TS }[/math] (where [math]\displaystyle{ U }[/math] is energy) is minimized as a system "equilibrates." If [math]\displaystyle{ T }[/math] and [math]\displaystyle{ P }[/math] are held constant (say during processes in your body), the Gibbs free energy [math]\displaystyle{ G = U + PV - TS }[/math] is minimized instead. The change in free energy under these conditions is a measure of available work that might be done in the process. Thus available work for an ideal gas at constant temperature [math]\displaystyle{ T_o }[/math] and pressure [math]\displaystyle{ P_o }[/math] is [math]\displaystyle{ W = \Delta G = NkT_o\Theta(V/V_o) }[/math] where [math]\displaystyle{ V_o = NkT_o/P_o }[/math] and [math]\displaystyle{ \Theta(x) = x - 1 - \ln x \ge 0 }[/math] (see also Gibbs inequality).


More generally[19] the work available relative to some ambient is obtained by multiplying ambient temperature [math]\displaystyle{ T_o }[/math] by Kullback–Leibler divergence or net surprisal [math]\displaystyle{ \Delta I \ge 0, }[/math] defined as the average value of [math]\displaystyle{ k\ln(p/p_o) }[/math] where [math]\displaystyle{ p_o }[/math] is the probability of a given state under ambient conditions. For instance, the work available in equilibrating a monatomic ideal gas to ambient values of [math]\displaystyle{ V_o }[/math] and [math]\displaystyle{ T_o }[/math] is thus [math]\displaystyle{ W = T_o \Delta I }[/math], where Kullback–Leibler divergence


[math]\displaystyle{ \Delta I = Nk\left[\Theta\left(\frac{V}{V_o}\right) + \frac{3}{2}\Theta\left(\frac{T}{T_o}\right)\right]. }[/math]


The resulting contours of constant Kullback–Leibler divergence, shown at right for a mole of Argon at standard temperature and pressure, for example put limits on the conversion of hot to cold as in flame-powered air-conditioning or in the unpowered device to convert boiling-water to ice-water discussed here.[20] Thus Kullback–Leibler divergence measures thermodynamic availability in bits.
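
A rough numerical sketch of the formula above (the ambient temperature and molar volume used here are assumed illustrative values, not taken from the text):

import numpy as np

k = 1.380649e-23            # Boltzmann constant, J/K
N = 6.02214076e23           # number of atoms in one mole
T_o, V_o = 300.0, 0.0246    # assumed ambient temperature (K) and ambient molar volume (m^3)

def theta(x):
    return x - 1.0 - np.log(x)

def net_surprisal(T, V):
    # Delta I = N k [Theta(V/V_o) + (3/2) Theta(T/T_o)] for a monatomic ideal gas, in J/K.
    return N * k * (theta(V / V_o) + 1.5 * theta(T / T_o))

# Available work W = T_o * Delta I for gas at 600 K compressed to half the ambient volume, in joules.
print(T_o * net_surprisal(600.0, 0.5 * V_o))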


Quantum information theory

For density matrices [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math] on a Hilbert space, the K–L divergence (or quantum relative entropy as it is often called in this case) from [math]\displaystyle{ Q }[/math] to [math]\displaystyle{ P }[/math] is defined to be


[math]\displaystyle{ D_\text{KL}(P \parallel Q) = \operatorname{Tr}(P(\log(P) - \log(Q))). }[/math]


In quantum information science the minimum of [math]\displaystyle{ D_\text{KL}(P\parallel Q) }[/math] over all separable states [math]\displaystyle{ Q }[/math] can also be used as a measure of entanglement in the state [math]\displaystyle{ P }[/math].
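
A minimal numerical sketch (the two density matrices are arbitrary illustrative choices) evaluating Tr(P(log P − log Q)) with an eigendecomposition-based matrix logarithm:

import numpy as np

def matrix_log(A):
    # Matrix logarithm of a Hermitian positive-definite matrix via its eigendecomposition.
    w, v = np.linalg.eigh(A)
    return v @ np.diag(np.log(w)) @ v.conj().T

P = np.array([[0.7, 0.2],
              [0.2, 0.3]])        # a density matrix: Hermitian, positive, trace one
Q = np.array([[0.5, 0.0],
              [0.0, 0.5]])        # the maximally mixed state

d = np.trace(P @ (matrix_log(P) - matrix_log(Q))).real   # quantum relative entropy, in nats
print(d)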


Relationship between models and reality



Just as Kullback–Leibler divergence of "actual from ambient" measures thermodynamic availability, Kullback–Leibler divergence of "reality from a model" is also useful even if the only clues we have about reality are some experimental measurements. In the former case Kullback–Leibler divergence describes distance to equilibrium or (when multiplied by ambient temperature) the amount of available work, while in the latter case it tells you about surprises that reality has up its sleeve or, in other words, how much the model has yet to learn.


Although this tool for evaluating models against systems that are accessible experimentally may be applied in any field, its application to selecting a statistical model via the Akaike information criterion is particularly well described in papers[21] and a book[22] by Burnham and Anderson. In a nutshell the Kullback–Leibler divergence of reality from a model may be estimated, to within a constant additive term, by a function of the deviations observed between data and the model's predictions (like the mean squared deviation). Estimates of such divergence for models that share the same additive term can in turn be used to select among models.


When trying to fit parametrized models to data there are various estimators which attempt to minimize Kullback–Leibler divergence, such as maximum likelihood and maximum spacing estimators.[citation needed]


Symmetrised divergence

Kullback and Leibler themselves actually defined the divergence as:


[math]\displaystyle{ D_\text{KL}(P \parallel Q) + D_\text{KL}(Q \parallel P) }[/math]


which is symmetric and nonnegative. This quantity has sometimes been used for feature selection in classification problems, where [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math] are the conditional pdfs of a feature under two different classes. In the banking and finance industries, this quantity is referred to as the Population Stability Index, and is used to assess distributional shifts in model features through time.

An alternative is given via the [math]\displaystyle{ \lambda }[/math] divergence,

[math]\displaystyle{ D_\lambda(P \parallel Q) = \lambda D_\text{KL}(P \parallel \lambda P + (1 - \lambda)Q) + (1 - \lambda) D_\text{KL}(Q \parallel \lambda P + (1 - \lambda)Q), }[/math]

which can be interpreted as the expected information gain about [math]\displaystyle{ X }[/math] from discovering which probability distribution [math]\displaystyle{ X }[/math] is drawn from, [math]\displaystyle{ P }[/math] or [math]\displaystyle{ Q }[/math], if they currently have probabilities [math]\displaystyle{ \lambda }[/math] and [math]\displaystyle{ 1-\lambda }[/math] respectively.

The value [math]\displaystyle{ \lambda = 0.5 }[/math] gives the Jensen–Shannon divergence, defined by

[math]\displaystyle{ D_\text{JS} = \frac{1}{2} D_\text{KL} (P \parallel M) + \frac{1}{2} D_\text{KL}(Q \parallel M) }[/math]

where [math]\displaystyle{ M }[/math] is the average of the two distributions,

[math]\displaystyle{ M = \frac{1}{2}(P + Q). }[/math]

[math]\displaystyle{ D_\text{JS} }[/math] can also be interpreted as the capacity of a noisy information channel with two inputs giving the output distributions [math]\displaystyle{ P }[/math] and [math]\displaystyle{ Q }[/math]. The Jensen–Shannon divergence, like all f-divergences, is locally proportional to the Fisher information metric. It is similar to the Hellinger metric (in the sense that it induces the same affine connection on a statistical manifold).

There are many other important measures of probability distance. Some of these are particularly connected with the Kullback–Leibler divergence. Other notable measures of distance include the Hellinger distance, histogram intersection, Chi-squared statistic, quadratic form distance, match distance, Kolmogorov–Smirnov distance, and earth mover's distance.

Just as absolute entropy serves as theoretical background for data compression, relative entropy serves as theoretical background for data differencing – the absolute entropy of a set of data in this sense being the data required to reconstruct it (minimum compressed size), while the relative entropy of a target set of data, given a source set of data, is the data required to reconstruct the target given the source (minimum size of a patch).
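
A short sketch (illustrative distributions only) computing both the symmetrised divergence defined above and the Jensen–Shannon divergence:

import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.36, 0.48, 0.16])
q = np.array([0.30, 0.50, 0.20])

symmetrised = kl(p, q) + kl(q, p)          # the quantity Kullback and Leibler originally defined
m = 0.5 * (p + q)
js = 0.5 * kl(p, m) + 0.5 * kl(q, m)       # Jensen-Shannon divergence (the lambda = 0.5 case)
print(symmetrised, js)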

Category:Entropy and information

Category:F-divergences

Category:Information geometry

Category:Thermodynamics


This page was moved from wikipedia:en:Kullback–Leibler divergence. Its edit history can be viewed at 相对熵/edithistory

  1. Kullback, S.; Leibler, R.A. (1951). "On information and sufficiency". Annals of Mathematical Statistics. 22 (1): 79–86. doi:10.1214/aoms/1177729694. JSTOR 2236703. MR 39968.
  2. Kullback, S. (1959), Information Theory and Statistics, John Wiley & Sons. Republished by Dover Publications in 1968; reprinted in 1978.
  3. Kullback, S. (1987). "Letter to the Editor: The Kullback–Leibler distance". The American Statistician. 41 (4): 340–341. doi:10.1080/00031305.1987.10475510. JSTOR 2684769.
  4. MacKay, David J.C. (2003). Information Theory, Inference, and Learning Algorithms (First ed.). Cambridge University Press. p. 34. ISBN 9780521642989. https://books.google.com/books?id=AKuMj4PN_EMC&printsec=frontcover#v=onepage&q=%22Kullback%E2%80%93Leibler%20divergence%22&f=false. 
  5. Bishop C. (2006). Pattern Recognition and Machine Learning
  6. Burnham, K. P.; Anderson, D. R. (2002). Model Selection and Multi-Model Inference (2nd ed.). Springer. p. 51. ISBN 9780387953649. https://archive.org/details/modelselectionmu0000burn. 
  7. Hobson, Arthur (1971). Concepts in statistical mechanics.. New York: Gordon and Breach. ISBN 978-0677032405. 
  8. Sanov, I.N. (1957). "On the probability of large deviations of random magnitudes". Mat. Sbornik. 42 (84): 11–44.
  9. Novak S.Y. (2011), Extreme Value Methods with Applications to Finance, ch. 14.5 (Chapman & Hall).
  10. See the section "differential entropy – 4" in Relative Entropy video lecture by Sergio Verdú NIPS 2009
  11. Duchi J., "Derivations for Linear Algebra and Optimization".
  12. Chaloner, K.; Verdinelli, I. (1995). "Bayesian experimental design: a review". Statistical Science. 10 (3): 273–304. doi:10.1214/ss/1177009939.
  13. Press, W.H.; Teukolsky, S.A.; Vetterling, W.T.; Flannery, B.P. (2007). "Section 14.7.2. Kullback–Leibler Distance". Numerical Recipes: The Art of Scientific Computing (3rd ed.). Cambridge University Press. ISBN 978-0-521-88068-8. http://apps.nrbook.com/empanel/index.html#pg=756. 
  14. Cover, Thomas M.; Thomas, Joy A. (1991). Elements of Information Theory. John Wiley & Sons. p. 22.
  15. Myron Tribus (1961), Thermodynamics and Thermostatics (D. Van Nostrand, New York)
  16. Jaynes, E. T. (1957). "Information theory and statistical mechanics" (PDF). Physical Review. 106 (4): 620–630. Bibcode:1957PhRv..106..620J. doi:10.1103/physrev.106.620.
  17. Jaynes, E. T. (1957). "Information theory and statistical mechanics II" (PDF). Physical Review. 108 (2): 171–190. Bibcode:1957PhRv..108..171J. doi:10.1103/physrev.108.171.
  18. J.W. Gibbs (1873), "A method of geometrical representation of thermodynamic properties of substances by means of surfaces", reprinted in The Collected Works of J. W. Gibbs, Volume I Thermodynamics, ed. W. R. Longley and R. G. Van Name (New York: Longmans, Green, 1931) footnote page 52.
  19. Tribus, M.; McIrvine, E. C. (1971). "Energy and information". Scientific American. 224 (3): 179–186. Bibcode:1971SciAm.225c.179T. doi:10.1038/scientificamerican0971-179.
  20. Fraundorf, P. (2007). "Thermal roots of correlation-based complexity". Complexity. 13 (3): 18–26. arXiv:1103.2481. Bibcode:2008Cmplx..13c..18F. doi:10.1002/cplx.20195. Archived from the original on 2011-08-13.
  21. Burnham, K.P.; Anderson, D.R. (2001). "Kullback–Leibler information as a basis for strong inference in ecological studies". Wildlife Research. 28 (2): 111–119. doi:10.1071/WR99107.
  22. Burnham, K. P. and Anderson D. R. (2002), Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, Second Edition (Springer Science) .