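Differential entropy

Differential entropy (also referred to as continuous entropy) is a concept in information theory that began as an attempt by Shannon to extend the idea of (Shannon) entropy, a measure of the average surprisal of a random variable, to continuous probability distributions. Unfortunately, Shannon did not derive this formula; he simply assumed it was the correct continuous analogue of discrete entropy, but it is not.^[1] The actual continuous version of discrete entropy is the limiting density of discrete points (LDDP). Differential entropy (described here) is commonly encountered in the literature, but it is a limiting case of the LDDP, and one that loses its fundamental association with discrete entropy.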

Definition

Let [math]\displaystyle{ X }[/math] be a random variable with a probability density function [math]\displaystyle{ f }[/math] whose support is a set [math]\displaystyle{ \mathcal X }[/math]. The differential entropy [math]\displaystyle{ h(X) }[/math] or [math]\displaystyle{ h(f) }[/math] is defined as^[2]

[math]\displaystyle{ h(X) = -\int_\mathcal{X} f(x)\log f(x)\,dx }[/math]

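For probability distributions that do not have an explicit density function expression but do have an explicit quantile function expression [math]\displaystyle{ Q(p) }[/math], [math]\displaystyle{ h(Q) }[/math] can be defined in terms of the derivative of [math]\displaystyle{ Q(p) }[/math], i.e. the quantile density function [math]\displaystyle{ Q'(p) }[/math], as^[3]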

[math]\displaystyle{ h(Q) = \int_0^1 \log Q'(p)\,dp }[/math].
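
As a rough numerical check of the quantile form, the sketch below applies a midpoint rule to the integral of log Q'(p) for the exponential distribution, whose quantile function Q(p) = -ln(1 - p)/λ has derivative Q'(p) = 1/(λ(1 - p)). The function name `entropy_from_quantile_density`, the grid size and the choice λ = 2 are illustrative assumptions, and natural logarithms are used, so the result is in nats.

```python
import math

def entropy_from_quantile_density(q_prime, n=100_000):
    """Midpoint-rule estimate of the integral of ln Q'(p) over (0, 1), in nats."""
    step = 1.0 / n
    # Midpoints stay strictly inside (0, 1), so the endpoint singularity is never evaluated.
    return sum(math.log(q_prime((i + 0.5) * step)) for i in range(n)) * step

lam = 2.0  # illustrative rate parameter
# Quantile density of the exponential distribution: Q'(p) = 1 / (lam * (1 - p)).
h_exp = entropy_from_quantile_density(lambda p: 1.0 / (lam * (1.0 - p)))
print(h_exp, 1.0 - math.log(lam))
```

The two printed values should agree up to the discretization error of the midpoint rule, since the differential entropy of the exponential distribution is 1 - ln λ nats.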

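As with its discrete analogue, the units of differential entropy depend on the base of the logarithm, which is usually 2 (i.e., the units are bits; see logarithmic units for logarithms taken in other bases). Related concepts such as joint entropy, conditional differential entropy, and relative entropy are defined in a similar fashion. Unlike the discrete analogue, differential entropy has an offset that depends on the units used to measure [math]\displaystyle{ X }[/math].^[4] For example, the differential entropy of a quantity measured in millimetres will be log(1000) more than that of the same quantity measured in metres; a dimensionless quantity will have differential entropy log(1000) more than the same quantity divided by 1000.

One must take care in trying to apply properties of discrete entropy to differential entropy, since probability density functions can be greater than 1. For example, the uniform distribution [math]\displaystyle{ \mathcal{U}(0,1/2) }[/math] has negative differential entropy: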

[math]\displaystyle{ \int_0^\frac{1}{2} -2\log(2)\,dx=-\log(2)\, }[/math].

Thus, differential entropy does not share all properties of discrete entropy.
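
The negative value can be reproduced by evaluating the defining integral directly. The following is a minimal sketch using a midpoint rule; the helper name `differential_entropy` and the grid size are assumptions made for illustration, and natural logarithms are used (nats).

```python
import math

def differential_entropy(pdf, a, b, n=10_000):
    """Midpoint-rule estimate of -integral of f(x) ln f(x) over [a, b], in nats."""
    width = (b - a) / n
    total = 0.0
    for i in range(n):
        fx = pdf(a + (i + 0.5) * width)
        if fx > 0.0:  # points where f = 0 contribute nothing
            total -= fx * math.log(fx) * width
    return total

# Uniform density on (0, 1/2): f(x) = 2, which exceeds 1 everywhere on its support.
h_uniform = differential_entropy(lambda x: 2.0, 0.0, 0.5)
print(h_uniform, -math.log(2.0))
```

Because the density is constant, the midpoint rule is exact here up to rounding, and the result is -ln 2, a negative entropy.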

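Note that the continuous mutual information [math]\displaystyle{ I(X;Y) }[/math] has the distinction of retaining its fundamental significance as a measure of discrete information, since it is actually the limit of the discrete mutual information of partitions of X and Y as these partitions become finer and finer. Thus it is invariant under non-linear homeomorphisms (continuous and uniquely invertible maps),^[5] including linear^[6] transformations of [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math], and still represents the amount of discrete information that can be transmitted over a channel that admits a continuous space of values.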
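
The contrast between entropy and mutual information under a change of units can be illustrated with a bivariate Gaussian. The sketch below assumes the standard identity I(X;Y) = h(X) + h(Y) - h(X,Y) and the closed-form Gaussian entropy ½ ln((2πe)^n det K); the function names, the correlation 0.8 and the scale factor 1000 are illustrative choices.

```python
import math

def gaussian_entropy_1d(var):
    """Differential entropy (nats) of a 1-D Gaussian with variance `var`."""
    return 0.5 * math.log(2 * math.pi * math.e * var)

def gaussian_entropy_2d(var_x, var_y, cov_xy):
    """Differential entropy (nats) of a 2-D Gaussian with the given covariance entries."""
    det = var_x * var_y - cov_xy ** 2
    return 0.5 * math.log((2 * math.pi * math.e) ** 2 * det)

def gaussian_mutual_information(var_x, var_y, cov_xy):
    """I(X;Y) = h(X) + h(Y) - h(X,Y) for a jointly Gaussian pair."""
    return (gaussian_entropy_1d(var_x) + gaussian_entropy_1d(var_y)
            - gaussian_entropy_2d(var_x, var_y, cov_xy))

rho = 0.8        # correlation of the unit-variance pair
scale = 1000.0   # e.g. re-expressing X in millimetres instead of metres

mi_original = gaussian_mutual_information(1.0, 1.0, rho)
mi_rescaled = gaussian_mutual_information(scale ** 2, 1.0, scale * rho)
entropy_shift = gaussian_entropy_1d(scale ** 2) - gaussian_entropy_1d(1.0)
print(mi_original, mi_rescaled, entropy_shift)
```

Rescaling X shifts h(X) by ln(1000) but leaves the mutual information unchanged, because the same shift appears in h(X,Y) and cancels.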

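For the direct analogue of discrete entropy extended to the continuous space, see limiting density of discrete points.

Properties of differential entropy

For probability densities [math]\displaystyle{ f }[/math] and [math]\displaystyle{ g }[/math], the Kullback–Leibler divergence [math]\displaystyle{ D_{KL}(f || g) }[/math] is greater than or equal to 0, with equality only if [math]\displaystyle{ f=g }[/math] almost everywhere. Similarly, for two random variables [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math], [math]\displaystyle{ I(X;Y) \ge 0 }[/math] and [math]\displaystyle{ h(X|Y) \le h(X) }[/math], with equality if and only if [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] are independent.

The chain rule for differential entropy holds as in the discrete case:^[2]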

[math]\displaystyle{ h(X_1, \ldots, X_n) = \sum_{i=1}^{n} h(X_i|X_1, \ldots, X_{i-1}) \leq \sum_{i=1}^{n} h(X_i) }[/math].

Differential entropy is translation invariant, i.e. for a constant [math]\displaystyle{ c }[/math],^[2]

[math]\displaystyle{ h(X+c) = h(X) }[/math]

Differential entropy is in general not invariant under arbitrary invertible maps.

In particular, for a constant a,

[math]\displaystyle{ h(aX) = h(X)+ \log |a| }[/math]

For a vector-valued random variable [math]\displaystyle{ \mathbf{X} }[/math] and an invertible (square) matrix [math]\displaystyle{ \mathbf{A} }[/math],

[math]\displaystyle{ h(\mathbf{A}\mathbf{X})=h(\mathbf{X})+\log \left( |\det \mathbf{A}| \right) }[/math]^[2]
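
For a Gaussian vector the matrix rule can be checked in closed form, because AX is again Gaussian with covariance A K Aᵀ and its entropy is ½ ln((2πe)^n det K). The following sketch does this for n = 2; the helper names and the particular matrices K and A are assumptions chosen only for illustration.

```python
import math

def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def matmul2(p, q):
    """Product of two 2x2 matrices."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose2(m):
    return [[m[0][0], m[1][0]], [m[0][1], m[1][1]]]

def gaussian_entropy(cov):
    """Entropy (nats) of a 2-D Gaussian: 0.5 * ln((2*pi*e)^2 * det(cov))."""
    return 0.5 * math.log((2 * math.pi * math.e) ** 2 * det2(cov))

K = [[2.0, 0.5], [0.5, 1.0]]   # covariance of X (illustrative)
A = [[1.0, 2.0], [0.0, 3.0]]   # invertible matrix (illustrative), det A = 3

K_y = matmul2(matmul2(A, K), transpose2(A))   # covariance of AX
lhs = gaussian_entropy(K_y)                              # h(AX)
rhs = gaussian_entropy(K) + math.log(abs(det2(A)))       # h(X) + ln|det A|
print(lhs, rhs)
```

The two values coincide because det(A K Aᵀ) = (det A)² det K.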

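In general, for a transformation from a random vector to another random vector of the same dimension, [math]\displaystyle{ \mathbf{Y}=m \left(\mathbf{X}\right) }[/math], the corresponding entropies are related via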

[math]\displaystyle{ h(\mathbf{Y}) \leq h(\mathbf{X}) + \int f(x) \log \left\vert \frac{\partial m}{\partial x} \right\vert dx }[/math]

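where [math]\displaystyle{ \left\vert \frac{\partial m}{\partial x} \right\vert }[/math] is the Jacobian of the transformation [math]\displaystyle{ m }[/math].^[7] The above inequality becomes an equality if the transformation is a bijection. Furthermore, when [math]\displaystyle{ m }[/math] is a rigid rotation, translation, or a combination thereof, the Jacobian determinant is always 1, and [math]\displaystyle{ h(Y)=h(X) }[/math].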
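
The equality case for a bijection can be illustrated with Y = exp(X), where X is normal and Y is therefore lognormal. Here ln|∂m/∂x| = x, so the Jacobian term is E[X] = μ. The sketch below is a Monte Carlo check under these assumptions, estimating h(Y) as the sample mean of -ln f_Y(Y); the seed, sample size and parameter values are arbitrary illustrative choices.

```python
import math
import random

random.seed(0)            # fixed seed so the sketch is reproducible
mu, sigma = 0.3, 0.5      # illustrative parameters of X ~ N(mu, sigma^2)
n_samples = 200_000

def lognormal_log_pdf(y):
    """Log-density of Y = exp(X) with X ~ N(mu, sigma^2)."""
    return (-math.log(sigma * y * math.sqrt(2 * math.pi))
            - (math.log(y) - mu) ** 2 / (2 * sigma ** 2))

xs = [random.gauss(mu, sigma) for _ in range(n_samples)]

h_x = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)   # closed-form h(X)
jacobian_term = sum(xs) / n_samples                        # estimate of E[ln|d exp(x)/dx|] = E[X]
h_y = -sum(lognormal_log_pdf(math.exp(x)) for x in xs) / n_samples  # estimate of h(Y)
print(h_y, h_x + jacobian_term)
```

Up to sampling noise, the estimate of h(Y) matches h(X) plus the Jacobian term, and both are close to μ + ½ ln(2πeσ²).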

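If a random vector X with n components has mean zero and covariance matrix [math]\displaystyle{ K }[/math], then [math]\displaystyle{ h(\mathbf{X}) \leq \frac{1}{2} \log(\det{2 \pi e K}) = \frac{1}{2} \log[(2\pi e)^n \det{K}] }[/math], with equality if and only if [math]\displaystyle{ X }[/math] is jointly Gaussian, i.e. multivariate normal (see "Maximization in the normal distribution" below).^[2]

However, differential entropy does not have other desirable properties:

It is not invariant under change of variables, and is therefore most useful with dimensionless variables.
It can be negative.

A modification of differential entropy that addresses these drawbacks is the relative information entropy, also known as the Kullback–Leibler divergence, which includes an invariant measure factor (see limiting density of discrete points).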

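Maximization in the normal distribution

Theorem

With a normal distribution, differential entropy is maximized for a given variance. A Gaussian random variable has the largest entropy amongst all random variables of equal variance, or, alternatively, the maximum entropy distribution under constraints of mean and variance is the Gaussian.^[2]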
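
The theorem can be illustrated by comparing closed-form entropies of a few distributions tuned to the same variance. The sketch below uses the entropy formulas listed in the table of differential entropies, together with the standard variance expressions for each family (uniform (b - a)²/12, Laplace 2b², exponential 1/λ²); the dictionary layout and the choice of unit variance are illustrative assumptions.

```python
import math

variance = 1.0  # common variance for all distributions (illustrative)

sigma = math.sqrt(variance)            # normal: variance sigma^2
width = math.sqrt(12.0 * variance)     # uniform on [a, b]: variance (b - a)^2 / 12
b = math.sqrt(variance / 2.0)          # Laplace: variance 2 * b^2
lam = 1.0 / math.sqrt(variance)        # exponential: variance 1 / lam^2

# Entropies in nats, using the closed forms from the table.
entropies = {
    "normal": math.log(sigma * math.sqrt(2 * math.pi * math.e)),
    "uniform": math.log(width),
    "laplace": 1.0 + math.log(2.0 * b),
    "exponential": 1.0 - math.log(lam),
}

for name, h in sorted(entropies.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {h:.4f}")
```

Among these equal-variance candidates the normal distribution has the largest entropy, as the theorem asserts for all densities with that variance.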

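Proof

Let [math]\displaystyle{ g(x) }[/math] be a Gaussian probability density function with mean μ and variance [math]\displaystyle{ \sigma^2 }[/math], and [math]\displaystyle{ f(x) }[/math] an arbitrary probability density function with the same variance. Since differential entropy is translation invariant, we can assume that [math]\displaystyle{ f(x) }[/math] has the same mean [math]\displaystyle{ \mu }[/math] as [math]\displaystyle{ g(x) }[/math].

Consider the Kullback–Leibler divergence between the two distributions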

[math]\displaystyle{ 0 \leq D_{KL}(f || g) = \int_{-\infty}^\infty f(x) \log \left( \frac{f(x)}{g(x)} \right) dx = -h(f) - \int_{-\infty}^\infty f(x)\log(g(x)) dx. }[/math]

Now note that

[math]\displaystyle{ \begin{align} \int_{-\infty}^\infty f(x)\log(g(x)) dx &= \int_{-\infty}^\infty f(x)\log\left( \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\right) dx \\ &= \int_{-\infty}^\infty f(x) \log\frac{1}{\sqrt{2\pi\sigma^2}} dx + \log(e)\int_{-\infty}^\infty f(x)\left( -\frac{(x-\mu)^2}{2\sigma^2}\right) dx \\ &= -\tfrac{1}{2}\log(2\pi\sigma^2) - \log(e)\frac{\sigma^2}{2\sigma^2} \\ &= -\tfrac{1}{2}\left(\log(2\pi\sigma^2) + \log(e)\right) \\ &= -\tfrac{1}{2}\log(2\pi e \sigma^2) \\ &= -h(g) \end{align} }[/math]

because the result does not depend on [math]\displaystyle{ f(x) }[/math] other than through the variance. Combining the two results yields

[math]\displaystyle{ h(g) - h(f) \geq 0 \! }[/math]

with equality when [math]\displaystyle{ f(x)=g(x) }[/math], which follows from the properties of the Kullback–Leibler divergence.
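
The key step of the proof, that the integral of f log g depends on f only through its variance, can be checked numerically. In the sketch below f is taken to be a Laplace density with the same mean and variance as the Gaussian g; this choice of f, the integration range and the midpoint-rule helper `integrate` are all illustrative assumptions.

```python
import math

sigma = 1.0
b = sigma / math.sqrt(2.0)   # Laplace scale giving variance 2 * b**2 = sigma**2

def f(x):
    """Laplace density with mean 0 and variance sigma**2."""
    return math.exp(-abs(x) / b) / (2.0 * b)

def log_g(x):
    """Log of the normal density with mean 0 and variance sigma**2."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - x * x / (2.0 * sigma ** 2)

def integrate(fn, lo, hi, n=60_000):
    """Midpoint rule on [lo, hi]."""
    width = (hi - lo) / n
    return sum(fn(lo + (i + 0.5) * width) for i in range(n)) * width

h_g = 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)        # closed-form h(g)
cross = -integrate(lambda x: f(x) * log_g(x), -30.0, 30.0)       # minus the integral of f ln g
h_f = -integrate(lambda x: f(x) * math.log(f(x)), -30.0, 30.0)   # h(f)
kl = cross - h_f                                                  # D_KL(f || g)
print(cross, h_g, kl)
```

The cross term reproduces h(g) even though f is not Gaussian, and the remaining gap h(g) - h(f) is the non-negative Kullback–Leibler divergence.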

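Alternative proof

This result may also be demonstrated using the calculus of variations. A Lagrangian function with two Lagrange multipliers may be defined as: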

[math]\displaystyle{ L=\int_{-\infty}^\infty g(x)\ln(g(x))\,dx-\lambda_0\left(1-\int_{-\infty}^\infty g(x)\,dx\right)-\lambda\left(\sigma^2-\int_{-\infty}^\infty g(x)(x-\mu)^2\,dx\right) }[/math]

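where g(x) is some function with mean μ. When the entropy of g(x) is at a maximum and the constraint equations, which consist of the normalization condition [math]\displaystyle{ \left(1=\int_{-\infty}^\infty g(x)\,dx\right) }[/math] and the requirement of fixed variance [math]\displaystyle{ \left(\sigma^2=\int_{-\infty}^\infty g(x)(x-\mu)^2\,dx\right) }[/math], are both satisfied, then a small variation δg(x) about g(x) will produce a variation δL about L which is equal to zero: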

[math]\displaystyle{ 0=\delta L=\int_{-\infty}^\infty \delta g(x)\left (\ln(g(x))+1+\lambda_0+\lambda(x-\mu)^2\right )\,dx }[/math]

Since this must hold for any small δg(x), the term in brackets must be zero, and solving for g(x) yields:

[math]\displaystyle{ g(x)=e^{-\lambda_0-1-\lambda(x-\mu)^2} }[/math]

Using the constraint equations to solve for λ₀ and λ yields the normal distribution:

[math]\displaystyle{ g(x)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} }[/math]
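
The step from the exponential form to the normal density can be filled in with two standard Gaussian integrals. The following is a sketch of that calculation, writing C = e^{-λ₀-1} as shorthand (a symbol introduced here for convenience) and assuming λ > 0 so that the integrals converge:

```latex
\begin{align}
1 &= \int_{-\infty}^\infty C\,e^{-\lambda(x-\mu)^2}\,dx = C\sqrt{\frac{\pi}{\lambda}}, \\
\sigma^2 &= \int_{-\infty}^\infty C\,(x-\mu)^2 e^{-\lambda(x-\mu)^2}\,dx
          = C\,\frac{\sqrt{\pi}}{2\lambda^{3/2}} = \frac{1}{2\lambda}, \\
\lambda &= \frac{1}{2\sigma^2}, \qquad
C = \sqrt{\frac{\lambda}{\pi}} = \frac{1}{\sqrt{2\pi\sigma^2}}.
\end{align}
```

Substituting these values of C and λ into the exponential form gives the normal density with mean μ and variance σ².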

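Example: Exponential distribution

Let [math]\displaystyle{ X }[/math] be an exponentially distributed random variable with parameter [math]\displaystyle{ \lambda }[/math], that is, with probability density function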

[math]\displaystyle{ f(x) = \lambda e^{-\lambda x} \mbox{ for } x \geq 0. }[/math]

Its differential entropy is then

[math]\displaystyle{ \begin{align} h_e(X) &= -\int_0^\infty \lambda e^{-\lambda x} \log (\lambda e^{-\lambda x})\,dx \\ &= -\left(\int_0^\infty (\log \lambda)\lambda e^{-\lambda x}\,dx + \int_0^\infty (-\lambda x) \lambda e^{-\lambda x}\,dx\right) \\ &= -\log \lambda \int_0^\infty f(x)\,dx + \lambda E[X] \\ &= -\log\lambda + 1\,. \end{align} }[/math]

Here, [math]\displaystyle{ h_e(X) }[/math] was used rather than [math]\displaystyle{ h(X) }[/math] to make it explicit that the logarithm was taken to base e, which simplifies the calculation.
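
The third line of the derivation writes the entropy as an expectation, h_e(X) = E[-ln f(X)] = -ln λ + λE[X]. The sketch below estimates that expectation by Monte Carlo sampling; the seed, sample size and the value λ = 0.5 are illustrative assumptions.

```python
import math
import random

random.seed(1)       # fixed seed so the sketch is reproducible
lam = 0.5            # illustrative rate parameter
n_samples = 200_000

# For f(x) = lam * exp(-lam * x), -ln f(x) = -ln(lam) + lam * x.
samples = [random.expovariate(lam) for _ in range(n_samples)]
h_mc = sum(-math.log(lam) + lam * x for x in samples) / n_samples

h_exact = 1.0 - math.log(lam)   # closed form derived above, in nats
print(h_mc, h_exact)
```

The sample average approaches 1 - ln λ as the number of samples grows, since E[X] = 1/λ.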

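Relation to estimator error

The differential entropy yields a lower bound on the expected squared error of an estimator. For any random variable [math]\displaystyle{ X }[/math] and estimator [math]\displaystyle{ \widehat{X} }[/math], the following holds:^[2]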

[math]\displaystyle{ \operatorname{E}[(X - \widehat{X})^2] \ge \frac{1}{2\pi e}e^{2h(X)} }[/math]

with equality if and only if [math]\displaystyle{ X }[/math] is a Gaussian random variable and [math]\displaystyle{ \widehat{X} }[/math] is the mean of [math]\displaystyle{ X }[/math].
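
The bound can be evaluated for two simple cases using closed-form entropies in nats (the factor e^{2h} presumes natural logarithms). The sketch below, in which the function name `squared_error_lower_bound` and the parameter values are illustrative assumptions, shows the bound being tight for a Gaussian and strict for a uniform variable estimated by its mean.

```python
import math

def squared_error_lower_bound(h_nats):
    """Lower bound exp(2h) / (2*pi*e) on E[(X - X_hat)^2], with h in nats."""
    return math.exp(2.0 * h_nats) / (2.0 * math.pi * math.e)

# Gaussian case: h = 0.5 * ln(2*pi*e*sigma^2), so the bound equals sigma^2.
sigma = 1.5
h_gauss = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)
bound_gauss = squared_error_lower_bound(h_gauss)

# Uniform case on [a, b]: h = ln(b - a); the mean as estimator has error (b - a)^2 / 12.
a, b = 0.0, 1.0
h_unif = math.log(b - a)
bound_unif = squared_error_lower_bound(h_unif)
mse_unif = (b - a) ** 2 / 12.0

print(bound_gauss, sigma ** 2)
print(bound_unif, mse_unif)
```

For the Gaussian the bound coincides with the variance, while for the uniform variable the achieved error 1/12 lies strictly above the bound 1/(2πe).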

Differential entropies for various distributions

In the table below, [math]\displaystyle{ \Gamma(x) = \int_0^{\infty} e^{-t} t^{x-1} dt }[/math] is the gamma function, [math]\displaystyle{ \psi(x) = \frac{d}{dx} \ln\Gamma(x)=\frac{\Gamma'(x)}{\Gamma(x)} }[/math] is the digamma function, [math]\displaystyle{ B(p,q) = \frac{\Gamma(p)\Gamma(q)}{\Gamma(p+q)} }[/math] is the beta function, and γ_E is Euler's constant.^[8]

Table of differential entropies
Distribution	Probability density function	Entropy in nats	Support
Uniform	[math]\displaystyle{ f(x) = \frac{1}{b-a} }[/math]	[math]\displaystyle{ \ln(b - a) \, }[/math]	[math]\displaystyle{ [a,b]\, }[/math]
Normal	[math]\displaystyle{ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) }[/math]	[math]\displaystyle{ \ln\left(\sigma\sqrt{2\,\pi\,e}\right) }[/math]	[math]\displaystyle{ (-\infty,\infty)\, }[/math]
Exponential	[math]\displaystyle{ f(x) = \lambda \exp\left(-\lambda x\right) }[/math]	[math]\displaystyle{ 1 - \ln \lambda \, }[/math]	[math]\displaystyle{ [0,\infty)\, }[/math]
Rayleigh	[math]\displaystyle{ f(x) = \frac{x}{\sigma^2} \exp\left(-\frac{x^2}{2\sigma^2}\right) }[/math]	[math]\displaystyle{ 1 + \ln \frac{\sigma}{\sqrt{2}} + \frac{\gamma_E}{2} }[/math]	[math]\displaystyle{ [0,\infty)\, }[/math]
Beta	[math]\displaystyle{ f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha,\beta)} }[/math] for [math]\displaystyle{ 0 \leq x \leq 1 }[/math]	[math]\displaystyle{ \ln B(\alpha,\beta) - (\alpha-1)[\psi(\alpha) - \psi(\alpha +\beta)]\, }[/math] [math]\displaystyle{ - (\beta-1)[\psi(\beta) - \psi(\alpha + \beta)] \, }[/math]	[math]\displaystyle{ [0,1]\, }[/math]
Cauchy	[math]\displaystyle{ f(x) = \frac{\gamma}{\pi} \frac{1}{\gamma^2 + x^2} }[/math]	[math]\displaystyle{ \ln(4\pi\gamma) \, }[/math]	[math]\displaystyle{ (-\infty,\infty)\, }[/math]
Chi	[math]\displaystyle{ f(x) = \frac{2}{2^{k/2} \Gamma(k/2)} x^{k-1} \exp\left(-\frac{x^2}{2}\right) }[/math]	[math]\displaystyle{ \ln{\frac{\Gamma(k/2)}{\sqrt{2}}} - \frac{k-1}{2} \psi\left(\frac{k}{2}\right) + \frac{k}{2} }[/math]	[math]\displaystyle{ [0,\infty)\, }[/math]
Chi-squared	[math]\displaystyle{ f(x) = \frac{1}{2^{k/2} \Gamma(k/2)} x^{\frac{k}{2}\!-\!1} \exp\left(-\frac{x}{2}\right) }[/math]	[math]\displaystyle{ \ln 2\Gamma\left(\frac{k}{2}\right) - \left(1 - \frac{k}{2}\right)\psi\left(\frac{k}{2}\right) + \frac{k}{2} }[/math]	[math]\displaystyle{ [0,\infty)\, }[/math]
Erlang	[math]\displaystyle{ f(x) = \frac{\lambda^k}{(k-1)!} x^{k-1} \exp(-\lambda x) }[/math]	[math]\displaystyle{ (1-k)\psi(k) + \ln \frac{\Gamma(k)}{\lambda} + k }[/math]	[math]\displaystyle{ [0,\infty)\, }[/math]
F	[math]\displaystyle{ f(x) = \frac{n_1^{\frac{n_1}{2}} n_2^{\frac{n_2}{2}}}{B(\frac{n_1}{2},\frac{n_2}{2})} \frac{x^{\frac{n_1}{2} - 1}}{(n_2 + n_1 x)^{\frac{n_1 + n_2}{2}}} }[/math]	[math]\displaystyle{ \ln \frac{n_1}{n_2} B\left(\frac{n_1}{2},\frac{n_2}{2}\right) + \left(1 - \frac{n_1}{2}\right) \psi\left(\frac{n_1}{2}\right) - }[/math] [math]\displaystyle{ \left(1 + \frac{n_2}{2}\right)\psi\left(\frac{n_2}{2}\right) + \frac{n_1 + n_2}{2} \psi\left(\frac{n_1\!+\!n_2}{2}\right) }[/math]	[math]\displaystyle{ [0,\infty)\, }[/math]
Gamma	[math]\displaystyle{ f(x) = \frac{x^{k - 1} \exp(-\frac{x}{\theta})}{\theta^k \Gamma(k)} }[/math]	[math]\displaystyle{ \ln(\theta \Gamma(k)) + (1 - k)\psi(k) + k \, }[/math]	[math]\displaystyle{ [0,\infty)\, }[/math]
Laplace	[math]\displaystyle{ f(x) = \frac{1}{2b} \exp\left(-\frac{|x - \mu|}{b}\right) }[/math]	[math]\displaystyle{ 1 + \ln(2b) \, }[/math]	[math]\displaystyle{ (-\infty,\infty)\, }[/math]
Logistic	[math]\displaystyle{ f(x) = \frac{e^{-x}}{(1 + e^{-x})^2} }[/math]	[math]\displaystyle{ 2 \, }[/math]	[math]\displaystyle{ (-\infty,\infty)\, }[/math]
Lognormal	[math]\displaystyle{ f(x) = \frac{1}{\sigma x \sqrt{2\pi}} \exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right) }[/math]	[math]\displaystyle{ \mu + \frac{1}{2} \ln(2\pi e \sigma^2) }[/math]	[math]\displaystyle{ [0,\infty)\, }[/math]
Maxwell–Boltzmann	[math]\displaystyle{ f(x) = \frac{1}{a^3}\sqrt{\frac{2}{\pi}}\,x^{2}\exp\left(-\frac{x^2}{2a^2}\right) }[/math]	[math]\displaystyle{ \ln(a\sqrt{2\pi})+\gamma_E-\frac{1}{2} }[/math]	[math]\displaystyle{ [0,\infty)\, }[/math]
Generalized normal	[math]\displaystyle{ f(x) = \frac{2 \beta^{\frac{\alpha}{2}}}{\Gamma(\frac{\alpha}{2})} x^{\alpha - 1} \exp(-\beta x^2) }[/math]	[math]\displaystyle{ \ln{\frac{\Gamma(\alpha/2)}{2\beta^{\frac{1}{2}}}} - \frac{\alpha - 1}{2} \psi\left(\frac{\alpha}{2}\right) + \frac{\alpha}{2} }[/math]	[math]\displaystyle{ (-\infty,\infty)\, }[/math]
Pareto	[math]\displaystyle{ f(x) = \frac{\alpha x_m^\alpha}{x^{\alpha+1}} }[/math]	[math]\displaystyle{ \ln \frac{x_m}{\alpha} + 1 + \frac{1}{\alpha} }[/math]	[math]\displaystyle{ [x_m,\infty)\, }[/math]
Student's t	[math]\displaystyle{ f(x) = \frac{(1 + x^2/\nu)^{-\frac{\nu+1}{2}}}{\sqrt{\nu}B(\frac{1}{2},\frac{\nu}{2})} }[/math]	[math]\displaystyle{ \frac{\nu\!+\!1}{2}\left(\psi\left(\frac{\nu\!+\!1}{2}\right)\!-\!\psi\left(\frac{\nu}{2}\right)\right)\!+\!\ln \sqrt{\nu} B\left(\frac{1}{2},\frac{\nu}{2}\right) }[/math]	[math]\displaystyle{ (-\infty,\infty)\, }[/math]
Triangular	[math]\displaystyle{ f(x) = \begin{cases} \frac{2(x-a)}{(b-a)(c-a)} & \mathrm{for\ } a \le x \leq c, \\[4pt] \frac{2(b-x)}{(b-a)(b-c)} & \mathrm{for\ } c \lt x \le b, \\[4pt] \end{cases} }[/math]	[math]\displaystyle{ \frac{1}{2} + \ln \frac{b-a}{2} }[/math]	[math]\displaystyle{ [a,b]\, }[/math]
Weibull	[math]\displaystyle{ f(x) = \frac{k}{\lambda^k} x^{k-1} \exp\left(-\frac{x^k}{\lambda^k}\right) }[/math]	[math]\displaystyle{ \frac{(k-1)\gamma_E}{k} + \ln \frac{\lambda}{k} + 1 }[/math]	[math]\displaystyle{ [0,\infty)\, }[/math]
Multivariate normal	[math]\displaystyle{ f_X(\vec{x}) = }[/math] [math]\displaystyle{ \frac{\exp \left( -\frac{1}{2} ( \vec{x} - \vec{\mu})^\top \Sigma^{-1}\cdot(\vec{x} - \vec{\mu}) \right)} {(2\pi)^{N/2} \left|\Sigma\right|^{1/2}} }[/math]	[math]\displaystyle{ \frac{1}{2}\ln\{(2\pi e)^{N} \det(\Sigma)\} }[/math]	[math]\displaystyle{ \mathbb{R}^N }[/math]

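Many of these differential entropies are taken from Lazo and Rathie.^[9]

Variants

As described above, differential entropy does not share all properties of discrete entropy. For example, the differential entropy can be negative; it is also not invariant under continuous coordinate transformations. Edwin Thompson Jaynes showed in fact that the expression above is not the correct limit of the expression for a finite set of probabilities.^[10]

A modification of differential entropy adds an invariant measure factor [math]\displaystyle{ m(x) }[/math] to correct this (see limiting density of discrete points). If [math]\displaystyle{ m(x) }[/math] is further constrained to be a probability density, the resulting notion is called relative entropy in information theory: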

[math]\displaystyle{ D(p||m) = \int p(x)\log\frac{p(x)}{m(x)}\,dx. }[/math]

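The definition of differential entropy above can be obtained by partitioning the range of [math]\displaystyle{ X }[/math] into bins of length [math]\displaystyle{ h }[/math] with associated sample points [math]\displaystyle{ ih }[/math] within the bins, for [math]\displaystyle{ X }[/math] Riemann integrable. This gives a quantized version of [math]\displaystyle{ X }[/math], defined by [math]\displaystyle{ X_h = ih }[/math] if [math]\displaystyle{ ih \le X \le (i+1)h }[/math]. Then the entropy of [math]\displaystyle{ X_h = ih }[/math] is^[2]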

[math]\displaystyle{ H_h=-\sum_i hf(ih)\log (f(ih)) - \sum hf(ih)\log(h). }[/math]

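The first term on the right approximates the differential entropy, while the second term is approximately [math]\displaystyle{ -\log(h) }[/math]. Note that this procedure suggests that the entropy in the discrete sense of a continuous random variable should be [math]\displaystyle{ \infty }[/math].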
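
The relation between the quantized entropy and the differential entropy can be observed numerically. The sketch below quantizes a standard normal variable into bins of a given width, computes the discrete entropy of the bin probabilities from the normal CDF, and compares H_h + ln h with the closed-form differential entropy. The helper names, the truncation of the range to [-10, 10] and the bin widths are illustrative assumptions.

```python
import math

def normal_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def quantized_entropy(bin_width, span=10.0):
    """Discrete entropy (nats) of a standard normal quantized into equal bins on [-span, span]."""
    n_bins = int(round(2 * span / bin_width))
    total = 0.0
    for i in range(n_bins):
        lo = -span + i * bin_width
        p = normal_cdf(lo + bin_width) - normal_cdf(lo)  # probability mass of the bin
        if p > 0.0:  # bins with (numerically) zero mass contribute nothing
            total -= p * math.log(p)
    return total

h_true = 0.5 * math.log(2 * math.pi * math.e)   # differential entropy of N(0, 1)

# Gap between H_h + ln(h) and the differential entropy for shrinking bin widths.
gaps = {w: quantized_entropy(w) + math.log(w) - h_true for w in (0.5, 0.1, 0.02)}
for w, gap in gaps.items():
    print(w, quantized_entropy(w), gap)
```

As the bin width shrinks, the discrete entropy itself grows without bound like -ln h, while the corrected quantity H_h + ln h settles towards the differential entropy.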

References

[1] Jaynes, E.T. (1963). "Information Theory And Statistical Mechanics" (PDF). Brandeis University Summer Institute Lectures in Theoretical Physics. 3 (sect. 4b).
[2] Cover, Thomas M.; Thomas, Joy A. (1991). Elements of Information Theory. New York: Wiley. ISBN 0-471-06259-6. https://archive.org/details/elementsofinform0000cove.
[3] Vasicek, Oldrich (1976). "A Test for Normality Based on Sample Entropy". Journal of the Royal Statistical Society, Series B. 38 (1). JSTOR 2984828.
[4] Gibbs, Josiah Willard (1902). Elementary Principles in Statistical Mechanics, developed with especial reference to the rational foundation of thermodynamics. New York: Charles Scribner's Sons.
[5] Kraskov, Alexander; Stögbauer, Harald; Grassberger, Peter (2004). "Estimating mutual information". Physical Review E. 69: 066138. arXiv:cond-mat/0305641. Bibcode:2004PhRvE..69f6138K. doi:10.1103/PhysRevE.69.066138.
[6] Reza, Fazlollah M. (1994) [1961]. An Introduction to Information Theory. New York: Dover Publications. ISBN 0-486-68210-2. https://books.google.com/books?id=RtzpRAiX6OgC&pg=PA8&dq=intitle:%22An+Introduction+to+Information+Theory%22++%22entropy+of+a+simple+source%22&as_brr=0&ei=zP79Ro7UBovqoQK4g_nCCw&sig=j3lPgyYrC3-bvn1Td42TZgTzj0Q.
[7] "proof of upper bound on differential entropy of f(X)". Stack Exchange. April 16, 2016.
[8] Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model" (PDF). Journal of Econometrics. Elsevier. Archived from the original (PDF) on 2016-03-07. Retrieved 2011-06-02.
[9] Lazo, A.; Rathie, P. (1978). "On the entropy of continuous probability distributions". IEEE Transactions on Information Theory. 24 (1): 120–122. doi:10.1109/TIT.1978.1055832.
[10] Jaynes, E.T. (1963). "Information Theory And Statistical Mechanics" (PDF). Brandeis University Summer Institute Lectures in Theoretical Physics. 3 (sect. 4b).

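The Chinese version of this entry was compiled with the participation of user Henry, reviewed by CecileLi, and edited by 薄荷.

The content of this entry is drawn from Wikipedia and public sources and is released under the CC 3.0 license.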
