This entry was translated by Jie and reviewed by Lincent.

{{Short description|Measure of relative information in probability theory}}
{{Information theory}}

[[File:Entropy-mutual-information-relative-entropy-relation-diagram.svg|thumb|right|Venn diagram showing additive and subtractive relationships among the information measures associated with the correlated variables <math>X</math> and <math>Y</math>. The area contained by both circles is the joint entropy <math>H(X,Y)</math>. The circle on the left (red and violet) is the individual entropy <math>H(X)</math>, with the red part being the conditional entropy <math>H(X|Y)</math>. The circle on the right (blue and violet) is <math>H(Y)</math>, with the blue part being <math>H(Y|X)</math>. The violet region in the middle is the mutual information <math>\operatorname{I}(X;Y)</math>.]]

In [[information theory]], the '''conditional entropy''' quantifies the amount of information needed to describe the outcome of a [[random variable]] <math>Y</math> given that the value of another random variable <math>X</math> is known. Here, information is measured in [[Shannon (unit)|shannon]]s, [[Nat (unit)|nat]]s, or [[Hartley (unit)|hartley]]s. The ''entropy of <math>Y</math> conditioned on <math>X</math>'' is written as <math>H(Y|X)</math>.

== Definition ==
The conditional entropy of <math>Y</math> given <math>X</math> is defined as

{{Equation box 1
|indent =
|title=
|equation = <math>H(Y|X)\ = -\sum_{x\in\mathcal X, y\in\mathcal Y}p(x,y)\log \frac {p(x,y)}{p(x)}</math>
|cellpadding= 6
|border
|border colour = #0073CF
|background colour=#F5FFFA}}

where <math>\mathcal X</math> and <math>\mathcal Y</math> denote the [[Support (mathematics)|support sets]] of <math>X</math> and <math>Y</math>.

''Note:'' By convention, the expressions <math>0 \log 0</math> and <math>0 \log c/0</math> for fixed <math>c > 0</math> are treated as being equal to zero. This is because <math>\lim_{\theta\to0^+} \theta\, \log \,c/\theta = 0</math> and <math>\lim_{\theta\to0^+} \theta\, \log \theta = 0</math>.<ref>{{Cite web|url=http://www.inference.org.uk/mackay/itprnn/book.html|title=David MacKay: Information Theory, Pattern Recognition and Neural Networks: The Book|website=www.inference.org.uk|access-date=2019-10-25}}</ref>

Intuitive explanation of the definition: According to the definition, <math>\displaystyle H( Y|X) =\mathbb{E}( \ f( X,Y) \ )</math>, where <math>\displaystyle f:( x,y) \ \rightarrow -\log( \ p( y|x) \ )</math>. The function <math>\displaystyle f</math> associates to <math>\displaystyle ( x,y)</math> the information content of <math>\displaystyle ( Y=y)</math> given <math>\displaystyle (X=x)</math>, which is the amount of information needed to describe the event <math>\displaystyle (Y=y)</math> given <math>(X=x)</math>. According to the law of large numbers, <math>\displaystyle H(Y|X)</math> is the arithmetic mean of a large number of independent realizations of <math>\displaystyle f(X,Y)</math>.
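
As a concrete illustration of the formula above, the following minimal Python sketch evaluates <math>H(Y|X)</math> for a small joint distribution; the particular joint table and the use of base-2 logarithms (bits) are arbitrary choices made only for this example.

<syntaxhighlight lang="python">
import math

# Hypothetical joint distribution p(x, y) over X in {0, 1} and Y in {0, 1};
# the numbers are made up for illustration and must sum to 1.
p_xy = {
    (0, 0): 0.4, (0, 1): 0.1,
    (1, 0): 0.2, (1, 1): 0.3,
}

def conditional_entropy(p_xy):
    """H(Y|X) = -sum_{x,y} p(x,y) * log2(p(x,y) / p(x)), using the 0*log(0) = 0 convention."""
    p_x = {}
    for (x, _y), p in p_xy.items():          # marginal p(x) = sum_y p(x, y)
        p_x[x] = p_x.get(x, 0.0) + p
    return -sum(p * math.log2(p / p_x[x])
                for (x, _y), p in p_xy.items() if p > 0)

print(conditional_entropy(p_xy))  # about 0.846 bits for this table
</syntaxhighlight>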


== Motivation ==
Let <math>H(Y|X=x)</math> be the [[Shannon Entropy|entropy]] of the discrete random variable <math>Y</math> conditioned on the discrete random variable <math>X</math> taking a certain value <math>x</math>. Denote the support sets of <math>X</math> and <math>Y</math> by <math>\mathcal X</math> and <math>\mathcal Y</math>. Let <math>Y</math> have [[probability mass function]] <math>p_Y{(y)}</math>. The unconditional entropy of <math>Y</math> is calculated as <math>H(Y):=\mathbb{E}[\operatorname{I}(Y)]</math>, i.e.
:<math>H(Y) = \sum_{y\in\mathcal Y} {\mathrm{Pr}(Y=y)\,\mathrm{I}(y)}
= -\sum_{y\in\mathcal Y} {p_Y(y) \log_2{p_Y(y)}},</math>

where <math>\operatorname{I}(y_i)</math> is the [[information content]] of the [[Outcome (probability)|outcome]] of <math>Y</math> taking the value <math>y_i</math>. The entropy of <math>Y</math> conditioned on <math>X</math> taking the value <math>x</math> is defined analogously by [[conditional expectation]]:
:<math>H(Y|X=x)
= -\sum_{y\in\mathcal Y} {\Pr(Y = y|X=x) \log_2{\Pr(Y = y|X=x)}}.</math>

Note that <math>H(Y|X)</math> is the result of averaging <math>H(Y|X=x)</math> over all possible values <math>x</math> that <math>X</math> may take. Also, if the above sum is taken over a sample <math>y_1, \dots, y_n</math>, the expected value <math>E_X[ H(y_1, \dots, y_n \mid X = x)]</math> is known in some domains as '''equivocation'''.<ref>{{cite journal|author1=Hellman, M.|author2=Raviv, J.|year=1970|title=Probability of error, equivocation, and the Chernoff bound|journal=IEEE Transactions on Information Theory|volume=16|issue=4|pp=368-372}}</ref>

Given [[Discrete random variable|discrete random variables]] <math>X</math> with image <math>\mathcal X</math> and <math>Y</math> with image <math>\mathcal Y</math>, the conditional entropy of <math>Y</math> given <math>X</math> is defined as the weighted sum of <math>H(Y|X=x)</math> for each possible value of <math>x</math>, using <math>p(x)</math> as the weights:<ref name=cover1991>{{cite book|isbn=0-471-06259-6|year=1991|authorlink1=Thomas M. Cover|author1=T. Cover|author2=J. Thomas|title=Elements of Information Theory|url=https://archive.org/details/elementsofinform0000cove|url-access=registration}}</ref>{{rp|15}}
:<math>
\begin{align}
H(Y|X)\ &\equiv \sum_{x\in\mathcal X}\,p(x)\,H(Y|X=x)\\
 &=-\sum_{x\in\mathcal X} p(x)\sum_{y\in\mathcal Y}\,p(y|x)\,\log_2\, p(y|x)\\
 &=-\sum_{x\in\mathcal X}\sum_{y\in\mathcal Y}\,p(x,y)\,\log_2\,p(y|x)\\
 &=-\sum_{x\in\mathcal X, y\in\mathcal Y}p(x,y)\log_2\frac{p(x,y)}{p(x)}.
\end{align}
</math>
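
The first and last lines of this derivation can be compared numerically; the following self-contained Python sketch uses a made-up joint table and confirms that the weighted sum over <math>x</math> and the direct double sum agree:

<syntaxhighlight lang="python">
import math

# Same illustrative joint distribution as in the earlier sketch.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
xs = {x for x, _ in p_xy}
ys = {y for _, y in p_xy}
p_x = {x: sum(p_xy[x, y] for y in ys) for x in xs}

# H(Y|X=x): entropy of the conditional distribution p(y|x) for one fixed x.
def h_y_given_x_equals(x):
    return -sum((p_xy[x, y] / p_x[x]) * math.log2(p_xy[x, y] / p_x[x])
                for y in ys if p_xy[x, y] > 0)

# Weighted sum over x ...
h_weighted = sum(p_x[x] * h_y_given_x_equals(x) for x in xs)
# ... agrees with the direct double sum -sum p(x,y) log2 p(x,y)/p(x).
h_direct = -sum(p * math.log2(p / p_x[x]) for (x, _), p in p_xy.items() if p > 0)
assert abs(h_weighted - h_direct) < 1e-12
print(h_weighted, h_direct)  # both about 0.846 bits
</syntaxhighlight>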
== Properties ==

=== Conditional entropy equals zero ===
<math>H(Y|X)=0</math> if and only if the value of <math>Y</math> is completely determined by the value of <math>X</math>.

=== Conditional entropy of independent random variables ===
Conversely, <math>H(Y|X) = H(Y)</math> if and only if <math>Y</math> and <math>X</math> are [[independent random variables]].
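
A quick numerical illustration of this property, using a joint table deliberately constructed as the product of two made-up marginals:

<syntaxhighlight lang="python">
import math

p_x = {0: 0.3, 1: 0.7}            # made-up marginal of X
p_y = {0: 0.5, 1: 0.25, 2: 0.25}  # made-up marginal of Y
p_xy = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}  # independent joint

h_y = -sum(p * math.log2(p) for p in p_y.values())
h_y_given_x = -sum(p * math.log2(p / p_x[x]) for (x, _), p in p_xy.items())
assert abs(h_y - h_y_given_x) < 1e-12
print(h_y, h_y_given_x)  # identical: H(Y|X) = H(Y) under independence
</syntaxhighlight>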


=== Chain rule ===
Assume that the combined system determined by two random variables <math>X</math> and <math>Y</math> has [[joint entropy]] <math>H(X,Y)</math>, that is, we need <math>H(X,Y)</math> bits of information on average to describe its exact state. Now if we first learn the value of <math>X</math>, we have gained <math>H(X)</math> bits of information. Once <math>X</math> is known, we only need <math>H(X,Y)-H(X)</math> bits to describe the state of the whole system. This quantity is exactly <math>H(Y|X)</math>, which gives the ''chain rule'' of conditional entropy:

:<math>H(Y|X)\, = \, H(X,Y)- H(X).</math><ref name=cover1991 />{{rp|17}}

The chain rule follows from the above definition of conditional entropy:
:<math>\begin{align}
H(Y|X) &= \sum_{x\in\mathcal X, y\in\mathcal Y}p(x,y)\log \left(\frac{p(x)}{p(x,y)} \right) \\
 &= -\sum_{x\in\mathcal X, y\in\mathcal Y}p(x,y)\log\,p(x,y) + \sum_{x\in\mathcal X, y\in\mathcal Y}p(x,y)\log\,p(x) \\
 &= H(X,Y) + \sum_{x \in \mathcal X} p(x)\log\,p(x) \\
 &= H(X,Y) - H(X).
\end{align}</math>

In general, a chain rule for multiple random variables holds:
:<math> H(X_1,X_2,\ldots,X_n) =
\sum_{i=1}^n H(X_i | X_1, \ldots, X_{i-1}) </math><ref name=cover1991 />{{rp|22}}

It has a similar form to the [[Chain rule (probability)|chain rule]] in probability theory, except that addition instead of multiplication is used.
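
The chain rule is straightforward to verify numerically. The following self-contained Python sketch checks <math>H(Y|X) = H(X,Y) - H(X)</math> on another hypothetical joint table:

<syntaxhighlight lang="python">
import math

p_xy = {(0, 0): 0.125, (0, 1): 0.375, (1, 0): 0.3, (1, 1): 0.2}  # hypothetical joint table
xs = {x for x, _ in p_xy}
ys = {y for _, y in p_xy}
p_x = {x: sum(p_xy[x, y] for y in ys) for x in xs}

def entropy(dist):
    """Shannon entropy in bits of a probability table given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

h_xy = entropy(p_xy)                       # joint entropy H(X,Y)
h_x = entropy(p_x)                         # marginal entropy H(X)
h_y_given_x = -sum(p * math.log2(p / p_x[x])
                   for (x, _), p in p_xy.items() if p > 0)

assert abs(h_y_given_x - (h_xy - h_x)) < 1e-12  # chain rule: H(Y|X) = H(X,Y) - H(X)
print(h_y_given_x, h_xy - h_x)
</syntaxhighlight>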


=== Bayes' rule ===
[[Bayes' rule]] for conditional entropy states

:<math>H(Y|X)\,=\,H(X|Y) - H(X) + H(Y).</math>

''Proof.'' <math>H(Y|X) = H(X,Y) - H(X)</math> and <math>H(X|Y) = H(Y,X) - H(Y)</math>. Symmetry entails <math>H(X,Y) = H(Y,X)</math>. Subtracting the two equations implies Bayes' rule.

If <math>Y</math> is [[Conditional independence|conditionally independent]] of <math>Z</math> given <math>X</math> we have:

:<math>H(Y|X,Z)\,=\,H(Y|X).</math>

=== Other properties ===
For any <math>X</math> and <math>Y</math>:
:<math>\begin{align}
  H(Y|X) &\le H(Y) \, \\
  H(X,Y) &= H(X|Y) + H(Y|X) + \operatorname{I}(X;Y),\qquad \\
  H(X,Y) &= H(X) + H(Y) - \operatorname{I}(X;Y),\, \\
  \operatorname{I}(X;Y) &\le H(X),\,
\end{align}</math>

where <math>\operatorname{I}(X;Y)</math> is the [[mutual information]] between <math>X</math> and <math>Y</math>.

− | | + | 对于独立的<math>X</math>和<math>Y</math>: |
− | For independent <math>X</math> and <math>Y</math>:
| |
− | | |
− | 对于独立的X和Y:
| |
| | | |
| | | |
| :<math>H(Y|X) = H(Y) </math> and <math>H(X|Y) = H(X) \, </math> | | :<math>H(Y|X) = H(Y) </math> and <math>H(X|Y) = H(X) \, </math> |
| | | |
− |
| |
Although the specific-conditional entropy <math>H(X|Y=y)</math> can be either less or greater than <math>H(X)</math> for a given [[random variate]] <math>y</math> of <math>Y</math>, <math>H(X|Y)</math> can never exceed <math>H(X)</math>.
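
These identities, together with Bayes' rule from the previous subsection, can be checked on a small discrete example; the joint table below is again a hypothetical choice:

<syntaxhighlight lang="python">
import math

p_xy = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.15, (1, 1): 0.45}  # assumed joint table
xs = {x for x, _ in p_xy}
ys = {y for _, y in p_xy}
p_x = {x: sum(p_xy[x, y] for y in ys) for x in xs}
p_y = {y: sum(p_xy[x, y] for x in xs) for y in ys}

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

h_x, h_y, h_xy = entropy(p_x), entropy(p_y), entropy(p_xy)
h_y_given_x = -sum(p * math.log2(p / p_x[x]) for (x, y), p in p_xy.items() if p > 0)
h_x_given_y = -sum(p * math.log2(p / p_y[y]) for (x, y), p in p_xy.items() if p > 0)
mi = h_x + h_y - h_xy                                          # mutual information I(X;Y)

assert h_y_given_x <= h_y + 1e-12                              # H(Y|X) <= H(Y)
assert abs(h_xy - (h_x_given_y + h_y_given_x + mi)) < 1e-12    # H(X,Y) = H(X|Y)+H(Y|X)+I(X;Y)
assert mi <= h_x + 1e-12                                       # I(X;Y) <= H(X)
assert abs(h_y_given_x - (h_x_given_y - h_x + h_y)) < 1e-12    # Bayes' rule for conditional entropy
print(h_y_given_x, h_x_given_y, mi)
</syntaxhighlight>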


== Conditional differential entropy ==

=== Definition ===
The above definition is for discrete random variables. The continuous version of discrete conditional entropy is called ''conditional differential (or continuous) entropy''. Let <math>X</math> and <math>Y</math> be continuous random variables with a [[joint probability density function]] <math>f(x,y)</math>. The differential conditional entropy <math>h(X|Y)</math> is defined as<ref name=cover1991 />{{rp|249}}
:<math>h(X|Y) = -\int_{\mathcal X, \mathcal Y} f(x,y)\log f(x|y)\,dx\,dy.</math>

=== Properties ===
In contrast to the conditional entropy for discrete random variables, the conditional differential entropy may be negative.

As in the discrete case there is a chain rule for differential entropy:
:<math>h(Y|X)\,=\,h(X,Y)-h(X)</math><ref name=cover1991 />{{rp|253}}

Notice however that this rule may not be true if the involved differential entropies do not exist or are infinite.
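
For jointly Gaussian variables the differential entropies have closed forms, which makes the properties above easy to check. The sketch below assumes a standard bivariate normal with correlation <math>\rho=0.99</math> (an arbitrary choice) and shows both that <math>h(Y|X)</math> can be negative and that the differential chain rule holds:

<syntaxhighlight lang="python">
import math

def h_gaussian(var):
    """Differential entropy (in nats) of a 1-D Gaussian with the given variance."""
    return 0.5 * math.log(2 * math.pi * math.e * var)

rho = 0.99                      # assumed correlation of a standard bivariate normal
det_cov = 1.0 - rho**2          # determinant of the covariance matrix [[1, rho], [rho, 1]]

h_x = h_gaussian(1.0)
h_y = h_gaussian(1.0)
h_xy = 0.5 * math.log((2 * math.pi * math.e) ** 2 * det_cov)   # joint differential entropy
h_y_given_x = h_gaussian(1.0 - rho**2)                         # Y | X=x is N(rho*x, 1 - rho^2)

print(h_y_given_x)                                  # about -0.54 nats: negative!
assert abs(h_y_given_x - (h_xy - h_x)) < 1e-12      # chain rule: h(Y|X) = h(X,Y) - h(X)
assert h_y_given_x <= h_y                           # conditioning does not increase differential entropy
</syntaxhighlight>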

Joint differential entropy is also used in the definition of the [[mutual information]] between continuous random variables:

:<math>\operatorname{I}(X,Y)=h(X)+h(Y)-h(X,Y).</math>

<math>h(X|Y) \le h(X)</math> with equality if and only if <math>X</math> and <math>Y</math> are independent.<ref name=cover1991 />{{rp|253}}

=== Relation to estimator error ===
The conditional differential entropy yields a lower bound on the expected squared error of an [[estimator]]. For any random variable <math>X</math>, observation <math>Y</math> and estimator <math>\widehat{X}</math> the following holds:<ref name=cover1991 />{{rp|255}}

:<math>\mathbb{E}\left[\bigl(X - \widehat{X}{(Y)}\bigr)^2\right] \ge \frac{1}{2\pi e}e^{2h(X|Y)}</math>
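
For a standard bivariate normal with correlation <math>\rho</math> (the value below is an arbitrary choice), the linear estimator <math>\widehat{X}(Y)=\rho Y</math> attains this bound with equality, which gives a simple numerical check:

<syntaxhighlight lang="python">
import math

rho = 0.8                                   # assumed correlation, standard bivariate normal
mse = 1.0 - rho**2                          # E[(X - rho*Y)^2] for the estimator rho*Y
h_x_given_y = 0.5 * math.log(2 * math.pi * math.e * (1.0 - rho**2))   # h(X|Y) in nats
bound = math.exp(2 * h_x_given_y) / (2 * math.pi * math.e)

print(mse, bound)                           # both 0.36: the bound is tight here
assert mse >= bound - 1e-12
</syntaxhighlight>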

This is related to the [[uncertainty principle]] from [[quantum mechanics]].

== Generalization to quantum theory ==
In [[quantum information theory]], the conditional entropy is generalized to the [[conditional quantum entropy]]. The latter can take negative values, unlike its classical counterpart.

== See also ==
* [[Entropy (information theory)]]
* [[Mutual information]]
* [[Conditional quantum entropy]]
* [[Variation of information]]
* [[Entropy power inequality]]
* [[Likelihood function]]

== References ==
{{Reflist}}

[[Category:Entropy and information]]
[[Category:Information theory]]