第1行: |
第1行: |
− | 此词条由Jie翻译。
| + | 此词条由Jie翻译。由Lincent审校。 |
| | | |
| {{Short description|Measure of relative information in probability theory}} | | {{Short description|Measure of relative information in probability theory}} |
第8行: |
第8行: |
| In [[information theory]], the '''conditional entropy''' quantifies the amount of information needed to describe the outcome of a [[random variable]] <math>Y</math> given that the value of another random variable <math>X</math> is known. Here, information is measured in [[Shannon (unit)|shannon]]s, [[Nat (unit)|nat]]s, or [[Hartley (unit)|hartley]]s. The ''entropy of <math>Y</math> conditioned on <math>X</math>'' is written as H(Y ǀ X). | | In [[information theory]], the '''conditional entropy''' quantifies the amount of information needed to describe the outcome of a [[random variable]] <math>Y</math> given that the value of another random variable <math>X</math> is known. Here, information is measured in [[Shannon (unit)|shannon]]s, [[Nat (unit)|nat]]s, or [[Hartley (unit)|hartley]]s. The ''entropy of <math>Y</math> conditioned on <math>X</math>'' is written as H(Y ǀ X). |
| | | |
− | 在'''<font color="#ff8000"> 信息论Information theory</font>'''中,假设随机变量<math>X</math>的值已知,那么'''<font color="#ff8000"> 条件熵Conditional entropy</font>'''则用于去量化描述随机变量<math>Y</math>结果所需的信息量。此时,信息以'''<font color="#ff8000"> 香农Shannon </font>''','''<font color="#ff8000"> 奈特nat</font>'''或'''<font color="#ff8000"> 哈特莱hartley</font>'''来衡量。以<math>X</math>为条件的<math>Y</math>熵写为<math>H(X ǀ Y)</math>。 | + | 在'''<font color="#ff8000"> 信息论Information theory</font>'''中,'''<font color="#ff8000"> 条件熵Conditional entropy</font>'''定量描述在已知另一随机变量<math>X</math>取值的条件下,描述随机变量<math>Y</math>的结果所需的信息量。此时,信息以'''<font color="#ff8000"> 香农Shannon </font>''','''<font color="#ff8000"> 奈特nat</font>'''或'''<font color="#ff8000"> 哈特莱hartley</font>'''来衡量。已知<math>X</math>的条件下<math>Y</math>的熵记为<math>H(Y ǀ X)</math>。 |
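As a side note on units (an illustrative sketch of my own, not part of the article): shannons, nats and hartleys measure the same quantity with logarithm bases 2, e and 10, so converting between them is a single rescaling.

<syntaxhighlight lang="python">
import math

def convert_information(value, from_base, to_base):
    """Rescale an information quantity between logarithm bases.

    Shannons use base 2, nats use base e, hartleys use base 10;
    the same entropy differs only by the factor log(from)/log(to).
    """
    return value * math.log(from_base) / math.log(to_base)

# 1 shannon (bit) expressed in nats and in hartleys
print(convert_information(1.0, 2, math.e))  # ~0.6931 nats
print(convert_information(1.0, 2, 10))      # ~0.3010 hartleys
</syntaxhighlight>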
| | | |
| | | |
第32行: |
第32行: |
| where <math>\mathcal X</math> and <math>\mathcal Y</math> denote the [[Support (mathematics)|support sets]] of <math>X</math> and <math>Y</math>. | | where <math>\mathcal X</math> and <math>\mathcal Y</math> denote the [[Support (mathematics)|support sets]] of <math>X</math> and <math>Y</math>. |
| | | |
− | 其中<math>\mathcal X</math>和<math>\mathcal Y</math>表示<math>X</math>和<math>Y</math>的支撑集。 | + | 其中<math>\mathcal X</math>和<math>\mathcal Y</math>表示<math>X</math>和<math>Y</math>的<font color="#32cd32">支撑集</font>。 |
| | | |
| | | |
第38行: |
第38行: |
| ''Note:'' It is conventional that the expressions <math>0 \log 0</math> and <math>0 \log c/0</math> for fixed <math>c > 0</math> should be treated as being equal to zero. This is because <math>\lim_{\theta\to0^+} \theta\, \log \,c/\theta = 0</math> and <math>\lim_{\theta\to0^+} \theta\, \log \theta = 0</math><ref>{{Cite web|url=http://www.inference.org.uk/mackay/itprnn/book.html|title=David MacKay: Information Theory, Pattern Recognition and Neural Networks: The Book|website=www.inference.org.uk|access-date=2019-10-25}}</ref> <!-- because p(x,y) could still equal 0 even if p(x) != 0 and p(y) != 0. What about p(x,y)=p(x)=0? --> | | ''Note:'' It is conventional that the expressions <math>0 \log 0</math> and <math>0 \log c/0</math> for fixed <math>c > 0</math> should be treated as being equal to zero. This is because <math>\lim_{\theta\to0^+} \theta\, \log \,c/\theta = 0</math> and <math>\lim_{\theta\to0^+} \theta\, \log \theta = 0</math><ref>{{Cite web|url=http://www.inference.org.uk/mackay/itprnn/book.html|title=David MacKay: Information Theory, Pattern Recognition and Neural Networks: The Book|website=www.inference.org.uk|access-date=2019-10-25}}</ref> <!-- because p(x,y) could still equal 0 even if p(x) != 0 and p(y) != 0. What about p(x,y)=p(x)=0? --> |
| | | |
− | 注意:在约定<math>c > 0</math>始终成立时,表达式<math>0 \log 0</math>和<math>0 \log c/0</math>视为等于零。这是因为<math>\lim_{\theta\to0^+} \theta\, \log \,c/\theta = 0</math>,而且<math>\lim_{\theta\to0^+} \theta\, \log \theta = 0</math>><ref>{{Cite web|url=http://www.inference.org.uk/mackay/itprnn/book.html|title=David MacKay: Information Theory, Pattern Recognition and Neural Networks: The Book|website=www.inference.org.uk|access-date=2019-10-25}}</ref> <!-- because p(x,y) could still equal 0 even if p(x) != 0 and p(y) != 0. What about p(x,y)=p(x)=0? -->
| + | 注意:按照约定,对于固定的<math>c > 0</math>,表达式<math>0 \log 0</math>和<math>0 \log c/0</math>均视为零。这是因为<math>\lim_{\theta\to0^+} \theta\, \log \,c/\theta = 0</math>,而且<math>\lim_{\theta\to0^+} \theta\, \log \theta = 0</math><ref>{{Cite web|url=http://www.inference.org.uk/mackay/itprnn/book.html|title=David MacKay: Information Theory, Pattern Recognition and Neural Networks: The Book|website=www.inference.org.uk|access-date=2019-10-25}}</ref> <!-- because p(x,y) could still equal 0 even if p(x) != 0 and p(y) != 0. What about p(x,y)=p(x)=0? --> |
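A quick numerical check of this convention (an illustrative sketch of my own): both limits can be verified directly, and entropy code usually encodes the convention in a small helper.

<syntaxhighlight lang="python">
import math

def plogp(p):
    """p * log2(p), with the convention 0 * log 0 = 0."""
    return 0.0 if p == 0.0 else p * math.log2(p)

# theta * log2(c / theta) -> 0 as theta -> 0+ (here c = 3, an arbitrary constant)
c = 3.0
for theta in (1e-1, 1e-3, 1e-6, 1e-9):
    print(theta, theta * math.log2(c / theta), plogp(theta))
</syntaxhighlight>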
| | | |
| | | |
第44行: |
第44行: |
| Intuitive explanation of the definition : According to the definition, <math>\displaystyle H( Y|X) =\mathbb{E}( \ f( X,Y) \ )</math> where <math>\displaystyle f:( x,y) \ \rightarrow -\log( \ p( y|x) \ ) .</math> <math>\displaystyle f</math> associates to <math>\displaystyle ( x,y)</math> the information content of <math>\displaystyle ( Y=y)</math> given <math>\displaystyle (X=x)</math>, which is the amount of information needed to describe the event <math>\displaystyle (Y=y)</math> given <math>(X=x)</math>. According to the law of large numbers, <math>\displaystyle H(Y|X)</math> is the arithmetic mean of a large number of independent realizations of <math>\displaystyle f(X,Y)</math>. | | Intuitive explanation of the definition : According to the definition, <math>\displaystyle H( Y|X) =\mathbb{E}( \ f( X,Y) \ )</math> where <math>\displaystyle f:( x,y) \ \rightarrow -\log( \ p( y|x) \ ) .</math> <math>\displaystyle f</math> associates to <math>\displaystyle ( x,y)</math> the information content of <math>\displaystyle ( Y=y)</math> given <math>\displaystyle (X=x)</math>, which is the amount of information needed to describe the event <math>\displaystyle (Y=y)</math> given <math>(X=x)</math>. According to the law of large numbers, <math>\displaystyle H(Y|X)</math> is the arithmetic mean of a large number of independent realizations of <math>\displaystyle f(X,Y)</math>. |
| | | |
− | 对该定义的直观解释是:根据定义<math>\displaystyle H( Y|X) =\mathbb{E}( \ f( X,Y) \ )</math>,其中<math>\displaystyle f:( x,y) \ \rightarrow -\log( \ p( y|x) \ ) </math>. <math>\displaystyle f</math>将给定<math>\displaystyle (X=x)</math>的<math>\displaystyle ( Y=y)</math>的信息内容与<math>\displaystyle ( x,y)</math>相关联,这是描述在给定<math>(X=x)</math>条件下的事件<math>\displaystyle (Y=y)</math>所需的信息量。根据大数定律,<math>H(Y ǀ X)</math>是<math>\displaystyle f(X,Y)</math>的大量独立实现的算术平均值。 | + | 对该定义的直观解释是:根据定义<math>\displaystyle H( Y|X) =\mathbb{E}( \ f( X,Y) \ )</math>,其中<math>\displaystyle f:( x,y) \ \rightarrow -\log( \ p( y|x) \ ) </math>. <math>\displaystyle f</math>将给定<math>\displaystyle (X=x)</math>时的<math>\displaystyle ( Y=y)</math>的信息内容与<math>\displaystyle ( x,y)</math>相关联,这是描述在给定<math>(X=x)</math>条件下的事件<math>\displaystyle (Y=y)</math>所需的信息量。根据大数定律,<math>H(Y ǀ X)</math>是大量<math>\displaystyle f(X,Y)</math>独立实验结果的算术平均值。 |
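To make the law-of-large-numbers reading concrete, here is a small simulation sketch with an invented joint pmf (the distribution and names are my own): the empirical mean of <math>f(X,Y) = -\log_2 p(Y|X)</math> over many independent samples approaches <math>H(Y|X)</math> computed directly from the definition.

<syntaxhighlight lang="python">
import math
import random

# An invented joint pmf p(x, y) over {0,1} x {0,1}
p_xy = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.40}
p_x = {x: sum(v for (xx, _), v in p_xy.items() if xx == x) for x in (0, 1)}

def f(x, y):
    """Information content -log2 p(y | x) of the event Y=y given X=x."""
    return -math.log2(p_xy[(x, y)] / p_x[x])

# Direct computation: H(Y|X) = sum_{x,y} p(x,y) * (-log2 p(y|x))
h_direct = sum(p * f(x, y) for (x, y), p in p_xy.items())

# Monte Carlo: arithmetic mean of f over many independent realizations
random.seed(0)
pairs, weights = zip(*p_xy.items())
samples = random.choices(pairs, weights=weights, k=200_000)
h_mc = sum(f(x, y) for x, y in samples) / len(samples)

print(h_direct, h_mc)  # the two values agree to roughly two decimal places
</syntaxhighlight>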
| | | |
| | | |
第51行: |
第51行: |
| Let <math>H(Y ǀ X = x)</math> be the [[Shannon Entropy|entropy]] of the discrete random variable <math>Y</math> conditioned on the discrete random variable <math>X</math> taking a certain value <math>x</math>. Denote the support sets of <math>X</math> and <math>Y</math> by <math>\mathcal X</math> and <math>\mathcal Y</math>. Let <math>Y</math> have [[probability mass function]] <math>p_Y{(y)}</math>. The unconditional entropy of <math>Y</math> is calculated as <math>H(Y):=E[I(Y)]</math>, i.e. | | Let <math>H(Y ǀ X = x)</math> be the [[Shannon Entropy|entropy]] of the discrete random variable <math>Y</math> conditioned on the discrete random variable <math>X</math> taking a certain value <math>x</math>. Denote the support sets of <math>X</math> and <math>Y</math> by <math>\mathcal X</math> and <math>\mathcal Y</math>. Let <math>Y</math> have [[probability mass function]] <math>p_Y{(y)}</math>. The unconditional entropy of <math>Y</math> is calculated as <math>H(Y):=E[I(Y)]</math>, i.e. |
| | | |
− | 设<math>H(Y ǀ X = x)</math>为离散随机变量<math>Y</math>的熵,条件是离散随机变量<math>X</math>取一定值<math>x</math>。用<math>\mathcal X</math>和<math>\mathcal Y</math>表示<math>X</math>和<math>Y</math>的支撑集。令<math>Y</math>具有概率质量函数<math>p_Y{(y)}</math>。<math>Y</math>的无条件熵计算为<math>H(Y):=E[I(Y)</math>。 | + | 设<math>H(Y ǀ X = x)</math>为离散随机变量<math>Y</math>在离散随机变量<math>X</math>取定值<math>x</math>时的熵。用<math>\mathcal X</math>和<math>\mathcal Y</math>表示<math>X</math>和<math>Y</math>的支撑集。令<math>Y</math>的概率质量函数为<math>p_Y{(y)}</math>。<math>Y</math>的无条件熵计算为<math>H(Y):=E[I(Y)]</math>,即: |
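A minimal sketch of the unconditional entropy <math>H(Y)=E[I(Y)]=-\sum_y p_Y(y)\log_2 p_Y(y)</math>, using an invented pmf of my own:

<syntaxhighlight lang="python">
import math

def entropy(pmf):
    """Shannon entropy in bits: -sum_y p(y) * log2 p(y), skipping zero terms."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

p_y = {"a": 0.5, "b": 0.25, "c": 0.25}  # invented example pmf, not from the article
print(entropy(p_y))  # 1.5 bits
</syntaxhighlight>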
| | | |
| | | |
第60行: |
第60行: |
| where <math>\operatorname{I}(y_i)</math> is the [[information content]] of the [[Outcome (probability)|outcome]] of <math>Y</math> taking the value <math>y_i</math>. The entropy of <math>Y</math> conditioned on <math>X</math> taking the value <math>x</math> is defined analogously by [[conditional expectation]]: | | where <math>\operatorname{I}(y_i)</math> is the [[information content]] of the [[Outcome (probability)|outcome]] of <math>Y</math> taking the value <math>y_i</math>. The entropy of <math>Y</math> conditioned on <math>X</math> taking the value <math>x</math> is defined analogously by [[conditional expectation]]: |
| | | |
− | 这里当取值为<math>y_i</math>时,<math>\operatorname{I}(y_i)</math>是其结果<math>Y</math>的信息内容。类似地以<math>X</math>为条件的<math>Y</math>的熵,当值为<math>x</math>时,也可以通过条件期望来定义:
| + | 其中<math>\operatorname{I}(y_i)</math>是<math>Y</math>取值为<math>y_i</math>这一结果的信息量。在<math>X</math>取值为<math>x</math>的条件下,<math>Y</math>的熵也可以通过条件期望类似地定义: |
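The quantity <math>H(Y|X=x)</math> is simply the entropy of the conditional distribution <math>p(y|x)</math> for one fixed <math>x</math>. A sketch with an invented joint pmf:

<syntaxhighlight lang="python">
import math

# Invented joint pmf p(x, y) for illustration only
p_xy = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.40}

def cond_entropy_given_x(p_xy, x):
    """H(Y | X = x): entropy in bits of the conditional pmf p(y | x)."""
    p_x = sum(v for (xx, _), v in p_xy.items() if xx == x)
    cond = [v / p_x for (xx, _), v in p_xy.items() if xx == x]
    return -sum(q * math.log2(q) for q in cond if q > 0)

print(cond_entropy_given_x(p_xy, 0))  # entropy of (0.6, 0.4), ~0.971 bits
print(cond_entropy_given_x(p_xy, 1))  # entropy of (0.2, 0.8), ~0.722 bits
</syntaxhighlight>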
| | | |
| | | |
第69行: |
第69行: |
| Note that<math> H(Y ǀ X)</math> is the result of averaging <math>H(Y ǀ X = x)</math> over all possible values <math>x</math> that <math>X</math> may take. Also, if the above sum is taken over a sample <math>y_1, \dots, y_n</math>, the expected value <math>E_X[ H(y_1, \dots, y_n \mid X = x)]</math> is known in some domains as '''equivocation'''.<ref>{{cite journal|author1=Hellman, M.|author2=Raviv, J.|year=1970|title=Probability of error, equivocation, and the Chernoff bound|journal=IEEE Transactions on Information Theory|volume=16|issue=4|pp=368-372}}</ref> | | Note that<math> H(Y ǀ X)</math> is the result of averaging <math>H(Y ǀ X = x)</math> over all possible values <math>x</math> that <math>X</math> may take. Also, if the above sum is taken over a sample <math>y_1, \dots, y_n</math>, the expected value <math>E_X[ H(y_1, \dots, y_n \mid X = x)]</math> is known in some domains as '''equivocation'''.<ref>{{cite journal|author1=Hellman, M.|author2=Raviv, J.|year=1970|title=Probability of error, equivocation, and the Chernoff bound|journal=IEEE Transactions on Information Theory|volume=16|issue=4|pp=368-372}}</ref> |
| | | |
− | 注意,<math> H(Y ǀ X)</math>是在<math>X</math>可能取的所有可能值<math>x</math>上对<math>H(Y ǀ X = x)</math>求平均值的结果。同样,如果将上述总和接管到样本<math>y_1, \dots, y_n</math>上,则预期值<math>E_X[ H(y_1, \dots, y_n \mid X = x)]</math>在某些领域中会变得模糊。<ref>{{cite journal|author1=Hellman, M.|author2=Raviv, J.|year=1970|title=Probability of error, equivocation, and the Chernoff bound|journal=IEEE Transactions on Information Theory|volume=16|issue=4|pp=368-372}}</ref> | + | 注意,<math> H(Y ǀ X)</math>是对<math>X</math>所有可能取值<math>x</math>的<math>H(Y ǀ X = x)</math>求平均的结果。同样,如果对样本<math>y_1, \dots, y_n</math>求上述和,则期望值<math>E_X[ H(y_1, \dots, y_n \mid X = x)]</math>在某些领域中被称为'''疑义度equivocation'''。<ref>{{cite journal|author1=Hellman, M.|author2=Raviv, J.|year=1970|title=Probability of error, equivocation, and the Chernoff bound|journal=IEEE Transactions on Information Theory|volume=16|issue=4|pp=368-372}}</ref> |
| | | |
| | | |
第75行: |
第75行: |
| Given [[Discrete random variable|discrete random variables]] <math>X</math> with image <math>\mathcal X</math> and <math>Y</math> with image <math>\mathcal Y</math>, the conditional entropy of <math>Y</math> given <math>X</math> is defined as the weighted sum of <math>H(Y|X=x)</math> for each possible value of <math>x</math>, using <math>p(x)</math> as the weights:<ref name=cover1991>{{cite book|isbn=0-471-06259-6|year=1991|authorlink1=Thomas M. Cover|author1=T. Cover|author2=J. Thomas|title=Elements of Information Theory|url=https://archive.org/details/elementsofinform0000cove|url-access=registration}}</ref>{{rp|15}} | | Given [[Discrete random variable|discrete random variables]] <math>X</math> with image <math>\mathcal X</math> and <math>Y</math> with image <math>\mathcal Y</math>, the conditional entropy of <math>Y</math> given <math>X</math> is defined as the weighted sum of <math>H(Y|X=x)</math> for each possible value of <math>x</math>, using <math>p(x)</math> as the weights:<ref name=cover1991>{{cite book|isbn=0-471-06259-6|year=1991|authorlink1=Thomas M. Cover|author1=T. Cover|author2=J. Thomas|title=Elements of Information Theory|url=https://archive.org/details/elementsofinform0000cove|url-access=registration}}</ref>{{rp|15}} |
| | | |
− | 给定具有像<math>\mathcal X</math>的离散随机变量<math>X</math>和具有像<math>\mathcal Y</math>的离散随机变量<math>Y</math>,将给定<math>X</math>的<math>Y</math>的条件熵定义为<math>H(Y|X=x)</math>的权重之和,以<math>x</math>的每个可能值为准,并使用<math>p(x)</math>作为权重,其表达式如下:<ref name=cover1991>{{cite book|isbn=0-471-06259-6|year=1991|authorlink1=Thomas M. Cover|author1=T. Cover|author2=J. Thomas|title=Elements of Information Theory|url=https://archive.org/details/elementsofinform0000cove|url-access=registration}}</ref>{{rp|15}} | + | 给定具有像<math>\mathcal X</math>的离散随机变量<math>X</math>和具有像<math>\mathcal Y</math>的离散随机变量<math>Y</math>,将给定<math>X</math>的<math>Y</math>的条件熵定义为以<math>p(x)</math>作为权重,对<math>x</math>的每个可能取值得到的<math>H(Y|X=x)</math>的加权和。其表达式如下:<ref name=cover1991>{{cite book|isbn=0-471-06259-6|year=1991|authorlink1=Thomas M. Cover|author1=T. Cover|author2=J. Thomas|title=Elements of Information Theory|url=https://archive.org/details/elementsofinform0000cove|url-access=registration}}</ref>{{rp|15}} |
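A sketch (again with an invented joint pmf) checking that the weighted-sum definition <math>\sum_x p(x)\,H(Y|X=x)</math> agrees with the equivalent double-sum form <math>-\sum_{x,y} p(x,y)\log_2\frac{p(x,y)}{p(x)}</math>:

<syntaxhighlight lang="python">
import math

p_xy = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.40}  # invented example
xs = {x for x, _ in p_xy}
p_x = {x: sum(v for (xx, _), v in p_xy.items() if xx == x) for x in xs}

def h_y_given_x_eq(x):
    """H(Y | X = x) in bits."""
    cond = [v / p_x[x] for (xx, _), v in p_xy.items() if xx == x]
    return -sum(q * math.log2(q) for q in cond if q > 0)

# Weighted sum over x with weights p(x)
h1 = sum(p_x[x] * h_y_given_x_eq(x) for x in xs)
# Equivalent joint form: -sum_{x,y} p(x,y) log2( p(x,y) / p(x) )
h2 = -sum(v * math.log2(v / p_x[x]) for (x, _), v in p_xy.items())

print(h1, h2)  # both ~0.8464 bits
</syntaxhighlight>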
| | | |
| | | |
第97行: |
第97行: |
| <math>H(Y|X)=0</math> if and only if the value of <math>Y</math> is completely determined by the value of <math>X</math>. | | <math>H(Y|X)=0</math> if and only if the value of <math>Y</math> is completely determined by the value of <math>X</math>. |
| | | |
− | 当且仅当<math>Y</math>的值完全由<math>X</math>的值确定时,才为<math>H(Y|X)=0</math>。 | + | 当且仅当<math>Y</math>的值完全由<math>X</math>的值确定时,<math>H(Y|X)=0</math>。 |
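A quick check of this property on a toy deterministic relation Y = X mod 2 (my own example, not from the article):

<syntaxhighlight lang="python">
import math

# X uniform on {0,1,2,3}; Y = X mod 2 is fully determined by X
p_xy = {(x, x % 2): 0.25 for x in range(4)}
p_x = {x: 0.25 for x in range(4)}

h = sum(-v * math.log2(v / p_x[x]) for (x, _), v in p_xy.items())
print(h)  # 0.0: knowing X leaves no uncertainty about Y
</syntaxhighlight>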
| | | |
| | | |
第104行: |
第104行: |
| Conversely, <math>H(Y|X) = H(Y)</math> if and only if <math>Y</math> and <math>X</math> are [[independent random variables]]. | | Conversely, <math>H(Y|X) = H(Y)</math> if and only if <math>Y</math> and <math>X</math> are [[independent random variables]]. |
| | | |
− | 相反,当且仅当<math>Y</math>和<math>X</math>是独立随机变量时,则为<math>H(Y|X) =H(Y)</math>。 | + | 相反,当且仅当<math>Y</math>和<math>X</math>是互相独立的随机变量时,则<math>H(Y|X) =H(Y)</math>。 |
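And a check of the converse direction on a toy product distribution, where X and Y are independent by construction:

<syntaxhighlight lang="python">
import math

p_x = {0: 0.3, 1: 0.7}
p_y = {0: 0.6, 1: 0.4}
p_xy = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}  # independence

h_y = -sum(p * math.log2(p) for p in p_y.values())
h_y_given_x = sum(-v * math.log2(v / p_x[x]) for (x, _), v in p_xy.items())
print(h_y, h_y_given_x)  # both ~0.9710 bits
</syntaxhighlight>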
| | | |
| | | |
第111行: |
第111行: |
| Assume that the combined system determined by two random variables <math>X</math> and <math>Y</math> has [[joint entropy]] <math>H(X,Y)</math>, that is, we need <math>H(X,Y)</math> bits of information on average to describe its exact state. Now if we first learn the value of <math>X</math>, we have gained <math>H(X)</math> bits of information. Once <math>X</math> is known, we only need <math>H(X,Y)-H(X)</math> bits to describe the state of the whole system. This quantity is exactly <math>H(Y|X)</math>, which gives the ''chain rule'' of conditional entropy: | | Assume that the combined system determined by two random variables <math>X</math> and <math>Y</math> has [[joint entropy]] <math>H(X,Y)</math>, that is, we need <math>H(X,Y)</math> bits of information on average to describe its exact state. Now if we first learn the value of <math>X</math>, we have gained <math>H(X)</math> bits of information. Once <math>X</math> is known, we only need <math>H(X,Y)-H(X)</math> bits to describe the state of the whole system. This quantity is exactly <math>H(Y|X)</math>, which gives the ''chain rule'' of conditional entropy: |
| | | |
− | 假设由两个随机变量<math>X</math>和<math>Y</math>确定的组合系统具有联合熵<math>H(X,Y)</math>,也就是说,我们通常需要<math>H(X,Y)</math>位信息来描述其确切状态。现在,如果我们首先获得<math>X</math>的值,我们将知晓<math>H(X)</math>位信息。一旦知道了<math>X</math>的值,我们就可以通过<math>H(X,Y)</math>-<math>H(X)</math>位来描述整个系统的状态。这个数量恰好是<math>H(Y|X)</math>,它给出了条件熵的链式法则: | + | 假设由两个随机变量<math>X</math>和<math>Y</math>确定的组合系统具有联合熵<math>H(X,Y)</math>,也就是说,我们平均需要<math>H(X,Y)</math>位信息来描述其确切状态。现在,如果我们首先得知<math>X</math>的值,就获得了<math>H(X)</math>位信息。一旦<math>X</math>的值确定,我们只需要<math>H(X,Y)-H(X)</math>位就能描述整个系统的状态。这个数量恰好是<math>H(Y|X)</math>,它给出了条件熵的链式法则: |
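A numerical check of the chain rule <math>H(Y|X)=H(X,Y)-H(X)</math> on an invented joint pmf:

<syntaxhighlight lang="python">
import math

p_xy = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.40}  # invented example
p_x = {0: 0.50, 1: 0.50}

h_joint = -sum(v * math.log2(v) for v in p_xy.values())
h_x = -sum(p * math.log2(p) for p in p_x.values())
h_y_given_x = sum(-v * math.log2(v / p_x[x]) for (x, _), v in p_xy.items())

print(h_joint - h_x, h_y_given_x)  # the two numbers coincide (~0.8464 bits)
</syntaxhighlight>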
| | | |
| | | |
第185行: |
第185行: |
| where <math>\operatorname{I}(X;Y)</math> is the [[mutual information]] between <math>X</math> and <math>Y</math>. | | where <math>\operatorname{I}(X;Y)</math> is the [[mutual information]] between <math>X</math> and <math>Y</math>. |
| | | |
− | 其中<math>\operatorname{I}(X;Y)</math>是<math>X</math>和<math>Y</math>之间的相互信息。 | + | 其中<math>\operatorname{I}(X;Y)</math>是<math>X</math>和<math>Y</math>之间的<font color="#ff8000"> 互信息</font>。 |
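The standard identity relating the two quantities, <math>H(Y|X)=H(Y)-\operatorname{I}(X;Y)</math>, can be verified numerically (invented example distribution, my own sketch):

<syntaxhighlight lang="python">
import math

p_xy = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.40}  # invented example
p_x = {0: 0.50, 1: 0.50}
p_y = {0: 0.40, 1: 0.60}

# Mutual information I(X;Y) = sum p(x,y) log2( p(x,y) / (p(x) p(y)) )
mi = sum(v * math.log2(v / (p_x[x] * p_y[y])) for (x, y), v in p_xy.items())
h_y = -sum(p * math.log2(p) for p in p_y.values())
h_y_given_x = sum(-v * math.log2(v / p_x[x]) for (x, _), v in p_xy.items())

print(h_y - mi, h_y_given_x)  # H(Y) - I(X;Y) equals H(Y|X)
</syntaxhighlight>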
| | | |
| | | |
第199行: |
第199行: |
| Although the specific-conditional entropy <math>H(X|Y=y)</math> can be either less or greater than <math>H(X)</math> for a given [[random variate]] <math>y</math> of <math>Y</math>, <math>H(X|Y)</math> can never exceed <math>H(X)</math>. | | Although the specific-conditional entropy <math>H(X|Y=y)</math> can be either less or greater than <math>H(X)</math> for a given [[random variate]] <math>y</math> of <math>Y</math>, <math>H(X|Y)</math> can never exceed <math>H(X)</math>. |
| | | |
− | 尽管对于给定的<math>Y</math>随机变量<math>y</math>,特定条件熵<math>H(X|Y=y)</math>可以小于或大于<math>H(X)</math>,但<math>H(X|Y)</math>永远不会超过<math>H(X)</math>。
| + | 对于给定随机变量<math>Y</math>的值<math>y</math>,尽管特定条件熵<math>H(X|Y=y)</math>可以小于或大于<math>H(X)</math>,但<math>H(X|Y)</math>永远不会超过<math>H(X)</math>。 |
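An invented example illustrating the distinction: observing one particular value y can increase the uncertainty about X, yet the average H(X|Y) still does not exceed H(X).

<syntaxhighlight lang="python">
import math

def h(probs):
    """Entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Invented joint pmf: X is heavily skewed a priori, but Y=1 makes X uniform
p_xy = {(0, 0): 0.05, (1, 0): 0.85, (0, 1): 0.05, (1, 1): 0.05}
p_x = {0: 0.10, 1: 0.90}
p_y = {0: 0.90, 1: 0.10}

h_x = h(p_x.values())                                       # ~0.469 bits
h_x_given_y1 = h([p_xy[(x, 1)] / p_y[1] for x in (0, 1)])   # 1.0 bit  > H(X)
h_x_given_y0 = h([p_xy[(x, 0)] / p_y[0] for x in (0, 1)])   # ~0.310 bits
h_x_given_y = p_y[0] * h_x_given_y0 + p_y[1] * h_x_given_y1 # ~0.379 bits < H(X)

print(h_x, h_x_given_y1, h_x_given_y)
</syntaxhighlight>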
| | | |
| | | |
第237行: |
第237行: |
| Notice however that this rule may not be true if the involved differential entropies do not exist or are infinite. | | Notice however that this rule may not be true if the involved differential entropies do not exist or are infinite. |
| | | |
− | 但是请注意,如果所涉及的微分熵不存在或无限,则此规则可能不成立。
| + | 但是请注意,如果所涉及的微分熵不存在或无限,则此法则可能不成立。 |
| | | |
| | | |
第243行: |
第243行: |
| Joint differential entropy is also used in the definition of the [[mutual information]] between continuous random variables: | | Joint differential entropy is also used in the definition of the [[mutual information]] between continuous random variables: |
| | | |
− | 联合微分熵也用于定义连续随机变量之间的交互信息:
| + | 联合微分熵也用于定义连续随机变量之间的互信息: |
| | | |
| | | |
第251行: |
第251行: |
| <math>h(X|Y) \le h(X)</math> with equality if and only if <math>X</math> and <math>Y</math> are independent.<ref name=cover1991 />{{rp|253}} | | <math>h(X|Y) \le h(X)</math> with equality if and only if <math>X</math> and <math>Y</math> are independent.<ref name=cover1991 />{{rp|253}} |
| | | |
− | 当且仅当X和Y是独立的时,<math>h(X|Y) \le h(X)</math>才相等。
| + | <math>h(X|Y) \le h(X)</math>,当且仅当<math>X</math>和<math>Y</math>相互独立时等号成立。 |
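A closed-form illustration for jointly Gaussian variables (my own example, using the standard Gaussian differential-entropy formula): conditioning on Y shrinks the conditional variance by a factor (1 − ρ²), so h(X|Y) = h(X) + ½log₂(1 − ρ²) ≤ h(X), with equality exactly when ρ = 0, which for Gaussians means independence.

<syntaxhighlight lang="python">
import math

def gaussian_h(var):
    """Differential entropy in bits of a Gaussian with variance var."""
    return 0.5 * math.log2(2 * math.pi * math.e * var)

sigma_x2 = 2.0  # assumed variance of X (arbitrary choice)
for rho in (0.0, 0.5, 0.9):
    h_x = gaussian_h(sigma_x2)
    h_x_given_y = gaussian_h(sigma_x2 * (1 - rho ** 2))  # conditional variance
    print(rho, round(h_x, 4), round(h_x_given_y, 4))     # h(X|Y) <= h(X)
</syntaxhighlight>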
| | | |
| | | |
| | | |
− | === Relation to estimator error 与预估误差的关系 === | + | === Relation to estimator error 与估计量误差的关系 === |
| The conditional differential entropy yields a lower bound on the expected squared error of an [[estimator]]. For any random variable <math>X</math>, observation <math>Y</math> and estimator <math>\widehat{X}</math> the following holds:<ref name=cover1991 />{{rp|255}} | | The conditional differential entropy yields a lower bound on the expected squared error of an [[estimator]]. For any random variable <math>X</math>, observation <math>Y</math> and estimator <math>\widehat{X}</math> the following holds:<ref name=cover1991 />{{rp|255}} |
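A sketch of the bound in the Cover and Thomas form, E[(X − X̂(Y))²] ≥ e^{2h(X|Y)}/(2πe) with h measured in nats, on an assumed Gaussian toy model where the MMSE estimator attains it with equality (the model and numbers are my own illustration):

<syntaxhighlight lang="python">
import math

# Assumed toy model: X ~ N(0, s2x), Y = X + N(0, s2n) with independent noise
s2x, s2n = 2.0, 0.5

# MMSE estimator Xhat = E[X|Y] has error variance s2x*s2n / (s2x + s2n)
mmse = s2x * s2n / (s2x + s2n)

# Conditional differential entropy in nats: h(X|Y) = 0.5 * ln(2*pi*e*mmse)
h_x_given_y = 0.5 * math.log(2 * math.pi * math.e * mmse)

bound = math.exp(2 * h_x_given_y) / (2 * math.pi * math.e)
print(mmse, bound)  # equal in the Gaussian case; in general mmse >= bound
</syntaxhighlight>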
| | | |
第287行: |
第287行: |
| | | |
| * '''<font color="#ff8000"> 熵(信息论)Entropy (information theory)</font>''' | | * '''<font color="#ff8000"> 熵(信息论)Entropy (information theory)</font>''' |
− | * '''<font color="#ff8000"> 交互信息Mutual information</font>''' | + | * '''<font color="#ff8000"> 互信息Mutual information</font>''' |
| * '''<font color="#ff8000"> 条件量子熵Conditional quantum entropy</font>''' | | * '''<font color="#ff8000"> 条件量子熵Conditional quantum entropy</font>''' |
− | * '''<font color="#ff8000"> 信息变差Variation of information</font>''' | + | * '''<font color="#ff8000"> 信息差异Variation of information</font>''' |
| * '''<font color="#ff8000"> 熵幂不等式Entropy power inequality</font>''' | | * '''<font color="#ff8000"> 熵幂不等式Entropy power inequality</font>''' |
| * '''<font color="#ff8000"> 似然函数Likelihood function</font>''' | | * '''<font color="#ff8000"> 似然函数Likelihood function</font>''' |