Changes

543 bytes added, Tuesday, 10 September 2024
No edit summary
Line 21: Line 21:
Formally, EI is a function of the causal mechanism (in a discrete-state Markov chain, this is the probability transition matrix of the Markov chain) and is independent of other factors. The formal definition of EI is:
 
Formally, EI is a function of the causal mechanism (in a discrete-state Markov chain, this is the probability transition matrix of the Markov chain) and is independent of other factors. The formal definition of EI is:
   −
EI(P)≡I(Y;X∣do(X∼U))
+
<math>
 +
EI(P)\equiv I(Y;X|do(X\sim U))
 +
</math>
   −
where P represents the causal mechanism from X to Y, which is a probability transition matrix, [math]p_{ij} \equiv Pr(Y=j|X=i)[/math]; X is the cause variable, Y is the effect variable, and [math]do(X \sim U)[/math] denotes the intervention on X, changing its distribution to a uniform one. Under this intervention, and assuming the causal mechanism P remains unchanged, Y will be indirectly affected by the intervention on X. EI measures the mutual information between X and Y after this intervention.
+
where P represents the causal mechanism from X to Y, which is a probability transition matrix, [math]p_{ij}\equiv Pr(Y=j|X=i)[/math]; X is the cause variable, Y is the effect variable, and [math]do(X\sim U)[/math] denotes the intervention on X, changing its distribution to a uniform one. Under this intervention, and assuming the causal mechanism P remains unchanged, Y will be indirectly affected by the intervention on X. EI measures the mutual information between X and Y after this intervention.
    
The "do" operator is introduced to eliminate the influence of X's distribution on EI, ensuring that the final EI metric is a function of the causal mechanism P alone and is independent of X's distribution.
 
The "do" operator is introduced to eliminate the influence of X's distribution on EI, ensuring that the final EI metric is a function of the causal mechanism P alone and is independent of X's distribution.
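
The definition can be illustrated with a minimal Python sketch. It assumes a row-stochastic matrix whose rows index the cause X and whose columns index the effect Y; the function name `effective_information` and the two test matrices are illustrative choices, not part of the original text. Note that the function takes only P as input, which mirrors the claim that EI does not depend on the observed distribution of X.

```python
import numpy as np

def effective_information(P):
    """EI in bits of a row-stochastic matrix P, where P[i, j] = Pr(Y=j | X=i)."""
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    # do(X ~ U): the intervention forces a uniform distribution on the cause.
    p_x = np.full(n, 1.0 / n)
    joint = p_x[:, None] * P                 # p(x, y) = p(x) p(y|x)
    p_y = joint.sum(axis=0)                  # effect distribution after the intervention
    product = p_x[:, None] * p_y[None, :]    # p(x) p(y)
    mask = joint > 0                         # convention: 0 * log 0 = 0
    # Mutual information I(X; Y) under the intervened distribution.
    return float(np.sum(joint[mask] * np.log2(joint[mask] / product[mask])))

# Illustrative matrices (assumptions): a noiseless identity map and a fully random map.
print(effective_information(np.eye(4)))              # 2.0
print(effective_information(np.full((4, 4), 0.25)))  # 0.0
```

The identity map preserves all log2(4) = 2 bits about the cause, while the fully random map preserves none.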
Line 61: Line 63:
|[math]\begin{aligned}&EI(P_1)=2\ bits,\\&Det(P_1)=2\ bits,\\&Deg(P_1)=0\ bits\end{aligned}[/math]||[math]\begin{aligned}&EI(P_2)=0.81\ bits,\\&Det(P_2)=0.81\ bits,\\&Deg(P_2)=0\ bits\end{aligned}[/math]||[math]\begin{aligned}&EI(P_3)=0.81\ bits\\&Det(P_3)=2\ bits,\\&Deg(P_3)=1.19\ bits.\end{aligned}[/math]
 
|[math]\begin{aligned}&EI(P_1)=2\ bits,\\&Det(P_1)=2\ bits,\\&Deg(P_1)=0\ bits\end{aligned}[/math]||[math]\begin{aligned}&EI(P_2)=0.81\ bits,\\&Det(P_2)=0.81\ bits,\\&Deg(P_2)=0\ bits\end{aligned}[/math]||[math]\begin{aligned}&EI(P_3)=0.81\ bits\\&Det(P_3)=2\ bits,\\&Deg(P_3)=1.19\ bits.\end{aligned}[/math]
 
|}{{NumBlk|:||{{EquationRef|example}}}}
 
|}{{NumBlk|:||{{EquationRef|example}}}}
 +
       
As we can see, the EI of the first matrix [math]P_1[/math] is higher than that of the second, [math]P_2[/math], because [math]P_1[/math] is fully deterministic: starting from any given state, it transitions to a single state with 100% probability. However, not every deterministic matrix has high EI, as matrix [math]P_3[/math] shows. Although its transition probabilities are also either 100% or 0, its last three states all transition to the first state, so we cannot tell which state the system was in at the previous moment. Its EI is therefore low; this property is called degeneracy. Hence, a transition matrix with high determinism and low degeneracy has high EI. Additionally, EI can be decomposed as follows:
 
As we can see, the EI of the first matrix [math]P_1[/math] is higher than that of the second, [math]P_2[/math], because [math]P_1[/math] is fully deterministic: starting from any given state, it transitions to a single state with 100% probability. However, not every deterministic matrix has high EI, as matrix [math]P_3[/math] shows. Although its transition probabilities are also either 100% or 0, its last three states all transition to the first state, so we cannot tell which state the system was in at the previous moment. Its EI is therefore low; this property is called degeneracy. Hence, a transition matrix with high determinism and low degeneracy has high EI. Additionally, EI can be decomposed as follows:
   −
EI=Det−Deg
+
<math>
 +
EI=Det-Deg
 +
</math>
   −
where Det stands for Determinism, and Deg stands for Degeneracy. EI is the difference between the two. In the table, we also list the values of Det and Deg corresponding to the matrices.
+
where Det stands for determinism and Deg stands for degeneracy; EI is the difference between the two. The table also lists the Det and Deg values of each matrix.
    
The first transition probability matrix is a permutation matrix and is reversible; thus, it has the highest determinism, no degeneracy, and therefore the highest EI. In the second matrix, the first three states transition to one another with probability 1/3 each, resulting in the lowest determinism but no degeneracy, yielding an EI of 0.81 bits. The third matrix, despite having only 0-or-1 transition probabilities, has high degeneracy because its last three states all transition to the first state, so their previous state cannot be inferred. Thus, its EI equals that of the second matrix, 0.81 bits.
 
The first transition probability matrix is a permutation matrix and is reversible; thus, it has the highest determinism, no degeneracy, and therefore the highest EI. In the second matrix, the first three states transition to one another with probability 1/3 each, resulting in the lowest determinism but no degeneracy, yielding an EI of 0.81 bits. The third matrix, despite having only 0-or-1 transition probabilities, has high degeneracy because its last three states all transition to the first state, so their previous state cannot be inferred. Thus, its EI equals that of the second matrix, 0.81 bits.
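
The decomposition can be checked numerically with the sketch below. The three matrices are not reproduced in this excerpt, so `P1`, `P2` and `P3` are hypothetical 4-state matrices built only to match the verbal description (a permutation; three states mixing uniformly; three states collapsing onto the first). The formulas used, Det = log2 N minus the average row entropy and Deg = log2 N minus the entropy of the average row, are one common unnormalized form and are an assumption about the convention behind the table.

```python
import numpy as np

def _sum_plogp(p):
    """Sum of p * log2(p) over the entries of p, with 0 * log 0 taken as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz])))

def determinism(P):
    """log2(N) minus the average entropy of the rows of P, in bits."""
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    return float(np.log2(n) + _sum_plogp(P) / n)

def degeneracy(P):
    """log2(N) minus the entropy of the average row of P, in bits."""
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    return float(np.log2(n) + _sum_plogp(P.mean(axis=0)))

def ei(P):
    """EI obtained through the decomposition EI = Det - Deg."""
    return determinism(P) - degeneracy(P)

# Hypothetical matrices consistent with the description in the text.
P1 = np.roll(np.eye(4), 1, axis=1)                    # cyclic permutation
P2 = np.array([[1/3, 1/3, 1/3, 0]] * 3 + [[0, 0, 0, 1]])
P3 = np.array([[0, 1, 0, 0]] + [[1, 0, 0, 0]] * 3)    # last three states -> first state

for name, P in [("P1", P1), ("P2", P2), ("P3", P3)]:
    print(f"{name}: Det={determinism(P):.2f}  Deg={abs(degeneracy(P)):.2f}  EI={ei(P):.2f}")
```

Under these assumptions the sketch yields the same Det, Deg and EI values as the table, up to rounding.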
Line 607: Line 612:
</math>
 
</math>
   −
其中,[math]x,y\in \mathcal{R}[/math]都是一维实数变量。按照有效信息的定义,我们需要对变量x进行干预,使其满足在其定义域空间上服从均匀分布。如果x的定义域为一个固定的区间,如[a,b],其中a,b都是实数,那么x的概率密度函数就是[math]1/(b-a)[/math]。然而,当x的定义域为全体实数的时候,区间成为了无穷大,而x的概率密度函数就成为了无穷小。
+
Here, [math]x,y\in \mathcal{R}[/math] are both one-dimensional real variables. According to the definition of Effective Information (EI), we need to intervene on variable x so that it follows a uniform distribution over its domain. If the domain of x is a fixed interval, such as [a,b], where a and b are real numbers, the probability density function of x is [math]1/(b-a)[/math]. However, if the domain of x extends over the entire real line, the interval becomes infinite, making the probability density function of x infinitesimally small.
 
  −
为了解决这个问题,我们假设x的定义域不是整个实数空间,而是一个足够大的区域:[math][-L/2,L/2][/math],其中L为该区间的大小。这样,该区域上的均匀分布的密度函数为:[math]1/L[/math]。我们希望当[math]L\rightarrow +\infty[/math]的时候,EI能够收敛到一个有限的数。然而,实际的EI是一个和x定义域大小有关的量,所以EI是参数L的函数。这一点可以从EI的定义中看出:
     −
Here, x,y∈R are both one-dimensional real variables. According to the definition of Effective Information (EI), we need to intervene on variable x so that it follows a uniform distribution over its domain. If the domain of x is a fixed interval, such as [a,b], where a and b are real numbers, the probability density function of x is 1/(b−a). However, if the domain of x extends over the entire real line, the interval becomes infinite, making the probability density function of x infinitesimally small.{{NumBlk|:|
+
To address this issue, we assume that the domain of x is not the entire real line but rather a sufficiently large region: [math][-L/2,L/2][/math], where L represents the size of the interval. In this case, the uniform distribution's density function over this region is [math]1/L[/math]. We hope that as [math]L\rightarrow +\infty[/math], the EI will converge to a finite value. However, EI is actually a quantity dependent on the size of the domain of x, so EI is a function of the parameter L, which can be seen from the definition of EI:{{NumBlk|:|
 
<math>
 
<math>
 
\begin{aligned}
 
\begin{aligned}
Line 619: Line 622:
\end{aligned}
 
\end{aligned}
 
</math>
 
</math>
|{{EquationRef|4}}}}<nowiki>这里,[math]p(y|x)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(y-f(x))^2}{\sigma^2}\right)[/math]为给定x的条件下,y的条件概率密度函数。由于[math]\varepsilon[/math]服从均值为0,方差为[math]\sigma^2[/math]的正态分布,所以[math]y=f(x)+\varepsilon[/math]就服从均值为[math]f(x)[/math],方差为[math]\sigma^2[/math]的正态分布。</nowiki>
+
|{{EquationRef|4}}}}
 +
 
   −
y的积分区间为:[math]f([-\frac{L}{L},\frac{L}{2}])[/math],即将x的定义域[math][-\frac{L}{2},\frac{L}{2}][/math]经过f的映射,形成y上的区间范围。
+
<nowiki>Here, [math]p(y|x)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(y-f(x))^2}{2\sigma^2}\right)[/math] is the conditional probability density of y given x. Since [math]\varepsilon[/math] follows a normal distribution with mean 0 and variance [math]\sigma^2[/math], [math]y=f(x)+\varepsilon[/math] follows a normal distribution with mean [math]f(x)[/math] and variance [math]\sigma^2[/math].</nowiki>
 +
 
 +
The integration range of y is [math]f([-\frac{L}{2},\frac{L}{2}])[/math], i.e., the image of the domain [math][-\frac{L}{2},\frac{L}{2}][/math] of x under the function f.
    
<math>
 
<math>
Line 627: Line 633:
</math>
 
</math>
   −
为y的概率密度函数,它也可以由联合概率密度函数[math]p(x,y)=p(x)p(y|x)[/math]对x进行积分得到。为了后续叙述方便,我们将x重新命名为[math]x_0[/math],从而以区分出现在{{EquationNote|4}}中的其它x变量。
+
Here, p(y) is the marginal probability density function of y, which can also be obtained by integrating the joint probability density function [math]p(x,y)=p(x)p(y|x)[/math] over x. For convenience in what follows, we rename x as [math]x_0[/math] to distinguish it from the other x variables appearing in equation 4.
   −
由于L很大,所以区间[math][-\frac{L}{2},\frac{L}{2}][/math]很大,进而假设区间[math]f([-\frac{L}{L},\frac{L}{2}])[/math]也很大。这就使得,上述积分的积分上下界可以近似取到无穷大,也就有{{EquationNote|4}}中的第一项为:
+
Since L is large, the interval [math][-\frac{L}{2},\frac{L}{2}][/math] is large, and we further assume that the interval [math]f([-\frac{L}{2},\frac{L}{2}])[/math] is also large. This allows us to approximate the integration limits as infinite, which gives the first term in equation 4:
    
<math>
 
<math>
Line 642: Line 648:
然而,要计算第二项,即使使用了积分区间为无穷大这个条件,仍然很难计算得出结果,为此,我们对函数[math]f(x_0)[/math]进行一阶泰勒展开:
 
然而,要计算第二项,即使使用了积分区间为无穷大这个条件,仍然很难计算得出结果,为此,我们对函数[math]f(x_0)[/math]进行一阶泰勒展开:
 +
 +
Here, e is the base of the natural logarithm, and the last equality is derived using the Shannon entropy formula for a Gaussian distribution.
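
As a quick numerical check of the Gaussian entropy formula mentioned here, the following sketch evaluates the integral of p ln p for a normal density with variance [math]\sigma^2[/math] and compares it with [math]-\ln(\sigma\sqrt{2\pi e})[/math]. It assumes natural logarithms; the value of `sigma`, the grid size and the function name are arbitrary illustrative choices.

```python
import numpy as np

def gaussian_plogp_integral(sigma, half_width=12.0, n=200001):
    """Numerically evaluate the integral of p(y) * ln p(y) for p = N(0, sigma^2)."""
    y = np.linspace(-half_width * sigma, half_width * sigma, n)
    dy = y[1] - y[0]
    # Work with ln p directly so the far tails never produce log(0).
    log_p = -0.5 * (y / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    return float(np.sum(np.exp(log_p) * log_p) * dy)

sigma = 0.3  # illustrative noise level (assumption)
numeric = gaussian_plogp_integral(sigma)
closed_form = -np.log(sigma * np.sqrt(2 * np.pi * np.e))
print(numeric, closed_form)  # the two values agree to many decimal places
```

Because the mean of the Gaussian does not affect its entropy, the same value is obtained for every x, which is why the outer average over x leaves the result unchanged.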
 +
 +
However, calculating the second term remains challenging, even with the assumption of infinite integration limits. Therefore, we perform a first-order Taylor expansion on the function [math]f(x_0)[/math]:
    
<math>
 
<math>
Line 647: Line 657:
</math>
 
</math>
   −
这里,[math]x\in[-\frac{L}{2},\frac{L}{2}][/math]是x定义域上的任意一点。
+
Here, [math]x\in[-\frac{L}{2},\frac{L}{2}][/math] is any point within the domain of x.
   −
因此,p(y)可以被近似计算:
+
Thus, p(y) can be approximated as:
    
<math>
 
<math>
Line 658: Line 668:
这样,{{EquationNote|4}}中的第二项近似为:
 
这样,{{EquationNote|4}}中的第二项近似为:
 +
 +
Note that in this step we not only approximate [math]f(x_0)[/math] as a linear function but also introduce the assumption that the resulting p(y) is independent of y and depends on [math]x[/math]. Since the second term of the EI calculation includes an integration over x, this approximation means that the approximate value of p(y) differs from one value of x to another.
 +
 +
Thus, the second term in equation 4 can be approximated as:
    
<math>
 
<math>
Line 665: Line 679:
</math>
 
</math>
   −
最终的EI可以由下式近似计算:
+
The final EI can be approximately computed using the following equation:
    
<math>
 
<math>
Line 678: Line 692:
</math>
 
</math>
   −
其中<math>\epsilon</math>和<math>\delta</math>分别表示观测噪音和干预噪音的大小。-->与上述推导类似的推导首见于Hoel2013的文章中<ref name="hoel_2013">{{cite journal|last1=Hoel|first1=Erik P.|last2=Albantakis|first2=L.|last3=Tononi|first3=G.|title=Quantifying causal emergence shows that macro can beat micro|journal=Proceedings of the National Academy of Sciences|volume=110|issue=49|page=19790–19795|year=2013|url=https://doi.org/10.1073/pnas.1314922110}}</ref>,并在[[神经信息压缩器]]一文中<ref name="zhang_nis">{{cite journal|title=Neural Information Squeezer for Causal Emergence|first1=Jiang|last1=Zhang|first2=Kaiwei|last2=Liu|journal=Entropy|year=2022|volume=25|issue=1|page=26|url=https://api.semanticscholar.org/CorpusID:246275672}}</ref>中进行了详细讨论。
+
其中<math>\epsilon</math>和<math>\delta</math>分别表示观测噪音和干预噪音的大小。-->A derivation of this kind first appeared in Hoel et al.'s 2013 paper [1] and is discussed in detail in the "Neural Information Squeezer" paper [2].
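
For intuition about how EI depends on L, the sketch below computes the mutual information numerically in the simplest setting: a linear map y = a x + ε with x uniform on [math][-L/2,L/2][/math]. The linear form, the parameter values and the function name are assumptions made for illustration. The closed form it is compared with, [math]\ln\frac{|a|L}{\sigma\sqrt{2\pi e}}[/math] in nats, is the standard large-L approximation for this linear case and is not quoted from the equations of this article.

```python
import math
import numpy as np

def ei_linear_gaussian(a, sigma, L, n=200001):
    """Numerical I(X;Y) in nats for X ~ U[-L/2, L/2], Y = a*X + N(0, sigma^2), a != 0."""
    half = abs(a) * L / 2                       # a*X is uniform on [-half, half]
    y = np.linspace(-half - 10 * sigma, half + 10 * sigma, n)
    dy = y[1] - y[0]
    erf = np.vectorize(math.erf)
    cdf = lambda t: 0.5 * (1.0 + erf(t / math.sqrt(2)))   # standard normal CDF
    # Marginal density of Y: a uniform density convolved with a Gaussian.
    p_y = (cdf((y + half) / sigma) - cdf((y - half) / sigma)) / (2 * half)
    nz = p_y > 0
    h_y = -float(np.sum(p_y[nz] * np.log(p_y[nz])) * dy)           # H(Y)
    h_y_given_x = math.log(sigma * math.sqrt(2 * math.pi * math.e))  # H(Y|X)
    return h_y - h_y_given_x

a, sigma, L = 2.0, 0.1, 10.0  # illustrative values (assumptions)
numeric = ei_linear_gaussian(a, sigma, L)
approx = math.log(abs(a) * L / (sigma * math.sqrt(2 * math.pi * math.e)))
print(numeric, approx)  # close when sigma is much smaller than |a| * L
```

Doubling L adds about ln 2 nats to both values, which shows concretely that EI grows without bound as the domain of x grows.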
 
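===High-dimensional case===
 
===High-dimensional case===
 
We can generalize the above EI calculation for one-dimensional variables to the more general n-dimensional case, namely:{{NumBlk|:|
 
We can generalize the above EI calculation for one-dimensional variables to the more general n-dimensional case, namely:{{NumBlk|:|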
Line 709: Line 723:
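It appears that the normalization problem for continuous-variable systems cannot be solved by simply carrying over the results for discrete variables.
 
It appears that the normalization problem for continuous-variable systems cannot be solved by simply carrying over the results for discrete variables.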
   −
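When the [[神经信息压缩器|Neural Information Squeezer]] (NIS) framework was proposed<ref name="zhang_nis" />, its authors devised another way to normalize the effective information of continuous variables: dividing EI by the dimension of the state space, which makes EI comparable across continuous state variables. This measure is called '''Dimension Averaged Effective Information''' (dEI). It is defined as:
+
When the [[神经信息压缩器|Neural Information Squeezer]] (NIS) framework was proposed<ref name="zhang_nis">{{cite journal|title=Neural Information Squeezer for Causal Emergence|first1=Jiang|last1=Zhang|first2=Kaiwei|last2=Liu|journal=Entropy|year=2022|volume=25|issue=1|page=26|url=https://api.semanticscholar.org/CorpusID:246275672}}</ref>, its authors devised another way to normalize the effective information of continuous variables: dividing EI by the dimension of the state space, which makes EI comparable across continuous state variables. This measure is called '''Dimension Averaged Effective Information''' (dEI). It is defined as: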
    
<math>
 
<math>
Line 969: Line 983:
\log_2\frac{suff}{1 - nec'}
 
\log_2\frac{suff}{1 - nec'}
 
</math>
 
</math>
|<ref name="hoel_2013" />
+
|<ref name="hoel_2013">{{cite journal|last1=Hoel|first1=Erik P.|last2=Albantakis|first2=L.|last3=Tononi|first3=G.|title=Quantifying causal emergence shows that macro can beat micro|journal=Proceedings of the National Academy of Sciences|volume=110|issue=49|page=19790–19795|year=2013|url=https://doi.org/10.1073/pnas.1314922110}}</ref>
 
|-
 
|-
 
|Galton measure
 
|Galton measure