Changes

Causal Emergence (view source)

Revision as of 06:35, 30 October 2024 (Wed)

27,752 bytes added, 30 October 2024 (Wednesday)

→‎Comparison of Several Causal Emergence Theories

Line 255: Line 255:

|Dynamic independence[40]||Granger causality||Requires specifying a coarse-graining method||Arbitrary dynamics||Dynamic independence: transfer entropy

|}

+

==Identification of Causal Emergence==

+

Earlier sections introduced work that quantifies emergence through causal measures and other information-theoretic indicators. In practical applications, however, we can often only collect observational data and cannot access the true dynamics of the system. Identifying whether causal emergence has occurred in a system from observable data is therefore a more important problem. The following introduces two families of identification methods: the approximate methods based on Rosas's causal emergence theory (one based on mutual information approximation and one based on machine learning), and the neural information compression methods (NIS and NIS+) proposed by Chinese scholars.

+

====Approximate Method Based on Rosas Causal Emergence Theory====

+

Rosas's causal emergence theory includes a quantification method based on synergistic information and a quantification method based on unique information. The second method bypasses the combinatorial explosion that arises with many variables, but it depends on the coarse-graining method and on the choice of the macroscopic state variable <math>V</math>. To address this, the authors give two solutions. One is for the researcher to specify the macroscopic state <math>V</math>, and the other is a machine learning-based method that lets the system learn the macroscopic state variable <math>V</math> automatically by maximizing <math>\mathrm{\Psi}</math>. We introduce these two methods in turn:

+

=====Method Based on Mutual Information Approximation=====

+

Although Rosas's causal emergence theory has given a strict definition of causal emergence, it involves the combinatorial explosion problem of many variables in the calculation, so it is difficult to apply this method to actual systems. To solve this problem, Rosas et al. bypassed the exact calculation of unique information and synergistic information [37] and proposed an approximate formula that only needs to calculate mutual information, and derived a sufficient condition for determining the occurrence of causal emergence.

+

The authors proposed three new indicators based on mutual information, <math>\mathrm{\Psi}</math>, <math>\mathrm{\Delta}</math> and <math>\mathrm{\Gamma}</math>, which can be used to identify causal emergence, downward causation and causal decoupling in the system respectively. The three indicators are calculated as follows:

+

* Indicator for judging causal emergence:

+

{{NumBlk|:|

+

<math>\Psi_{t, t + 1}(V):=I\left(V_t ; V_{t + 1}\right)-\sum_j I\left(X_t^j ; V_{t + 1}\right)</math>

+

|{{EquationRef|1}}}}

+

Here <math>X_t^j</math> is the <math>j</math>-th dimension of the microscopic variable at time <math>t</math>, and <math>V_t</math> and <math>V_{t + 1}</math> are the macroscopic state variables at two consecutive times. Rosas et al. state that when <math>\mathrm{\Psi}>0</math>, emergence occurs in the system; but when <math>\mathrm{\Psi}\leq 0</math>, we cannot determine whether <math>V</math> is emergent, because this condition is only a sufficient condition for the occurrence of causal emergence.

+

* Indicator for judging downward causation:

+

<math>\Delta_{t, t + 1}(V):=\max _j\left(I\left(V_t ; X_{t + 1}^j\right)-\sum_i I\left(X_t^i ; X_{t + 1}^j\right)\right)</math>

+

When <math>\mathrm{\Delta}>0</math>, there is downward causation from macroscopic state <math>V</math> to microscopic variable <math>X</math>.

+

* Indicator for judging causal decoupling:

+

<math>\Gamma_{t, t + 1}(V):=\max _j I\left(V_t ; X_{t + 1}^j\right)</math>

+

When <math>\mathrm{\Psi}>0</math> and <math>\mathrm{\Gamma}=0</math>, causal emergence occurs in the system and there is causal decoupling.
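The three indicators above can be estimated from discrete time series with a simple plug-in (frequency-count) estimate of mutual information. The following is a minimal sketch under that assumption, not the estimator used by Rosas et al.; the function names `mutual_information` and `emergence_indicators` and the two-variable example system are hypothetical and introduced only for illustration.

```python
import numpy as np

def mutual_information(a, b):
    """Plug-in estimate of I(A;B) in bits for two discrete 1-D sample arrays."""
    a = np.unique(np.asarray(a), return_inverse=True)[1].ravel()
    b = np.unique(np.asarray(b), return_inverse=True)[1].ravel()
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1.0)            # contingency table of counts
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)    # marginal of A, shape (na, 1)
    pb = joint.sum(axis=0, keepdims=True)    # marginal of B, shape (1, nb)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (pa @ pb)[nz])))

def emergence_indicators(X, V):
    """X: (T, n) discrete micro series, V: (T,) discrete macro series.
    Returns (Psi, Delta, Gamma) following the three formulas above."""
    X, V = np.asarray(X), np.asarray(V)
    Xt, Xn, Vt, Vn = X[:-1], X[1:], V[:-1], V[1:]
    n = X.shape[1]
    psi = mutual_information(Vt, Vn) - sum(
        mutual_information(Xt[:, j], Vn) for j in range(n))
    delta = max(
        mutual_information(Vt, Xn[:, j])
        - sum(mutual_information(Xt[:, i], Xn[:, j]) for i in range(n))
        for j in range(n))
    gamma = max(mutual_information(Vt, Xn[:, j]) for j in range(n))
    return psi, delta, gamma

# Hypothetical example: X^1 keeps its value with probability 0.9, X^2 is pure
# noise, and the "macro" variable is simply V = X^1 (no real coarse-graining).
rng = np.random.default_rng(0)
T = 20000
flips = rng.random(T - 1) >= 0.9
x1 = np.concatenate([[0], np.cumsum(flips) % 2])
x2 = rng.integers(0, 2, size=T)
X = np.column_stack([x1, x2])
psi, delta, gamma = emergence_indicators(X, V=x1)
print(f"Psi={psi:.4f}  Delta={delta:.4f}  Gamma={gamma:.4f}")
```

In this example the macroscopic variable carries nothing beyond a single microscopic variable, so the estimate of <math>\mathrm{\Psi}</math> is not positive and the sufficient condition gives no evidence of emergence, while <math>\mathrm{\Gamma}</math> is clearly positive because <math>V_t</math> predicts <math>X_{t+1}^1</math>.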

+

We can use <math>\mathrm{\Psi}</math> to identify the occurrence of causal emergence because <math>\mathrm{\Psi}</math> is a lower bound of the unique information. We have the following relationship:

+

<math>Un(V_t;X_{t + 1}|X_t)\geq I\left(V_t ; V_{t + 1}\right)-\sum_j I\left(X_t^j ; V_{t + 1}\right)+Red(V_t, V_{t + 1};X_t)</math>

+

Since <math>Red(V_t, V_{t + 1};X_t)</math> is non-negative, <math>\Psi_{t, t + 1}(V)>0</math> implies that the unique information is positive. This gives a sufficient but not necessary condition for causal emergence: <math>\Psi_{t, t + 1}(V)>0</math>.

+

In summary, this method is relatively convenient to calculate because it is based on mutual information, and it makes no Markov assumption about the dynamics of the system. However, it also has several shortcomings: 1) the three indicators <math>\mathrm{\Psi}</math>, <math>\mathrm{\Delta}</math> and <math>\mathrm{\Gamma}</math> are computed only from mutual information and do not consider causality; 2) the method only yields a sufficient condition for the occurrence of causal emergence; 3) the method depends on the choice of macroscopic variable, and different choices can lead to significantly different results; 4) when the system has a large amount of redundant information or many variables, the computational complexity becomes very high. In addition, since <math>\Psi</math> is an approximation, its error can be very large in high-dimensional systems, and it easily takes negative values, in which case it is impossible to judge whether causal emergence is present.

+

To verify that the information related to macaque movement is an emergent feature of its cortical activity, Rosas et al. conducted the following experiment. They used the electrocorticogram (ECoG) of macaques as the observational data of the microscopic dynamics. To obtain the macroscopic state variable <math>V</math>, they chose the time series of the macaques' limb movement trajectories obtained by motion capture (MoCap); the ECoG and MoCap data consist of 64 channels and 3 channels respectively. Since the original MoCap data does not satisfy the conditional independence assumption required of a supervenient feature, they used partial least squares and support vector machine algorithms to infer the part of the neural activity encoded in the ECoG signal that is relevant to predicting macaque behavior, and hypothesized that this information is an emergent feature of the underlying neural activity. Finally, based on the microscopic state and the computed macroscopic features, the authors verified the existence of causal emergence.

+

=====Machine Learning-based Method=====

+

Kaplanis et al. [26], building on representation learning, use an algorithm that learns the macroscopic state variable <math>V</math> automatically by maximizing <math>\mathrm{\Psi}</math> (i.e., Equation {{EquationNote|1}}). Specifically, the authors use a neural network <math>f_{\theta}</math> to learn the representation function that coarse-grains the microscopic input <math>X_t</math> into the macroscopic output <math>V_t</math>, and at the same time use neural networks <math>g_{\phi}</math> and <math>h_{\xi}</math> to estimate the mutual information terms <math>I(V_t;V_{t + 1})</math> and <math>\sum_i I(V_{t + 1};X_{t}^i)</math> respectively. Finally, the method optimizes the networks by maximizing the difference between the two terms (i.e., <math>\mathrm{\Psi}</math>). The architecture of this neural network system is shown in Figure a below.

+

[[File:学习因果涌现表征的架构.png|left|600x600px|Architecture for learning causal emergence representations]]

+

Figure b shows a toy model example. The microscopic input <math>X_t=(X_t^1,...,X_t^6) \in \{0,1\}^6</math> has six dimensions, each taking the value 0 or 1, and <math>X_{t + 1}</math> is the state that follows <math>X_{t}</math> at the next time step. The macroscopic state is <math>V_{t}=\oplus_{i = 1}^{5}X_t^i</math>, that is, the sum of the first five dimensions of <math>X_t</math> modulo 2 (their parity). The macroscopic states at two consecutive time steps are equal with probability <math>\gamma</math> (<math>p(\oplus_{j = 1..5}X_{t + 1}^j=\oplus_{j = 1..5}X_t^j)= \gamma</math>), and the sixth dimension of the microscopic input keeps its value between two consecutive time steps with probability <math>\gamma_{extra}</math> (<math>p(X_{t + 1}^6=X_t^6)= \gamma_{extra}</math>).
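The toy model can be simulated directly to see why the parity is a good macroscopic variable. The sketch below is illustrative only: the text fixes just the two probabilities <math>\gamma</math> and <math>\gamma_{extra}</math>, so the code additionally assumes that the first four bits are drawn uniformly at each step and the fifth bit is set to realize the required parity. The names `simulate_parity_toy` and `mi_bits` are hypothetical. Under this assumption each single bit is independent of the next parity, so the reference value is <math>\Psi = 1 - H_b(\gamma)</math> bits, where <math>H_b</math> is the binary entropy.

```python
import numpy as np

def simulate_parity_toy(T, gamma, gamma_extra, seed=0):
    """Simulate the 6-bit toy model (sampling of the free bits is an assumption)."""
    rng = np.random.default_rng(seed)
    X = np.zeros((T, 6), dtype=int)
    X[0] = rng.integers(0, 2, size=6)
    for t in range(T - 1):
        parity = X[t, :5].sum() % 2
        if rng.random() >= gamma:                      # parity flips w.p. 1 - gamma
            parity = 1 - parity
        X[t + 1, :4] = rng.integers(0, 2, size=4)      # assumed: uniform free bits
        X[t + 1, 4] = (parity - X[t + 1, :4].sum()) % 2  # fifth bit fixes the parity
        keep = rng.random() < gamma_extra              # sixth bit kept w.p. gamma_extra
        X[t + 1, 5] = X[t, 5] if keep else 1 - X[t, 5]
    return X

def mi_bits(a, b):
    """Plug-in mutual information (bits) between two binary sample arrays."""
    joint = np.histogram2d(a, b, bins=2, range=[[0, 1], [0, 1]])[0]
    joint /= joint.sum()
    outer = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / outer[nz])))

gamma, gamma_extra = 0.9, 0.9
X = simulate_parity_toy(20000, gamma, gamma_extra)
V = X[:, :5].sum(axis=1) % 2                            # macroscopic state: parity

psi_hat = mi_bits(V[:-1], V[1:]) - sum(mi_bits(X[:-1, j], V[1:]) for j in range(6))
h_b = -(gamma * np.log2(gamma) + (1 - gamma) * np.log2(1 - gamma))
psi_ref = 1.0 - h_b                                     # reference value under the assumption
print(f"estimated Psi = {psi_hat:.3f}, reference 1 - H_b(gamma) = {psi_ref:.3f}")
```

Because no single microscopic bit predicts the future parity, the subtracted sum in <math>\Psi</math> is close to zero and <math>\Psi</math> is positive, which is the situation the learned representation is expected to recover.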

+

In the simple example of Figure b, maximizing <math>\mathrm{\Psi}</math> with the model of Figure a yields a learned <math>\mathrm{\Psi}</math> that is approximately equal to the ground-truth <math>\mathrm{\Psi}</math>, which verifies the effectiveness of the learning procedure: the system correctly judges the occurrence of causal emergence. However, this method has difficulty with complex multivariate situations, because the number of neural networks on the right side of the figure is proportional to the number of pairs of macroscopic and microscopic variables. The more microscopic variables (dimensions) there are, the more neural networks are needed, which increases the computational complexity. In addition, the method has only been tested on very few cases, so it is not yet clear whether it scales. Finally, and more importantly, because the network computes an approximate index of causal emergence and yields a sufficient but not necessary condition for emergence, it inherits the drawbacks of the approximate method described above.

+

====Neural Information Compression Method====

+

In recent years, emerging artificial intelligence technologies have overcome a series of major problems. Machine learning methods, equipped with carefully designed neural network structures and automatic differentiation, can approximate functions in a huge function space. Building on this, Zhang Jiang et al. proposed a data-driven method based on neural networks to identify causal emergence from time series data [46][40]. This method can automatically extract effective coarse-graining strategies and macroscopic dynamics, overcoming various deficiencies of the Rosas method [37].

+

In this work, the input is the time series data <math>(X_1,X_2,...,X_T )</math>, where <math>X_t\equiv (X_t^1,X_t^2,\ldots,X_t^p )</math> and <math>p</math> is the dimension of the input data. The authors assume that this data is generated by a general stochastic dynamical system:

+

<math>\frac{dX}{dt}=f(X(t),\xi)</math>

+

Here <math>X(t)</math> is the microscopic state variable, <math>f</math> is the microscopic dynamics, and <math>\xi</math> is the noise in the system dynamics, which models the random characteristics of the dynamical system. However, <math>f</math> is unknown.

+

The so-called causal emergence identification problem refers to such a functional optimization problem:

+

{{NumBlk|:|

+

<math>

+

\begin{aligned}&\max_{\phi,f_{q},\phi^{\dagger}}\mathcal{J}(f_{q}),\\&\text{s.t.}\begin{cases}\parallel\hat{X}_{t + 1}-X_{t + 1}\parallel<\epsilon,\\\hat{X}_{t + 1}=\phi^{\dagger}\left(f_{q}\left(\phi(X_{t})\right)\right).\end{cases}\end{aligned}

+

</math>

+

|{{EquationRef|2}}}}

+

Here, <math>\mathcal{J}</math> is the dimension-averaged <math>EI</math> (see the entry on effective information), <math>\phi</math> is the coarse-graining strategy function, <math>f_{q}</math> is the macroscopic dynamics, <math>q</math> is the dimension of the coarse-grained macroscopic state, and <math>\hat{X}_{t + 1}</math> is the prediction of the microscopic state at time <math>t + 1</math> made by the entire framework. This prediction is obtained by applying the inverse coarse-graining function <math>\phi^{\dagger}</math> to the macroscopic state prediction <math>\hat{Y}_{t + 1}</math> at time <math>t + 1</math>. Here <math>\hat{Y}_{t + 1}\equiv f_q(Y_t)</math> is the dynamics learner's prediction of the macroscopic state at time <math>t + 1</math> from the macroscopic state <math>Y_t</math> at time <math>t</math>, and <math>Y_t\equiv \phi(X_t)</math> is the macroscopic state at time <math>t</math>, obtained by coarse-graining <math>X_t</math> with <math>\phi</math>. Finally, <math>\hat{X}_{t + 1}</math> is compared with the real microscopic state <math>X_{t + 1}</math> to obtain the microscopic prediction error.

+

The entire optimization framework is shown below:

+

[[File:NIS_Optimization.png|alt=NIS optimization framework|left|400x400px|NIS optimization framework]]

+

The objective function of this optimization problem is <math>EI</math>, which is a functional of the functions <math>\phi,f_q,\phi^{\dagger}</math> (here the macroscopic dimension <math>q</math> is a hyperparameter), so it is difficult to optimize directly, and machine learning methods are needed to solve it.

+

=====NIS=====

+

To identify causal emergence in the system, the authors propose the neural information squeezer (NIS) neural network architecture [46]. This architecture is based on an encoder, dynamics learner and decoder framework: the model consists of three parts, which are used respectively for coarse-graining the original data to obtain the macroscopic state, fitting the macroscopic dynamics, and inverse coarse-graining (decoding the macroscopic state, combined with random noise, into the microscopic state). The authors use an invertible neural network (INN) to construct the encoder and the decoder, which approximately correspond to the coarse-graining function <math>\phi</math> and the inverse coarse-graining function <math>\phi^{\dagger}</math> respectively. An invertible neural network is used because it can simply be inverted to obtain the inverse coarse-graining function (i.e., <math>\phi^{\dagger}\approx \phi^{-1}</math>). This model framework can be regarded as a neural information compressor. It puts the noisy microscopic state data into a narrow information channel, compresses it into a macroscopic state and discards useless information, so that the causality of the macroscopic dynamics is stronger, and then decodes the result into a prediction of the microscopic state. The model framework of the NIS method is shown in the following figure:

+

[[File:NIS模型框架图.png|left|500x500px|alt=NIS model framework diagram|NIS model framework diagram]]

+

Specifically, the encoder function <math>\phi</math> consists of two parts:

+

<math>

+

\phi\equiv \chi\circ\psi

+

</math>

+

Here <math>\psi</math> is an invertible function implemented by an invertible neural network, and <math>\chi</math> is a projection function that removes the last <math>p - q</math> components from a <math>p</math>-dimensional vector. Here <math>p</math> and <math>q</math> are the dimensions of the microscopic state and the macroscopic state respectively, and <math>\circ</math> is the composition of functions.

+

The decoder is the function <math>\phi^{\dagger}</math>, which is defined as:

+

<math>

+

\phi^{\dagger}(y)\equiv \psi^{-1}(y\bigoplus z)

+

</math>

+

Here <math>z\sim\mathcal{N}\left (0,I_{p - q}\right )</math> is a <math>(p - q)</math>-dimensional random vector that follows the standard normal distribution, and <math>y\bigoplus z</math> denotes the concatenation of the two vectors.
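The structure of the encoder and decoder can be illustrated with a minimal NumPy sketch. It is not the NIS implementation: as a stated assumption, the invertible neural network <math>\psi</math> is replaced by a fixed random orthogonal matrix, which is trivially invertible, and the function names `psi`, `psi_inv`, `encode` and `decode` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 4, 2                      # micro and macro dimensions (illustrative values)

# Stand-in for the invertible network psi: a random orthogonal matrix Q
# (assumption for illustration; the original work uses an invertible neural network).
Q, _ = np.linalg.qr(rng.normal(size=(p, p)))

def psi(x):
    return x @ Q.T               # invertible map on row vectors

def psi_inv(u):
    return u @ Q                 # inverse, since Q.T @ Q = I

def encode(x):
    """phi = chi o psi: apply psi, then drop the last p - q components."""
    return psi(x)[..., :q]

def decode(y, rng):
    """phi_dagger(y) = psi^{-1}(y concatenated with z), z ~ N(0, I_{p-q})."""
    z = rng.standard_normal(y.shape[:-1] + (p - q,))
    return psi_inv(np.concatenate([y, z], axis=-1))

x = rng.normal(size=(5, p))      # a batch of five micro states
y = encode(x)                    # macro states, shape (5, q)
x_hat = decode(y, rng)           # micro reconstructions, shape (5, p)

# Encoding a decoded state returns the same macro state, whatever noise z was drawn.
print(np.allclose(encode(x_hat), y))
```

The check at the end reflects the design: the noise <math>z</math> only fills the components that the projection <math>\chi</math> discards, so decoding followed by encoding leaves the macroscopic state unchanged, while `x_hat` generally differs from `x` when <math>q < p</math>.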

+

However, directly optimizing the dimension-averaged effective information is difficult. The article [46] therefore does not optimize Equation {{EquationNote|2}} directly, but divides the optimization process into two stages. The first stage minimizes the microscopic state prediction error for a given macroscopic scale <math>q</math>, that is, <math>\min _{\phi, f_q, \phi^{\dagger}}\left\|\phi^{\dagger}(\hat{Y}_{t + 1}) - X_{t + 1}\right\|</math>, until the error falls below <math>\epsilon</math>, and obtains the optimal macroscopic dynamics <math>f_q^\ast</math>. The second stage searches over the hyperparameter <math>q</math> to maximize the effective information <math>\mathcal{J}</math>, that is, <math>\max_{q}\mathcal{J}(f_{q}^\ast)</math>. Practice has shown that this method can effectively find macroscopic dynamics and coarse-graining functions, but it does not truly maximize EI.

+

In addition to being able to automatically identify causal emergence based on time series data, this framework also has good theoretical properties. There are two important theorems:

+

'''Theorem 1''': The information bottleneck of the neural information squeezer. That is, for any bijection <math>\mathrm{\psi}</math>, projection <math>\chi</math>, macroscopic dynamics <math>f</math> and Gaussian noise <math>z_{p - q}\sim\mathcal{N}\left (0,I_{p - q}\right )</math>,

+

<math>

+

I\left(Y_t;Y_{t + 1}\right)=I\left(X_t;{\hat{X}}_{t + 1}\right)

+

</math>

+

always holds. This means that all the information discarded by the encoder is actually noise information unrelated to prediction.

+

'''Theorem 2''': For a trained model, <math>I\left(X_t;{\hat{X}}_{t + 1}\right)\approx I\left(X_t;X_{t + 1}\right)</math>. Therefore, combining Theorem 1 and Theorem 2, we can obtain for a trained model:

+

<math>

+

I\left(Y_t;Y_{t + 1}\right)\approx I\left(X_t;X_{t + 1}\right)

+

</math>

+

======Comparison with Classical Theories======

+

The NIS framework has many similarities with the computational mechanics framework mentioned in the previous sections. NIS can be regarded as an <math>\epsilon</math>-machine. The set of all historical processes <math>\overleftarrow{S}</math> in computational mechanics can be regarded as the microscopic states, and all <math>R \in \mathcal{R}</math> represent macroscopic states. The function <math>\eta</math> can be understood as a coarse-graining function, <math>\epsilon</math> as an effective coarse-graining strategy, and <math>T</math> as the effective macroscopic dynamics. The minimum randomness property characterizes the determinism of the macroscopic dynamics and can be replaced by effective information in causal emergence. When the entire framework is fully trained and can accurately predict the future microscopic state, the encoded macroscopic state converges to the effective state, which can be regarded as the causal state in computational mechanics.

+

At the same time, the NIS framework also has similarities with the G-emergence theory mentioned earlier. For example, NIS also adopts the idea of Granger causality: optimizing the effective macroscopic state by predicting the microscopic state at the next time step. However, there are several obvious differences between these two frameworks: a) In the G-emergence theory, the macroscopic state needs to be manually selected, while in NIS, the macroscopic state is obtained by automatically optimizing the coarse-graining strategy; b) NIS uses neural networks to predict future states, while G-emergence uses autoregressive techniques to fit the data.

+

======Computational Examples======

+

The authors of NIS conducted experiments on the spring oscillator model, and the results are shown in the following figure. Figure a shows that the encoding of the next time step coincides linearly with the iterated output of the macroscopic dynamics, verifying the effectiveness of the model. Figure b shows that the two learned dynamics also coincide with the real dynamics, further verifying the effectiveness of the model. Figure c shows the multi-step prediction of the model; the predicted and real curves are very close. Figure d shows the magnitude of causal emergence at different scales. Causal emergence is most significant when the scale is 2, which matches the real spring oscillator model: only two states (position and velocity) are needed to describe the entire system.

+

[[File:弹簧振子模型1.png|left|600x600px|alt=Spring oscillator model 1|Spring oscillator model]]

+

=====NIS+=====

+

Although NIS was the first scheme to optimize EI in order to identify causal emergence in data, it has a shortcoming: the optimization process is divided into two stages and does not truly maximize the effective information, that is, the objective in Equation {{EquationNote|2}}. Therefore, Yang Mingzhe et al. [40] further improved this method and proposed the NIS+ scheme. By introducing reverse dynamics and a reweighting technique, the original maximization of effective information is transformed, by means of a variational inequality, into the maximization of its variational lower bound, so that the objective function can be optimized directly.

+

======Mathematical Principles======

+

Specifically, according to variational inequality and inverse probability weighting method, the constrained optimization problem given by Equation {{EquationNote|2}} can be transformed into the following unconstrained minimization problem:

+

<math>\min_{\omega,\theta,\theta'} \sum_{t = 0}^{T - 1}w(\boldsymbol{x}_t)||\boldsymbol{y}_t-g_{\theta'}(\boldsymbol{y}_{t + 1})||+\lambda||\hat{\boldsymbol{x}}_{t + 1}-\boldsymbol{x}_{t + 1}||</math>

+

Here <math>g</math> is the reverse dynamics, which can be approximated by a neural network and trained on pairs of macroscopic states <math>y_{t + 1},y_{t}</math>. <math>w(x_t)</math> is the inverse probability weight, which is calculated as follows:

+

<math>

+

w(\boldsymbol{x}_t)=\frac{\tilde{p}(\boldsymbol{y}_t)}{p(\boldsymbol{y}_t)}=\frac{\tilde{p}(\phi(\boldsymbol{x}_t))}{p(\phi(\boldsymbol{x}_t))}

+

</math>

+

Here <math>\tilde{p}(\boldsymbol{y}_{t})</math> is the target distribution and <math>p(\boldsymbol{y}_{t})</math> is the original distribution of the data.
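The reweighting step and the weighted objective can be sketched in a few lines of NumPy. This is a hedged illustration rather than the NIS+ implementation: it assumes a one-dimensional macroscopic state, a uniform target distribution <math>\tilde{p}</math> over the observed range, a histogram estimate of <math>p</math>, and it places the <math>\lambda</math> term inside the sum over time. The names `inverse_probability_weights` and `nis_plus_loss`, and the stand-in reverse dynamics `g`, are hypothetical.

```python
import numpy as np

def inverse_probability_weights(y, bins=20):
    """w = p_tilde(y) / p(y) for 1-D macro samples y.
    Assumptions: p_tilde is uniform on [min(y), max(y)]; p is a histogram estimate."""
    y = np.asarray(y, dtype=float)
    density, edges = np.histogram(y, bins=bins, density=True)
    idx = np.clip(np.digitize(y, edges[1:-1]), 0, bins - 1)  # bin index of each sample
    p = density[idx]                                         # > 0 for every observed sample
    p_tilde = 1.0 / (edges[-1] - edges[0])
    return p_tilde / p

def nis_plus_loss(y_t, y_next, x_next, x_next_hat, g, w, lam=1.0):
    """sum_t [ w_t * ||y_t - g(y_{t+1})|| + lam * ||x_hat_{t+1} - x_{t+1}|| ]."""
    backward = np.linalg.norm(y_t - g(y_next), axis=-1)      # reverse-dynamics error
    forward = np.linalg.norm(x_next_hat - x_next, axis=-1)   # micro prediction error
    return float(np.sum(w * backward + lam * forward))

rng = np.random.default_rng(0)
y = rng.normal(size=5000)                  # hypothetical macro states, far from uniform
w = inverse_probability_weights(y)

# After reweighting, every non-empty histogram bin carries the same total weight.
hist = np.histogram(y, bins=20, weights=w)[0]

# Hypothetical stand-ins for the learned components, only to exercise the loss.
y_t, y_next = y[:-1, None], y[1:, None]
x_next = np.hstack([y_next, y_next])
x_next_hat = x_next + 0.1
g = lambda v: 0.5 * v
loss = nis_plus_loss(y_t, y_next, x_next, x_next_hat, g, w[:-1], lam=1.0)
print(f"loss = {loss:.2f}")
```

Samples from sparsely populated regions of the macroscopic state space receive large weights, so the reweighted data behaves as if the macroscopic state had been drawn from the target distribution.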

+

======Workflow and Model Architecture======

+

The following figure shows the entire model framework of NIS+. Figure a is the input of the model: time series data, which can be trajectory sequences, continuous image sequences, EEG time series and so on. Figure c is the output of the model, including the degree of causal emergence, the macroscopic dynamics, the emergent patterns and the coarse-graining strategy. Figure b is the model architecture itself. Compared with the NIS method, two components are added: reverse dynamics and the reweighting technique.

+

[[File:NIS+.png|left|600x600px|alt=NIS+ model framework diagram|NIS+ model framework diagram]]

+

======Case Analysis======

+

The article conducts experiments on different time series data sets, including the data generated by the disease transmission dynamical system model SIR dynamics, the bird flock model (Boids model) and the cellular automaton: Game of Life, as well as the fMRI signal data of the brain nervous system of real human subjects. Here we choose the bird flock and brain signals for experimental introduction and description.

+

The following figure shows the experimental results of NIS+ learning the flocking behavior of the Boids model. (a) and (e) give the actual and predicted trajectories of the bird flocks under different conditions. Specifically, the authors divide the birds into two groups and compare the multi-step prediction results under different noise levels (<math>\alpha</math> is 0.001 and 0.4 respectively). The prediction is good when the noise is relatively small, and the predicted curves diverge when the noise is relatively large. (b) shows that the mean absolute error (MAE) of multi-step prediction gradually increases as the radius r increases. (c) shows how the causal emergence measure <math>\Delta J</math> and the prediction error (MAE) change with the training epoch for different macroscopic dimensions (q). The authors find that causal emergence is most significant when the macroscopic state dimension is q = 8. (d) is the attribution analysis of macroscopic variables to microscopic variables, and the resulting saliency map intuitively describes the learned coarse-graining function. Each macroscopic dimension can be matched to the spatial coordinate (microscopic dimension) of a bird. The darker the color, the higher the correlation. The microscopic coordinate with the maximum correlation for each macroscopic state dimension is highlighted with an orange dot. These attribution values are obtained with the Integrated Gradients (IG) method. The horizontal axis represents the x and y coordinates of the 16 birds in the microscopic state, and the vertical axis represents the 8 macroscopic dimensions. The light blue dotted lines separate the coordinates of individual Boids, and the blue solid line separates the two flocks. (f) and (g) show how the causal emergence measure <math>\Delta J</math> and the normalized MAE change under different noise levels. (f) shows the influence of external noise (that is, observation noise added to the microscopic data) on causal emergence. (g) shows the influence of internal noise (represented by <math>\alpha</math>, added by modifying the dynamics of the Boids model) on causal emergence. In (f) and (g), the horizontal line marks the threshold beyond which the error constraint in Equation {{EquationNote|2}} is violated. When the normalized MAE is greater than the threshold of 0.3, the constraint is violated and the result is unreliable.

+

This set of experiments shows that NIS+ can learn macroscopic states and coarse-graining strategies by maximizing EI. This maximization enhances the generalization ability of the model to situations beyond the range of the training data. The learned macroscopic state effectively identifies the average group behavior and can be attributed to individual positions using the Integrated Gradients method. In addition, the degree of causal emergence increases as the external noise increases and decreases as the internal noise increases. This observation shows that the model can eliminate external noise through coarse-graining, but cannot reduce internal noise.
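The attribution step can be illustrated with a small sketch of Integrated Gradients, which attributes a function's output to its inputs by averaging the gradient along a straight path from a baseline to the input and scaling by the input difference. The coarse-graining function below is a hypothetical stand-in (a macroscopic variable equal to the tanh of the mean x-coordinate of two birds) with an analytic gradient; it is not the network learned by NIS+, and the zero baseline is likewise an assumption.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=200):
    """IG_i = (x_i - b_i) * mean over the path of dF/dx_i (midpoint Riemann sum)."""
    x = np.asarray(x, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    alphas = (np.arange(steps) + 0.5) / steps
    path = baseline + alphas[:, None] * (x - baseline)   # points on the straight path
    grads = np.stack([grad_fn(pt) for pt in path])
    return (x - baseline) * grads.mean(axis=0)

# Hypothetical coarse-graining: micro state = (x1, y1, x2, y2) of two birds,
# macro variable F = tanh(mean of the x-coordinates).
weights = np.array([0.5, 0.0, 0.5, 0.0])

def macro(x):
    return np.tanh(weights @ x)

def macro_grad(x):
    return (1.0 - np.tanh(weights @ x) ** 2) * weights   # analytic gradient of F

x = np.array([1.0, 2.0, -0.5, 0.3])
baseline = np.zeros(4)
attr = integrated_gradients(macro_grad, x, baseline)

# Completeness: the attributions sum (approximately) to F(x) - F(baseline).
print(attr, attr.sum(), macro(x) - macro(baseline))
```

In this sketch the y-coordinates receive zero attribution because the macroscopic variable does not depend on them, which mirrors how a saliency map over microscopic coordinates reveals what a learned coarse-graining function attends to.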

+

[[File:NIS+ boids.png|left|700x700px|Causal emergence in bird flocks]]

+

The brain experiment is based on real fMRI data, obtained from two sets of experiments on 830 human subjects. In the first set, the subjects performed a visual task of watching a short movie clip while being recorded. In the second set, they were recorded in a resting state. Because the original dimension is relatively high, the authors first reduced the original 14000-dimensional data to 100 dimensions using the Schaefer atlas method, so that each dimension corresponds to a brain region. The authors then trained NIS+ on these data and extracted the dynamics at six different macroscopic scales. Figure a shows the multi-step prediction error at different scales. Figure b compares the EI of the NIS and NIS+ methods at different macroscopic dimensions in the resting state and in the movie-watching visual task. The authors found that in the visual task, causal emergence is most significant when the macroscopic state dimension is q = 1. Attribution analysis shows that the visual area plays the largest role (Figure c), which is consistent with the real scenario. Figure d shows brain region attribution from different viewing angles. In the resting state, one macroscopic dimension is not enough to predict the microscopic time series data, and causal emergence is largest for dimensions between 3 and 7.

+

[[File:NIS+ 脑数据.png|left|700x700px|Causal emergence in the brain nervous system]]

+

These experiments show that NIS+ can identify causal emergence in data and discover emergent macroscopic dynamics and coarse-graining strategies. Other experiments show that maximizing EI also improves the out-of-distribution generalization ability of the NIS+ model.

Complexivist Ran

150

edits