Causal emergence is a special kind of emergence in dynamical systems in which the system exhibits stronger causal characteristics at the macroscopic scale than at the microscopic scale. In particular, for a class of Markov dynamical systems, an appropriate coarse-graining of the state space yields macroscopic dynamics with stronger causal characteristics than the microscopic dynamics; causal emergence is then said to occur in the system <ref name=":0">Hoel E P, Albantakis L, Tononi G. Quantifying causal emergence shows that macro can beat micro[J]. Proceedings of the National Academy of Sciences, 2013, 110(49): 19790-19795.</ref><ref name=":1">Hoel E P. When the map is better than the territory[J]. Entropy, 2017, 19(5): 188.</ref>. Causal emergence theory is also a theory that uses causal effect measures to quantify emergence phenomena in complex systems.

== History ==

=== Development of related concepts ===

Causal emergence theory attempts to answer the question of what emergence is from a phenomenological perspective, using quantitative methods based on causality. Its development is therefore closely tied to how the concepts of emergence and causality have been understood and developed.

==== Emergence ====

Emergence has always been an important characteristic of complex systems and a core concept in many discussions of system complexity and of the relationship between the macroscopic and microscopic levels <ref>Meehl P E, Sellars W. The concept of emergence[J]. Minnesota studies in the philosophy of science, 1956, 1: 239-252.</ref><ref name=":7">Holland J H. Emergence: From chaos to order[M]. OUP Oxford, 2000.</ref>. Emergence can be simply understood as the whole being greater than the sum of its parts, that is, the whole exhibits new characteristics that the individuals constituting it do not possess <ref>Anderson P W. More is different: broken symmetry and the nature of the hierarchical structure of science[J]. Science, 1972, 177(4047): 393-396.</ref>. Although scholars have pointed out emergence phenomena in various fields <ref name=":7" /><ref>Holland, J.H. Hidden Order: How Adaptation Builds Complexity; Addison Wesley Longman Publishing Co., Inc.: Boston, MA, USA, 1996.</ref>, such as the collective behavior of birds <ref>Reynolds, C.W. Flocks, herds and schools: A distributed behavioral model. In Proceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques, Anaheim, CA, USA, 27–31 July 1987; pp. 25–34.</ref>, the formation of consciousness in the brain, and the emergent capabilities of large language models <ref>Wei, J.; Tay, Y.; Bommasani, R.; Raffel, C.; Zoph, B.; Borgeaud, S.; Yogatama, D.; Bosma, M.; Zhou, D.; Metzler, D.; et al. Emergent abilities of large language models. arXiv 2022, arXiv:2206.07682.</ref>, there is currently no universally accepted understanding of the phenomenon. Earlier research on emergence remained mostly qualitative. For example, Bedau et al. <ref name=":9">Bedau, M.A. Weak emergence. Philos. Perspect. 1997, 11, 375–399.</ref><ref>Bedau, M. Downward causation and the autonomy of weak emergence. Principia Int. J. Epistemol. 2002, 6, 5–50.</ref> classified emergence into nominal emergence <ref name=":10">Harré, R. The Philosophies of Science; Oxford University Press: New York, NY, USA, 1985.</ref><ref name=":11">Baas, N.A. Emergence, hierarchies, and hyperstructures. In Artificial Life III, SFI Studies in the Science of Complexity, XVII; Routledge: Abingdon, UK, 1994; pp. 515–537.</ref>, weak emergence <ref name=":9" /><ref>Newman, D.V. Emergence and strange attractors. Philos. Sci. 1996, 63, 245–261.</ref>, and strong emergence <ref name=":12">Kim, J. ‘Downward causation’ in emergentism and nonreductive physicalism. In Emergence or Reduction; Walter de Gruyter: Berlin, Germany, 1992; pp. 119–138.</ref><ref name=":13">O’Connor, T. Emergent properties. Am. Philos. Q. 1994, 31, 91–104.</ref>.

Nominal emergence can be understood as attributes and patterns that can be possessed by the macroscopic level but not by the microscopic level. For example, the shape of a circle composed of several pixels is a kind of nominal emergence <ref name=":10" /><ref name=":11" />.

Weak emergence refers to macroscopic-level attributes or processes that are generated by complex interactions between individual components. It can also be understood as a characteristic that can, in principle, be simulated by a computer. Because of computational irreducibility, even a weakly emergent characteristic that can be simulated still cannot easily be reduced to microscopic-level attributes. For weak emergence, the causes of a pattern may come from both the microscopic and macroscopic levels <ref name=":12" /><ref name=":13" />, so emergent causal relationships may coexist with microscopic causal relationships.

Strong emergence is more controversial. It refers to macroscopic-level attributes that cannot, in principle, be reduced to microscopic-level attributes, including the interactions between individuals. Jochen Fromm further interprets strong emergence as the causal effect of downward causation <ref>Fromm, J. Types and forms of emergence. arXiv 2005, arXiv:nlin/0506028</ref>, that is, a causal force from the macroscopic level to the microscopic level. However, the concept of downward causation is itself disputed <ref>Bedau, M.A.; Humphreys, P. Emergence: Contemporary Readings in Philosophy and Science; MIT Press: Cambridge, MA, USA, 2008.</ref><ref>Yurchenko, S.B. Can there be a synergistic core emerging in the brain hierarchy to control neural activity by downward causation? TechRxiv 2023.</ref>.

From these early studies, it can be seen that emergence has a natural and profound connection with causality.

==== Causality and its measurement ====

Causality refers to the influence of one event on another. Causality is not the same as correlation: a causal relationship requires not only that B occurs when A occurs, but also that B does not occur when A does not occur. Only by intervening on event A and then examining the outcome B can one detect whether there is a causal relationship between A and B.

With the further development of causal science in recent years, causality can be quantified within a mathematical framework. Causality describes the causal effect of a dynamical process <ref name=":14">Pearl J. Causality[M]. Cambridge university press, 2009.</ref><ref>Granger C W. Investigating causal relations by econometric models and cross-spectral methods[J]. Econometrica: journal of the Econometric Society, 1969, 424-438.</ref><ref name=":8">Pearl J. Models, reasoning and inference[J]. Cambridge, UK: Cambridge University Press, 2000, 19(2).</ref>. Judea Pearl <ref name=":8" /> uses probabilistic graphical models to describe causal interactions, and uses different models to distinguish and quantify three levels of causality. Here we are mainly concerned with the second level of the causal ladder: intervening on the input distribution. In addition, because of the uncertainty and ambiguity behind discovered causal relationships, measuring the degree of causal effect between two variables is another important issue. Many independent historical studies have addressed the measurement of causal relationships. These measures include Hume's concept of constant conjunction <ref>Spirtes, P.; Glymour, C.; Scheines, R. Causation Prediction and Search, 2nd ed.; MIT Press: Cambridge, MA, USA, 2000.</ref> and value function-based methods <ref>Chickering, D.M. Learning equivalence classes of Bayesian-network structures. J. Mach. Learn. Res. 2002, 2, 445–498.</ref>, Eells and Suppes' probabilistic causal measures <ref>Eells, E. Probabilistic Causality; Cambridge University Press: Cambridge, UK, 1991; Volume 1.</ref><ref>Suppes, P. A probabilistic theory of causality. Br. J. Philos. Sci. 1973, 24, 409–410.</ref>, and Judea Pearl's causal measures <ref name=":14" />.

==== Causal emergence ====

As mentioned earlier, emergence and causality are interconnected. On the one hand, emergence is the causal effect of complex nonlinear interactions among the components of a complex system; on the other hand, emergent properties also have a causal effect on the individual elements of the system. In addition, macroscopic factors have traditionally been attributed to the influence of microscopic factors. However, macroscopic emergent patterns often cannot be attributed to microscopic factors, so no corresponding microscopic cause can be found. Thus, there is a profound connection between emergence and causality. Moreover, although a qualitative classification of emergence exists, it does not quantitatively characterize the occurrence of emergence. Causality offers a way to do so.

In 2013, Erik Hoel, an American theoretical neurobiologist, sought to introduce causality into the measurement of emergence, proposed the concept of causal emergence, and used effective information (EI for short) to quantify the strength of causality in system dynamics <ref name=":0" /><ref name=":1" />. '''Causal emergence can be described as follows: when a system has a stronger causal effect at the macroscopic scale than at the microscopic scale, causal emergence occurs.''' Causal emergence characterizes the differences and connections between the macroscopic and microscopic states of a system. It also brings together two core concepts: causality in artificial intelligence and emergence in complex systems. Causal emergence further provides scholars with a quantitative perspective on a series of philosophical questions. For example, top-down causal characteristics in living or social systems can be discussed within the causal emergence framework. Top-down causation here refers to downward causation [26], that is, the existence of causal effects from the macroscopic to the microscopic level. Consider a gecko shedding its tail: when it encounters danger, the gecko breaks off its tail regardless of its condition. Here the whole is the cause and the tail is the effect, so there is a causal force pointing from the whole to the part.

=== Early work on quantifying emergence ===

Specifically, suppose we use a bivariate autoregressive model for prediction. With only two variables A and B, the model has two equations, one for each variable, and the current value of each variable is expressed in terms of its own past values and the past values of the other variable within a certain range of time lags. The model also yields residuals, which can be understood as prediction errors and can be used to measure the degree of Granger causal effect (called G-causality) in each equation. The degree to which B is a Granger cause (G-cause) of A is the logarithm of the ratio of two residual variances: the residual variance of A's autoregressive model when B is omitted, and the residual variance of the full prediction model (including A and B). In addition, the author also defines the concept of “G-autonomous”, which measures the extent to which the past values of a time series can predict its own future values. The strength of this autonomous predictive causal effect can be characterized in a similar way to G-causality.
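
The log-ratio of residual variances described above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names <code>residual_variance</code> and <code>granger_causality</code>, the ordinary-least-squares fit, the lag of 1 and the simulated series (in which B drives A) are all hypothetical choices made here, not taken from the cited work.

```python
import numpy as np

def residual_variance(target, predictors):
    """Variance of OLS residuals of target regressed on predictors plus an intercept."""
    X = np.column_stack([np.ones(len(target))] + predictors)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return float(np.var(target - X @ coef))

def granger_causality(a, b, lag=1):
    """G-causality from b to a: log of restricted over full residual variance."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    y = a[lag:]
    # Lagged copies aligned with y: entry k holds the series shifted back by k steps.
    own_past = [a[lag - k:len(a) - k] for k in range(1, lag + 1)]
    other_past = [b[lag - k:len(b) - k] for k in range(1, lag + 1)]
    restricted = residual_variance(y, own_past)           # model of A with B omitted
    full = residual_variance(y, own_past + other_past)    # model of A including B
    return float(np.log(restricted / full))

# Hypothetical data: B influences A with a one-step delay, but not the reverse.
rng = np.random.default_rng(0)
T = 2000
a, b = np.zeros(T), np.zeros(T)
for t in range(1, T):
    b[t] = 0.5 * b[t - 1] + rng.normal()
    a[t] = 0.5 * a[t - 1] + 0.8 * b[t - 1] + rng.normal()

g_b_to_a = granger_causality(a, b)   # clearly positive in this simulation
g_a_to_b = granger_causality(b, a)   # close to zero in this simulation
print(g_b_to_a, g_a_to_b)
```

Because the restricted model is nested in the full one, the in-sample ratio is never below 1, so the measure is non-negative; a statistical test would still be needed to decide whether a small positive value is significant.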

[[File:G Emergence Theory.png|G-emergence theory diagram|alt=G-emergence theory diagram|left|400x300px]]

As shown in the above figure, the occurrence of emergence can be judged from the two basic concepts of G-causality introduced above; the resulting measure of emergence based on Granger causality is denoted G-emergence. If A is regarded as a macroscopic variable and B as a microscopic variable, emergence occurs under two conditions: 1) A is G-autonomous with respect to B; 2) B is a G-cause of A. The degree of G-emergence is the product of the degree of A's G-autonomy and the average degree to which B G-causes A.

Another approach understands emergence from the perspective of "the whole is greater than the sum of its parts" <ref>Teo, Y.M.; Luong, B.L.; Szabo, C. Formalization of emergence in multi-agent systems. In Proceedings of the 1st ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, Montreal, QC, Canada, 19–22 May 2013; pp. 231–240.</ref><ref>Szabo, C.; Teo, Y.M. Formalization of weak emergence in multiagent systems. ACM Trans. Model. Comput. Simul. (TOMACS) 2015, 26, 1–25.</ref>. This method defines emergence in terms of interaction rules and agent states rather than through statistical measures over the system as a whole. Specifically, the measure is the difference of two terms: the first describes the collective state of the entire system, and the second is the sum of the individual states of all components. The measure emphasizes that emergence arises from the interactions and collective behavior of the system.

=== Causal emergence theory based on effective information ===

Historically, the first relatively complete and explicit quantitative theory that uses causality to define emergence is the causal emergence theory proposed by Erik Hoel, Larissa Albantakis, and Giulio Tononi <ref name=":0" /><ref name=":1" />. For Markov chains, this theory defines causal emergence as the phenomenon in which the coarse-grained Markov chain has a greater causal effect strength than the original Markov chain. Here, causal effect strength is measured by effective information, a modification of mutual information. The main difference is that the state variable at time <math>t</math> is set by a do-intervention to the uniform (that is, maximum entropy) distribution. Effective information was proposed by Giulio Tononi as early as 2003 in the study of integrated information theory. As Giulio Tononi's student, Erik Hoel applied effective information to Markov chains and proposed the causal emergence theory based on effective information.

=== Causal Emergence Theory Based on Information Decomposition ===

In 2020, Rosas et al. <ref name=":5" /> proposed a method based on information decomposition that defines causal emergence from an information-theoretic perspective and quantitatively characterizes emergence in terms of synergistic or redundant information. Information decomposition is a new method for analyzing the complex interrelationships among variables in complex systems. It decomposes information into parts, each represented by an information atom, and organizes these atoms with the help of an information lattice. Both synergistic information and redundant information can be represented by corresponding information atoms. The method builds on the non-negative decomposition of multivariate information proposed by Williams and Beer <ref name=":16">Williams P L, Beer R D. Nonnegative decomposition of multivariate information[J]. arXiv preprint arXiv:1004.2515, 2010.</ref>. In the paper, partial information decomposition (PID) is used to decompose the mutual information between microstates and macrostates. However, the PID framework can only decompose the mutual information between multiple source variables and one target variable. Rosas et al. extended this framework and proposed the integrated information decomposition method <math>\Phi ID</math> <ref name=":18">P. A. Mediano, F. Rosas, R. L. Carhart-Harris, A. K. Seth, A. B. Barrett, Beyond integrated information: A taxonomy of information dynamics phenomena, arXiv preprint arXiv:1909.02297 (2019).</ref>.

=== Recent Work ===

Barnett et al. <ref name=":6">Barnett L, Seth AK. Dynamical independence: discovering emergent macroscopic processes in complex dynamical systems. Physical Review E. 2023 Jul;108(1):014304.</ref> proposed the concept of dynamical decoupling, which uses transfer entropy to judge whether the macroscopic dynamics is decoupled from the microscopic dynamics and thereby whether emergence occurs. That is, emergence is characterized as the macroscopic and microscopic variables being independent of each other, with no causal relationship between them, which can also be regarded as a causal emergence phenomenon.

In 2024, Zhang Jiang et al. <ref name=":2">Zhang J, Tao R, Yuan B. Dynamical Reversibility and A New Theory of Causal Emergence. arXiv preprint arXiv:2402.15054. 2024 Feb 23.</ref> proposed a new causal emergence theory based on singular value decomposition. The core idea of this theory is to point out that the so-called causal emergence is actually equivalent to the emergence of dynamical reversibility. Given the Markov transition matrix of a system, by performing singular value decomposition on it, the sum of the <math>\alpha</math> power of the singular values is defined as the reversibility measure of Markov dynamics (<math>\Gamma_{\alpha}\equiv \sum_{i=1}^N\sigma_i^{\alpha}</math>), where [math]\sigma_i[/math] is the singular value. This index is highly correlated with effective information and can also be used to characterize the causal effect strength of dynamics. According to the spectrum of singular values, this method can directly define the concepts of '''clear emergence''' and '''vague emergence''' without explicitly defining a coarse-graining scheme.

== Quantification of causal emergence ==

==== Erik Hoel's causal emergence theory ====

In 2013, Hoel et al. <ref name=":0" /><ref name=":1" /> proposed the causal emergence theory. The following figure is an abstract framework for this theory. The horizontal axis represents time and the vertical axis represents scale. This framework can be regarded as a description of the same dynamical system on both microscopic and macroscopic scales. Among them, [math]f_m[/math] represents microscopic dynamics, [math]f_M[/math] represents macroscopic dynamics, and the two are connected by a coarse-graining function [math]\phi[/math]. In a discrete-state Markov dynamical system, both [math]f_m[/math] and [math]f_M[/math] are Markov chains. By performing coarse-graining of the Markov chain on [math]f_m[/math], [math]f_M[/math] can be obtained. <math> EI </math> is a measure of effective information. Since the microscopic state may have greater randomness, which leads to relatively weak causality of microscopic dynamics, by performing reasonable coarse-graining on the microscopic state at each moment, it is possible to obtain a macroscopic state with stronger causality. The so-called causal emergence refers to the phenomenon that when we perform coarse-graining on the microscopic state, the effective information of macroscopic dynamics will increase, and the difference in effective information between the macroscopic state and the microscopic state is defined as the intensity of causal emergence.

[[File:因果涌现理论.png|Framework of causal emergence theory|alt=Abstract framework of causal emergence theory|left|400x400px]]

===== Effective Information =====

In a Markov chain, the state variable [math]X_t[/math] at any time can be regarded as the cause, and the state variable [math]X_{t + 1}[/math] at the next time can be regarded as the result. Thus, the state transition matrix of the Markov chain is its causal mechanism. Therefore, the calculation formula for <math>EI</math> for a Markov chain is as follows:

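<math>
\begin{aligned}
EI(f) &= I(\tilde{X}_t;\tilde{X}_{t+1}),\quad \tilde{X}_t\sim U(\mathcal{X})\\
&= \frac{1}{N}\sum_{i=1}^N\sum_{j=1}^N p_{ij}\log\frac{N\cdot p_{ij}}{\sum_{k=1}^N p_{kj}}
\end{aligned}
</math>

In this formula <math>N</math> denotes the number of states.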

Here <math>f</math> represents the state transition matrix of a Markov chain, [math]U(\mathcal{X})[/math] represents the uniform distribution on the value space [math]\mathcal{X}[/math] of the state variable [math]X_t[/math]. <math>\tilde{X}_t,\tilde{X}_{t+1}</math> are the states at two consecutive moments after intervening [math]X_t[/math] at time <math>t</math> into a uniform distribution. <math>p_{ij}</math> is the transition probability from the <math>i</math>-th state to the <math>j</math>-th state. From this formula, it is not difficult to see that <math> EI </math> is only a function of the probability transition matrix [math]f[/math]. The intervention operation is performed to make the effective information objectively measure the causal characteristics of the dynamics without being affected by the distribution of the original input data.
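
As a minimal sketch of this calculation, the following Python function evaluates <math>EI</math> directly from a row-stochastic transition matrix, using the fact that under a uniform intervention the distribution of the next state is the column-wise mean of the matrix. The function name <code>effective_information</code> and the two toy matrices are illustrative assumptions, not part of the cited papers.

```python
import numpy as np

def effective_information(P, base=2.0):
    """EI of a row-stochastic matrix P when X_t is intervened to be uniform."""
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    # Distribution of the next state under do(X_t ~ U): column-wise mean of P.
    p_next = P.mean(axis=0)
    # Only entries with p_ij > 0 contribute (0 * log 0 is taken as 0).
    mask = P > 0
    ratio = np.ones_like(P)
    ratio[mask] = P[mask] / np.broadcast_to(p_next, P.shape)[mask]
    return float((P * np.log(ratio)).sum() / (n * np.log(base)))

identity = np.eye(4)              # deterministic and non-degenerate dynamics
uniform = np.full((4, 4), 0.25)   # every state leads anywhere with equal probability

ei_identity = effective_information(identity)  # log2(4) = 2 bits, up to rounding
ei_uniform = effective_information(uniform)    # 0 bits: the cause says nothing about the effect
print(ei_identity, ei_uniform)
```

The two extremes illustrate the range of the measure: a one-to-one deterministic map attains the maximum <math>\log_2 N</math>, while completely random dynamics give zero.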

=====Causal Emergence Measurement=====

We can judge the occurrence of causal emergence by comparing the magnitudes of effective information of macroscopic and microscopic dynamics in the system:

<math>

CE = EI\left ( f_M \right ) - EI\left (f_m \right )

</math>

Here <math>CE</math> is the causal emergence intensity. If the effective information of macroscopic dynamics is greater than that of microscopic dynamics (that is, <math>CE>0</math>), then we consider that macroscopic dynamics has causal emergence characteristics on the basis of this coarse-graining.

The coarse-graining of this matrix proceeds as follows. First, merge the first 7 states into a single macroscopic state, called A. Then, for the first 7 rows of [math]f_m[/math], sum the probability values in the first 7 columns and average over these rows to obtain the transition probability from macroscopic state A to itself, keeping the other values of [math]f_m[/math] unchanged. The new probability transition matrix after merging, denoted [math]f_M[/math], is shown in the right figure. It is a deterministic macroscopic Markov transition matrix, that is, the future state of the system is completely determined by its current state. In this case <math>EI(f_M)>EI(f_m)</math>, and causal emergence occurs in the system.
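
The merging step and the resulting comparison of effective information can be sketched numerically. The micro matrix used below is an assumption chosen to match the description (the first 7 states jump uniformly among themselves and the 8th state maps to itself); the helper names <code>effective_information</code> and <code>coarse_grain</code>, and the uniform weighting of source states inside a group, are likewise illustrative choices.

```python
import numpy as np

def effective_information(P):
    """EI in bits of a row-stochastic matrix under a uniform intervention."""
    P = np.asarray(P, dtype=float)
    p_next = P.mean(axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(P > 0, P * np.log2(P / p_next), 0.0)
    return float(terms.sum() / P.shape[0])

def coarse_grain(P, groups):
    """Lump micro states: sum over target states, average over source states."""
    P = np.asarray(P, dtype=float)
    M = np.zeros((len(groups), len(groups)))
    for a, src in enumerate(groups):
        for b, dst in enumerate(groups):
            M[a, b] = P[np.ix_(src, dst)].sum(axis=1).mean()
    return M

# Assumed micro dynamics f_m: states 0..6 mix uniformly, state 7 is a fixed point.
f_m = np.zeros((8, 8))
f_m[:7, :7] = 1.0 / 7.0
f_m[7, 7] = 1.0

# Macro dynamics f_M: merge states 0..6 into macro state A, keep state 7 as is.
f_M = coarse_grain(f_m, [list(range(7)), [7]])

ei_micro = effective_information(f_m)   # roughly 0.54 bits
ei_macro = effective_information(f_M)   # roughly 1 bit
CE = ei_macro - ei_micro                # positive, so causal emergence occurs
print(f_M, ei_micro, ei_macro, CE)
```

For this particular grouping every micro state in a group has the same lumped transition probabilities, so averaging over source states loses nothing; for other groupings the averaged matrix is only an approximation of the micro dynamics.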

[[File:状态空间中的因果涌现1.png|left|500x500px|Causal emergence on the state space|alt=]]

However, for more general Markov chains and more general state groupings, this simple operation of averaging probabilities is not always feasible, because the merged probability transition matrix may not satisfy the conditions of a Markov chain (for example, the rows may not be normalized, or elements may fall outside the range [0, 1]). For the kinds of Markov chains and state groupings that yield a feasible macroscopic Markov chain, see the section “Reduction of Markov Chains” later in this entry, or the entry “Coarse-graining of Markov Chains”.

Comparing the two, we find that the effective information of the macroscopic dynamics is greater than that of the microscopic dynamics, <math>EI(f_M)>EI(f_m)</math>, so causal emergence occurs in this system.

[[File:含有4个节点的布尔网络.png|left|700x700px|Causal emergence on a discrete Boolean network|alt=Causal emergence in a Boolean network with 4 nodes]]

=====Causal Emergence in Continuous Variables=====

====Rosas's Causal Emergence Theory====

From the perspective of [[information decomposition]] theory, Rosas et al. <ref name=":5" /> propose a method for defining causal emergence based on [[integrated information decomposition]], and further divide causal emergence into two parts: [[causal decoupling]] and [[downward causation]]. Causal decoupling represents the causal effect of the macroscopic state at the current moment on the macroscopic state at the next moment, and downward causation represents the causal effect of the macroscopic state at the current moment on the microscopic state at the next moment. The schematic diagrams of causal decoupling and downward causation are shown in the following figure. The microscopic state input is <math>X_t = (X_t^1,X_t^2,\ldots,X_t^n)</math>, and the macroscopic state <math>V_t</math> is obtained by coarse-graining the microscopic state variable <math>X_t</math>, so it is a supervenient feature of <math>X_t</math>. <math>X_{t + 1}</math> and <math>V_{t + 1}</math> represent the microscopic and macroscopic states at the next moment, respectively.

[[File:向下因果与因果解耦2.png|left|300x300px|Causal decoupling and downward causation]]

=====Partial Information Decomposition=====

Without loss of generality, assume that the microstate is <math>X=(X^1,X^2)</math>, that is, a two-dimensional variable, and that the macrostate is <math>V</math>. Then the mutual information between the two can be decomposed into four parts:

<math>I(X^1,X^2;V)=Red(X^1,X^2;V)+Un(X^1;V|X^2)+Un(X^2;V|X^1)+Syn(X^1,X^2;V)</math>

Among them, <math>Red(X^1,X^2;V)</math> represents redundant information, which refers to the information repeatedly provided by the two microstates <math>X^1</math> and <math>X^2</math> to the macrostate <math>V</math>; <math>Un(X^1;V|X^2)</math> and <math>Un(X^2;V|X^1)</math> represent unique information, which refers to the information provided by each microstate variable alone to the macrostate; <math>Syn(X^1,X^2;V)</math> represents synergistic information, which refers to the information provided by all microstates <math>X</math> jointly to the macrostate <math>V</math>.
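
To make the four terms concrete, the sketch below computes them for two toy systems. The decomposition is not unique until a redundancy measure is fixed; here the minimum-mutual-information (MMI) redundancy, <code>Red = min(I(X^1;V), I(X^2;V))</code>, is used purely as an illustrative assumption and is not necessarily the measure used in the cited papers. The function names and the XOR and copy examples are also hypothetical.

```python
from collections import defaultdict
from math import log2

def mutual_information(joint, source_idx, target_idx):
    """I(S;T) in bits; joint maps outcome tuples to probabilities."""
    ps, pt, pst = defaultdict(float), defaultdict(float), defaultdict(float)
    for outcome, p in joint.items():
        s = tuple(outcome[i] for i in source_idx)
        t = tuple(outcome[i] for i in target_idx)
        ps[s] += p
        pt[t] += p
        pst[(s, t)] += p
    return sum(p * log2(p / (ps[s] * pt[t])) for (s, t), p in pst.items() if p > 0)

def pid_mmi(joint):
    """Two-source PID of outcomes (x1, x2, v) with the MMI redundancy (an assumption)."""
    i1 = mutual_information(joint, [0], [2])       # I(X1;V)
    i2 = mutual_information(joint, [1], [2])       # I(X2;V)
    i12 = mutual_information(joint, [0, 1], [2])   # I(X1,X2;V)
    red = min(i1, i2)
    un1, un2 = i1 - red, i2 - red
    syn = i12 - red - un1 - un2                    # what is left after the other three atoms
    return {"Red": red, "Un1": un1, "Un2": un2, "Syn": syn}

# V = X1 XOR X2: neither source alone is informative, so all information is synergistic.
xor = {(x1, x2, x1 ^ x2): 0.25 for x1 in (0, 1) for x2 in (0, 1)}
# V = X1: all information is unique to the first source.
copy = {(x1, x2, x1): 0.25 for x1 in (0, 1) for x2 in (0, 1)}

pid_xor = pid_mmi(xor)
pid_copy = pid_mmi(copy)
print(pid_xor, pid_copy)
```

By construction the four atoms always add up to the joint mutual information, which is the identity stated above.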

1) When the unique information <math>Un(V_t;X_{t+1}| X_t^1,\ldots,X_t^n\ )>0 </math>, it means that the macroscopic state <math>V_t</math> at the current moment can provide more information to the overall system <math>X_{t + 1}</math> at the next moment than the microscopic state <math>X_t</math> at the current moment. At this time, there is causal emergence in the system;

2) The second method bypasses the selection of a specific macroscopic state <math>V_t</math>, and defines causal emergence only based on the synergistic information between the microscopic state <math>X_t</math> and the microscopic state <math>X_{t + 1}</math> at the next moment of the system. When the synergistic information <math>Syn(X_t^1,\ldots,X_t^n;X_{t + 1}^1,\ldots,X_{t + 1}^n)>0</math>, causal emergence occurs in the system.

=====Specific Example=====

[[File:因果解耦以及向下因果例子1.png|500x500px|left|Example of causal decoupling and downward causation]]

The authors of the paper <ref name=":5" /> give a specific example (shown above) to illustrate when causal decoupling, downward causation and causal emergence occur. The example is a special Markov process. Here, <math>p_{X_{t + 1}|X_t}(x_{t + 1}|x_t)</math> represents the dynamics, and <math>x_t=(x_t^1,\ldots,x_t^n)\in\{0,1\}^n</math> is the microstate. The transition probability is determined by comparing the values of <math>x_t</math> and <math>x_{t + 1}</math> at two consecutive moments. First, check whether the sum modulo 2 of all dimensions of <math>x_t</math> equals the first dimension of <math>x_{t + 1}</math>: if not, the probability is 0. Otherwise, check whether <math>x_t</math> and <math>x_{t + 1}</math> have the same sum modulo 2 over all dimensions: if they do, the probability is <math>\gamma/2^{n - 2}</math>; if they do not, it is <math>(1-\gamma)/2^{n - 2}</math>. Here <math>\gamma</math> is a parameter and <math>n</math> is the total dimension of <math>x</math>.
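
The rule above can be checked numerically. The following sketch builds the transition probabilities for a small system and confirms that each row is a valid probability distribution and that the parity (sum modulo 2) of the state is preserved with probability <math>\gamma</math>. The choices <code>n = 3</code> and <code>gamma = 0.9</code> and the function names are illustrative assumptions; the rule requires <math>n \geq 2</math>.

```python
from itertools import product

def parity(bits):
    """Sum modulo 2 of all dimensions of a binary state."""
    return sum(bits) % 2

def transition_probability(x_next, x_now, gamma):
    """p(x_next | x_now) for the parity example described above."""
    n = len(x_now)
    if x_next[0] != parity(x_now):
        return 0.0                          # first bit must equal the current parity
    if parity(x_next) == parity(x_now):
        return gamma / 2 ** (n - 2)         # parity preserved
    return (1.0 - gamma) / 2 ** (n - 2)     # parity flipped

n, gamma = 3, 0.9                            # illustrative values
states = list(product((0, 1), repeat=n))

# Each row of the transition matrix should sum to 1.
row_sums = [sum(transition_probability(y, x, gamma) for y in states) for x in states]

# Probability that the next state has the same parity as the current one.
same_parity = [
    sum(transition_probability(y, x, gamma) for y in states if parity(y) == parity(x))
    for x in states
]
print(row_sums, same_parity)
```

With the first bit fixed, exactly half of the remaining <math>2^{n-1}</math> configurations preserve the parity, which is why the factor <math>2^{n-2}</math> normalizes both branches.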

=====Singular Value Decomposition of Markov Chains=====

Given the Markov transition matrix <math>P</math> of a system, we can perform singular value decomposition on it to obtain two orthogonal matrices <math>U</math> and <math>V</math> and a diagonal matrix <math>\Sigma</math> such that <math>P = U\Sigma V^T</math>, where [math]\Sigma = \mathrm{diag}(\sigma_1,\sigma_2,\cdots,\sigma_N)[/math] and [math]\sigma_1\geq\sigma_2\geq\cdots\geq\sigma_N[/math] are the singular values of <math>P</math> arranged in descending order. <math>N</math> is the number of states of <math>P</math>.

=====Approximate Dynamical Reversibility and Effective Information=====

We can define the sum of the <math>\alpha</math> powers of the singular values (that is, the [math]\alpha[/math]-order Schatten norm of the matrix raised to the power [math]\alpha[/math]) as a measure of the approximate dynamical reversibility of the Markov chain, that is:

<math>

\Gamma_{\alpha}\equiv \sum_{i = 1}^N\sigma_i^{\alpha}

</math>

Here, [math]\alpha\in(0,2)[/math] is a specified parameter that acts as a weight or tendency to make [math]\Gamma_{\alpha}[/math] reflect determinism or degeneracy more. Under normal circumstances, we take [math]\alpha = 1[/math], which can make [math]\Gamma_{\alpha}[/math] achieve a balance between determinism and degeneracy.
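
A minimal numerical sketch of [math]\Gamma_{\alpha}[/math] follows, using NumPy's singular value routine. The function name <code>gamma_alpha</code> and the two test matrices are illustrative assumptions: a permutation matrix stands for fully reversible dynamics, and a matrix with identical uniform rows stands for fully irreversible dynamics.

```python
import numpy as np

def gamma_alpha(P, alpha=1.0):
    """Sum of the alpha-th powers of the singular values of P."""
    sigma = np.linalg.svd(np.asarray(P, dtype=float), compute_uv=False)
    return float(np.sum(sigma ** alpha))

# A cyclic permutation is invertible: all N singular values equal 1.
permutation = np.roll(np.eye(4), 1, axis=1)
# Identical uniform rows: rank 1, a single singular value equal to 1.
uniform = np.full((4, 4), 0.25)

gamma_perm = gamma_alpha(permutation)   # equals N = 4 up to rounding
gamma_unif = gamma_alpha(uniform)       # equals 1 up to rounding
print(gamma_perm, gamma_unif)
```

For fractional [math]\alpha[/math], singular values that are zero in exact arithmetic but tiny in floating point can contribute noticeably after taking the power, so a small threshold may be needed in practice.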

In addition, the authors in the literature prove that there is an approximate relationship between <math>EI</math> and [math]\Gamma_{\alpha}[/math]:

<math>

EI\sim \log\Gamma_{\alpha}

</math>

Moreover, to a certain extent, [math]\Gamma_{\alpha}[/math] can be used instead of EI to measure the degree of causal effect of Markov chains. Therefore, the so-called causal emergence can also be understood as an '''emergence of dynamical reversibility'''.

=====Quantification of Causal Emergence without Coarse-graining=====

However, the greatest value of this theory is that emergence can be quantified directly, without a coarse-graining strategy. If the rank of <math>P</math> is <math>r</math>, that is, all singular values from the <math>(r + 1)</math>-th onward are 0, then the dynamics <math>P</math> is said to have '''clear causal emergence''', and the numerical value of causal emergence is:

<math>

\Delta \Gamma_{\alpha} = \Gamma_{\alpha}(1/r - 1/N)

</math>

If the matrix <math>P</math> is full rank, but for any given small number <math>\epsilon</math>, there exists <math>r_{\epsilon}</math> such that starting from <math>r_{\epsilon}+1</math>, all singular values are less than <math>\epsilon</math>, then it is said that the system has a degree of '''vague causal emergence''', and the numerical value of causal emergence is:

<math>\Delta \Gamma_{\alpha}(\epsilon) = \frac{\sum_{i = 1}^{r_{\epsilon}} \sigma_{i}^{\alpha}}{r_{\epsilon}} - \frac{\sum_{i = 1}^{N} \sigma_{i}^{\alpha}}{N} </math>
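
Both cases can be handled by one short routine that counts the singular values above a threshold and applies the formula above. This is a sketch under assumptions: the function name <code>svd_causal_emergence</code>, the default threshold <code>eps</code>, and the 8-state test matrix (7 states mixing uniformly plus one fixed point) are illustrative and not taken from the cited paper.

```python
import numpy as np

def svd_causal_emergence(P, alpha=1.0, eps=1e-10):
    """Return (r_eps, Delta Gamma_alpha) from the singular value spectrum of P."""
    sigma = np.linalg.svd(np.asarray(P, dtype=float), compute_uv=False)
    n = len(sigma)
    r = int(np.sum(sigma > eps))          # number of singular values kept
    powers = sigma ** alpha
    delta = powers[:r].sum() / r - powers.sum() / n
    return r, float(delta)

# Assumed example: states 0..6 jump uniformly among themselves, state 7 maps to itself.
P = np.zeros((8, 8))
P[:7, :7] = 1.0 / 7.0
P[7, 7] = 1.0

r, delta = svd_causal_emergence(P)
print(r, delta)   # rank 2; Delta Gamma = 2 * (1/2 - 1/8) = 0.75 up to rounding
```

With a tiny <code>eps</code> the routine reproduces the clear-emergence value for rank-deficient matrices; choosing a larger <code>eps</code> gives the vague-emergence value for full-rank matrices whose trailing singular values are small.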

In summary, the advantage of this method is that it quantifies causal emergence more objectively, without relying on a specific coarse-graining strategy. Its disadvantage is that calculating [math]\Gamma_{\alpha}[/math] requires a singular value decomposition of <math>P</math>, so the computational complexity is [math]O(N^3)[/math], which is higher than that of <math>EI</math>. Moreover, [math]\Gamma_{\alpha}[/math] cannot be explicitly decomposed into two components, determinism and degeneracy.

=====Specific Example=====

[[File:Gamma例子.png|left|500x500px|Comparison of <math>EI</math> and <math>\Gamma</math>]]

The authors give four specific examples of Markov chains, whose state transition matrices are shown in the figure. We can compare the <math>EI</math> and the approximate dynamical reversibility (the <math>\Gamma</math> in the figure, that is, <math>\Gamma_{\alpha = 1}</math>) of these Markov chains. Comparing figures a and b shows that, across different state transition matrices, <math>\Gamma</math> decreases when <math>EI</math> decreases. Figures c and d compare the dynamics before and after coarse-graining: figure d is the coarse-graining of the state transition matrix in figure c (merging the first three states into one macroscopic state). Since the macroscopic state transition matrix in figure d is deterministic, the normalized <math>EI</math>, <math>eff\equiv EI/\log N</math>, and the normalized [math]\Gamma[/math], <math>\gamma\equiv \Gamma/N</math>, both reach the maximum value of 1.

=====Quantification of dynamic independence=====

Transfer entropy is a non-parametric statistic that measures the amount of directed (time-asymmetric) information transfer between two stochastic processes. The transfer entropy from process <math>X</math> to another process <math>Y</math> can be defined as the degree to which knowing the past values of <math>X</math> can reduce the uncertainty about the future value of <math>Y</math> given the past values of <math>Y</math>. The formula is as follows:

<math>T_t(X \to Y) \equiv I(Y_t ; X^-_t | Y^-_t) = H(Y_t | Y^-_t) - H(Y_t | Y^-_t, X^-_t)</math>

Here, <math>Y_t</math> represents the macroscopic variable at time <math>t</math>, and <math>X^-_t</math> and <math>Y^-_t</math> represent the microscopic and macroscopic variables before time <math>t</math>, respectively. <math>I</math> is mutual information and <math>H</math> is Shannon entropy. <math>Y</math> is dynamically decoupled with respect to <math>X</math> if and only if the transfer entropy from <math>X</math> to <math>Y</math> at time <math>t</math> vanishes, that is, <math>T_t(X \to Y)=0</math>.
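
To make the definition concrete, the following sketch estimates transfer entropy for discrete time series with a simple plug-in (frequency count) estimator. It is a simplification under stated assumptions: the histories <math>X^-_t</math> and <math>Y^-_t</math> are truncated to a single previous step, and the function name and the two toy series are hypothetical.

```python
import random
from collections import Counter
from math import log2

def transfer_entropy(x, y):
    """Plug-in estimate (bits) of T(X -> Y), using one step of history (an assumption)."""
    y_next, y_past, x_past = y[1:], y[:-1], x[:-1]
    n = len(y_next)
    c_full = Counter(zip(y_next, y_past, x_past))
    c_past = Counter(zip(y_past, x_past))
    c_self = Counter(zip(y_next, y_past))
    c_y = Counter(y_past)
    te = 0.0
    for (yn, yp, xp), c in c_full.items():
        p_given_both = c / c_past[(yp, xp)]       # p(y_next | y_past, x_past)
        p_given_self = c_self[(yn, yp)] / c_y[yp]  # p(y_next | y_past)
        te += (c / n) * log2(p_given_both / p_given_self)
    return te

rng = random.Random(0)
x = [rng.randint(0, 1) for _ in range(5000)]   # stand-in for the micro process
driven = [0] + x[:-1]                          # y copies the previous x: not decoupled
autonomous = [t % 2 for t in range(5000)]      # y depends only on its own past

te_driven = transfer_entropy(x, driven)
te_autonomous = transfer_entropy(x, autonomous)
print(te_driven, te_autonomous)
```

The autonomous series has zero estimated transfer entropy, so it is dynamically decoupled from <math>X</math> in the sense above, while the driven series is not.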

===Comparison of Several Causal Emergence Theories===

We can compare the above four different quantitative causal emergence theories from several different dimensions such as whether causality is considered, whether a coarse-graining function needs to be specified, the applicable dynamical systems, and quantitative indicators, and obtain the following table:

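{| class="wikitable"
!Method!!Is causality considered?!!Is a coarse-graining function required?!!Applicable dynamical systems!!Quantitative indicator
|-
|Dynamic independence <ref name=":6"/>||Granger causality||Requires specifying a coarse-graining method||Arbitrary dynamics||Dynamic independence: transfer entropy
|}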

==Identification of Causal Emergence==

The previous sections introduced work on quantifying emergence through causal measures and other information-theoretic indicators. In practical applications, however, we can often only collect observational data and cannot obtain the true dynamics of the system. Identifying whether causal emergence has occurred from observable data is therefore a more pressing problem. The following introduces two kinds of identification methods: the approximate methods based on Rosas's causal emergence theory (one based on mutual information approximation and one based on machine learning), and the Neural Information Squeezer methods (NIS, NIS+) proposed by Chinese scholars.

====Approximate Method Based on Rosas Causal Emergence Theory====

Rosas's causal emergence theory includes a quantification method based on synergistic information and one based on unique information. The second can bypass the combinatorial explosion over multiple variables, but it depends on the coarse-graining method and on the choice of the macroscopic state variable <math>V</math>. To address this, the authors give two solutions: either the researcher specifies a macroscopic state <math>V</math>, or a machine learning method lets the system learn <math>V</math> automatically by maximizing <math>\mathrm{\Psi}</math>. We introduce the two methods in turn:

=====Method Based on Mutual Information Approximation=====

Although Rosas's causal emergence theory gives a strict definition of causal emergence, its calculation suffers from a combinatorial explosion over the many variables involved, so the method is difficult to apply to real systems. To solve this problem, Rosas et al. bypassed the exact calculation of unique and synergistic information <ref name=":5" />, proposed an approximate formula that only requires mutual information, and derived a sufficient condition for the occurrence of causal emergence.

The authors proposed three new indicators based on mutual information, <math>\mathrm{\Psi}</math>, <math>\mathrm{\Delta}</math> and <math>\mathrm{\Gamma}</math>, which can be used to identify causal emergence, downward causation and causal decoupling in the system, respectively. The specific formulas of the three indicators are as follows:

* Indicator for judging causal emergence:

{{NumBlk|:|

<math>\Psi_{t, t + 1}(V):=I\left(V_t ; V_{t + 1}\right)-\sum_j I\left(X_t^j ; V_{t + 1}\right)</math>

|{{EquationRef|1}}}}

Here <math>X_t^j</math> represents the <math>j</math>-th dimension of the microscopic variable at time <math>t</math>, and <math>V_t</math> and <math>V_{t + 1}</math> represent the macroscopic state variable at two consecutive times. According to Rosas et al., emergence occurs in the system when <math>\mathrm{\Psi}>0</math>. When <math>\mathrm{\Psi}\leq 0</math>, however, we cannot determine whether <math>V</math> is emergent, because <math>\mathrm{\Psi}>0</math> is only a sufficient condition for causal emergence.

* Indicator for judging downward causation:

<math>\Delta_{t, t + 1}(V):=\max _j\left(I\left(V_t ; X_{t + 1}^j\right)-\sum_i I\left(X_t^i ; X_{t + 1}^j\right)\right)</math>

When <math>\mathrm{\Delta}>0</math>, there is downward causation from the macroscopic state <math>V</math> to the microscopic variable <math>X</math>.

* Indicator for judging causal decoupling:

<math>\Gamma_{t, t + 1}(V):=\max _j I\left(V_t ; X_{t + 1}^j\right)</math>

When <math>\mathrm{\Psi}>0</math> and <math>\mathrm{\Gamma}=0</math>, causal emergence occurs in the system and there is causal decoupling.
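
The three indicators only require pairwise mutual information, so they can be estimated directly from discrete samples. The sketch below uses a plug-in (frequency count) estimator and a hypothetical two-bit system whose parity flips at every step. The estimator, the function names and the toy system are assumptions made for illustration, not the procedure used by Rosas et al.

```python
import random
from collections import Counter
from math import log2

def mutual_information(a, b):
    """Plug-in estimate (bits) of I(A; B) from two equally long discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(c / n * log2(c * n / (pa[u] * pb[v])) for (u, v), c in pab.items())

def emergence_indicators(X, V):
    """X: list of micro states (tuples of length d); V: list of macro values.
    Returns plug-in estimates of (Psi, Delta, Gamma)."""
    d = len(X[0])
    x_now = [[row[j] for row in X[:-1]] for j in range(d)]
    x_next = [[row[j] for row in X[1:]] for j in range(d)]
    v_now, v_next = V[:-1], V[1:]
    psi = mutual_information(v_now, v_next) - sum(
        mutual_information(x_now[j], v_next) for j in range(d))
    down = [mutual_information(v_now, x_next[j]) for j in range(d)]
    delta = max(
        down[j] - sum(mutual_information(x_now[i], x_next[j]) for i in range(d))
        for j in range(d))
    gamma = max(down)
    return psi, delta, gamma

# Hypothetical system: the parity of two bits flips every step,
# while each individual bit looks like a fair coin.
rng = random.Random(0)
X, parity = [], 0
for _ in range(4000):
    first = rng.randint(0, 1)
    X.append((first, first ^ parity))
    parity = 1 - parity
V = [a ^ b for a, b in X]  # macro variable: parity of the two bits

psi, delta, gamma = emergence_indicators(X, V)
print(psi, delta, gamma)
```

Here the estimate of <math>\mathrm{\Psi}</math> is close to 1 bit and the estimate of <math>\mathrm{\Gamma}</math> is close to 0, which corresponds to the causal decoupling case. Plug-in estimates of mutual information are slightly biased upward, so on finite data <math>\mathrm{\Gamma}</math> is only approximately zero.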

The reason why <math>\mathrm{\Psi}</math> can be used to identify causal emergence is that it is a lower bound on the unique information. We have the following relationship:

<math>Un(V_t;X_{t + 1}|X_t)\geq I\left(V_t ; V_{t + 1}\right)-\sum_j I\left(X_t^j ; V_{t + 1}\right)+Red(V_t, V_{t + 1};X_t)</math>

Since <math>Red(V_t, V_{t + 1};X_t)</math> is non-negative, <math>\Psi_{t, t + 1}(V)>0</math> implies that the unique information is positive. This gives a sufficient but not necessary condition for causal emergence: <math>\Psi_{t, t + 1}(V)>0</math>.

In summary, this method is relatively convenient to compute because it relies only on mutual information, and it makes no Markov assumption about the dynamics of the system. However, it also has several shortcomings: 1) the three indicators <math>\mathrm{\Psi}</math>, <math>\mathrm{\Delta}</math> and <math>\mathrm{\Gamma}</math> are computed from mutual information only and do not consider causality; 2) the method yields only a sufficient condition for causal emergence; 3) the method depends on the choice of macroscopic variable, and different choices can lead to significantly different results; 4) when the system has a large amount of redundant information or many variables, the computational complexity becomes very high. In addition, because <math>\Psi</math> is an approximation, its error can be very large in high-dimensional systems and it easily takes negative values, in which case the occurrence of causal emergence cannot be judged.

To verify that the information related to macaque movement is an emergent feature of cortical activity, Rosas et al. carried out the following experiment. They used the electrocorticogram (ECoG) of macaques as observational data of the microscopic dynamics. To obtain the macroscopic state variable <math>V</math>, they chose time series of the macaques' limb movement trajectories obtained by motion capture (MoCap); ECoG and MoCap consist of data from 64 channels and 3 channels, respectively. Since the original MoCap data do not satisfy the conditional independence assumption required of a supervenient feature, they used partial least squares and support vector machine algorithms to infer the part of the neural activity encoded in the ECoG signal that is relevant for predicting the macaques' behavior, and conjectured that this information is an emergent feature of the underlying neural activity. Finally, based on the microscopic state and the computed macroscopic features, the authors verified the existence of causal emergence.

=====Machine Learning-based Method=====

Kaplanis et al. <ref name=":2" />, drawing on representation learning, use an algorithm to learn the macroscopic state variable <math>V</math> automatically by maximizing <math>\mathrm{\Psi}</math> (i.e., Equation {{EquationNote|1}}). Specifically, the authors use a neural network <math>f_{\theta}</math> to learn the representation function that coarse-grains the microscopic input <math>X_t</math> into the macroscopic output <math>V_t</math>, and use neural networks <math>g_{\phi}</math> and <math>h_{\xi}</math> to estimate the mutual information terms <math>I(V_t;V_{t + 1})</math> and <math>\sum_i I(V_{t + 1};X_{t}^i)</math>, respectively. The networks are then optimized by maximizing the difference between the two terms (i.e., <math>\mathrm{\Psi}</math>). The architecture of this neural network system is shown in Figure a below.

[[文件:学习因果涌现表征的架构.png|居左|600x600像素|Architecture for learning causally emergent representations]]

Figure b shows a toy example. The microscopic input <math>X_t=(X_t^1,...,X_t^6) \in \{0,1\}^6</math> has six dimensions, each taking the value 0 or 1, and <math>X_{t + 1}</math> is the state that follows <math>X_{t}</math> at the next moment. The macroscopic state is <math>V_{t}=\oplus_{i = 1}^{5}X_t^i</math>, that is, the sum of the first five dimensions of <math>X_t</math> modulo 2 (their parity). With probability <math>\gamma</math>, the macroscopic states at two consecutive moments are equal (<math>p(\oplus_{j = 1..5}X_{t + 1}^j=\oplus_{j = 1..5}X_t^j)= \gamma</math>). Likewise, with probability <math>\gamma_{extra}</math>, the sixth dimension of the microscopic input keeps its value between two consecutive moments (<math>p(X_{t + 1}^6=X_t^6)= \gamma_{extra}</math>).
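
The toy model is small enough to simulate directly. The sketch below is a minimal illustration under one extra assumption that is not stated above: apart from the parity constraint, the first five bits of <math>X_{t + 1}</math> are drawn uniformly at random. Under that assumption each single bit carries no information about the next parity, so the ground-truth value is <math>\Psi = 1 - H_b(\gamma)</math>, where <math>H_b</math> is the binary entropy. The function names are hypothetical, and <math>\Psi</math> is estimated with a plug-in estimator rather than with neural networks.

```python
import random
from collections import Counter
from math import log2

def mutual_information(a, b):
    """Plug-in estimate (bits) of I(A; B) from two equally long discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(c / n * log2(c * n / (pa[u] * pb[v])) for (u, v), c in pab.items())

def simulate(T, gamma, gamma_extra, seed=0):
    """Sample a trajectory of the six-bit parity model (uniform bits given the parity)."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(6)]
    traj = [tuple(x)]
    for _ in range(T - 1):
        parity = sum(x[:5]) % 2
        target = parity if rng.random() < gamma else 1 - parity
        first4 = [rng.randint(0, 1) for _ in range(4)]
        fifth = (target - sum(first4)) % 2          # fixes the parity of bits 1-5
        sixth = x[5] if rng.random() < gamma_extra else 1 - x[5]
        x = first4 + [fifth, sixth]
        traj.append(tuple(x))
    return traj

def estimate_psi(traj):
    """Plug-in estimate of Psi for the macro variable V = parity of the first five bits."""
    V = [sum(x[:5]) % 2 for x in traj]
    v_now, v_next = V[:-1], V[1:]
    micro = sum(mutual_information([x[j] for x in traj[:-1]], v_next) for j in range(6))
    return mutual_information(v_now, v_next) - micro

def binary_entropy(p):
    return -p * log2(p) - (1 - p) * log2(1 - p)

gamma = 0.9
traj = simulate(20000, gamma, gamma_extra=0.9)
psi_estimate = estimate_psi(traj)
psi_exact = 1 - binary_entropy(gamma)
print(psi_estimate, psi_exact)
```

With these settings the estimate should land close to the analytic value, which gives a simple reference point for checking any learned estimate of <math>\mathrm{\Psi}</math> on this model.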

The results show that, for the simple example in Figure b, maximizing <math>\mathrm{\Psi}</math> with the model in Figure a yields a learned <math>\mathrm{\Psi}</math> approximately equal to the ground-truth <math>\mathrm{\Psi}</math>, which verifies the effectiveness of model learning; the system can correctly judge the occurrence of causal emergence. However, this method has difficulty handling complex multivariate situations, because the number of neural networks on the right side of the figure is proportional to the number of macroscopic and microscopic variable pairs. The more microscopic variables (dimensions) there are, the more neural networks are needed, which increases the computational complexity. In addition, the method has been tested on only a few cases, so it cannot yet be applied at scale. Finally, and more importantly, because the network computes an approximate indicator of causal emergence and yields a sufficient but not necessary condition for emergence, it inherits the drawbacks of the approximate method described above.

====Neural Information Compression Method====

In recent years, emerging artificial intelligence technologies have overcome a series of major problems. Machine learning methods, equipped with carefully designed neural network structures and automatic differentiation, can approximate any function in a huge function space. Zhang Jiang et al. therefore proposed a data-driven method based on neural networks to identify causal emergence from time series data [46][40]. This method can automatically extract effective coarse-graining strategies and macroscopic dynamics, overcoming various deficiencies of the Rosas method [37].

In this work, the input is time series data <math>(X_1,X_2,...,X_T )</math>, where <math>X_t\equiv (X_t^1,X_t^2,\ldots,X_t^p )</math> and <math>p</math> is the dimension of the input data. The authors assume that this data is generated by a general stochastic dynamical system:

<math>\frac{d X}{d t}=f(X(t), \xi)</math>

Here <math>X(t)</math> is the microscopic state variable, <math>f</math> is the microscopic dynamics, and <math>\xi</math> represents the noise in the system dynamics, which models the random characteristics of the dynamical system. However, <math>f</math> is unknown.

The causal emergence identification problem refers to the following functional optimization problem:

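{{NumBlk|:|

<math>
\begin{aligned}
&\max_{\phi,f_{q},\phi^{\dagger}}\mathcal{J}(f_{q}),\\
&s.t.\begin{cases}\left\|\hat{X}_{t + 1}-X_{t + 1}\right\|<\epsilon,\\ \hat{X}_{t + 1}=\phi^{\dagger}\left(f_{q}(\phi(X_{t}))\right).\end{cases}
\end{aligned}
</math>

|{{EquationRef|2}}}}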

Here, <math>\mathcal{J}</math> is the dimension-averaged <math>EI</math> (see the entry effective information), <math>\mathrm{\phi}</math> is the coarse-graining strategy function, <math>f_{q}</math> is the macroscopic dynamics, <math>q</math> is the dimension of the coarse-grained macroscopic state, and <math>\hat{X}_{t + 1}</math> is the prediction of the microscopic state at time <math>t + 1</math> by the entire framework. This prediction is obtained by applying the inverse coarse-graining function <math>\phi^{\dagger}</math> to the macroscopic state prediction <math>\hat{Y}_{t + 1}</math> at time <math>t + 1</math>. Here <math>\hat{Y}_{t + 1}\equiv f_q(Y_t)</math> is the dynamics learner's prediction of the macroscopic state at time <math>t + 1</math> from the macroscopic state <math>Y_t</math> at time <math>t</math>, and <math>Y_t\equiv \phi(X_t)</math> is obtained by coarse-graining <math>X_t</math> with <math>\phi</math>. Finally, <math>\hat{X}_{t + 1}</math> is compared with the real microscopic state data <math>X_{t + 1}</math> to obtain the microscopic prediction error.

The entire optimization framework is shown below:

[[文件:NIS_Optimization.png|替代=NIS optimization framework|居左|400x400像素|NIS optimization framework]]

The objective function of this optimization problem is <math>EI</math>, which is a functional of the functions <math>\phi,f_q,\phi^{\dagger}</math> (the macroscopic dimension <math>q</math> is a hyperparameter), so it is difficult to optimize directly, and machine learning methods are needed to solve it.

=====NIS=====

To identify causal emergence in the system, the authors propose the Neural Information Squeezer (NIS) neural network architecture [46]. The architecture follows an encoder, dynamics learner, decoder framework: the model consists of three parts, used respectively for coarse-graining the original data into the macroscopic state, fitting the macroscopic dynamics, and inverse coarse-graining (decoding the macroscopic state, combined with random noise, into the microscopic state). The authors use an invertible neural network (INN) to construct the encoder and decoder, which approximately correspond to the coarse-graining function <math>\phi</math> and the inverse coarse-graining function <math>\phi^{\dagger}</math>, respectively. An invertible neural network is used because the network can simply be inverted to obtain the inverse coarse-graining function (i.e., <math>\phi^{\dagger}\approx \phi^{-1}</math>). The framework can be regarded as a neural information squeezer: it pushes the noisy microscopic state data through a narrow information channel, compresses it into a macroscopic state and discards useless information, so that the causality of the macroscopic dynamics is stronger, and then decodes the result into a prediction of the microscopic state. The model framework of the NIS method is shown in the following figure:

[[文件:NIS模型框架图.png|居左|500x500像素|替代=NIS model framework diagram|NIS model framework diagram]]

Specifically, the encoder function <math>\phi</math> consists of two parts:

<math>\phi\equiv \chi\circ\psi</math>

Here <math>\psi</math> is an invertible function implemented by an invertible neural network, and <math>\chi</math> is a projection function that removes the last <math>p - q</math> components of a <math>p</math>-dimensional vector, where <math>p</math> and <math>q</math> are the dimensions of the microscopic state and the macroscopic state, respectively. <math>\circ</math> denotes the composition of functions.

The decoder is the function <math>\phi^{\dagger}</math>, which is defined as:

<math>\phi^{\dagger}(y)\equiv \psi^{-1}(y\oplus z)</math>

Here <math>\oplus</math> denotes vector concatenation, and <math>z\sim\mathcal{N}\left (0,I_{p - q}\right )</math> is a <math>(p-q)</math>-dimensional random vector that follows the standard normal distribution.
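
The structure of the encoder and decoder can be sketched in a few lines. In the illustration below, a fixed orthogonal matrix stands in for the invertible neural network <math>\psi</math>; this is an assumption made only to keep the example short, since the actual NIS model trains an INN. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 4, 2  # micro and macro dimensions (illustrative values)

# Stand-in for the invertible network psi: an orthogonal matrix, so psi^{-1} = Q^T.
Q, _ = np.linalg.qr(rng.normal(size=(p, p)))

def psi(x):
    return Q @ x

def psi_inverse(u):
    return Q.T @ u

def encode(x):
    """phi = chi o psi: apply psi, then drop the last p - q components."""
    return psi(x)[:q]

def decode(y, z=None):
    """phi^dagger(y) = psi^{-1}(y concatenated with z), with z ~ N(0, I_{p-q}) by default."""
    if z is None:
        z = rng.standard_normal(p - q)
    return psi_inverse(np.concatenate([y, z]))

x = rng.normal(size=p)
y = encode(x)        # macro state
x_hat = decode(y)    # micro reconstruction with fresh noise in the discarded slots
print(np.allclose(encode(x_hat), y))
```

Decoding followed by encoding always returns the same macroscopic state, whatever noise <math>z</math> is drawn, while the microscopic state is recovered exactly only if <math>z</math> equals the components that the projection discarded.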

However, directly optimizing the dimension-averaged effective information is difficult. The article [46] does not directly optimize Equation {{EquationNote|2}}; instead, the authors divide the optimization process into two stages. The first stage minimizes the microscopic state prediction error for a given macroscopic scale <math>q</math>, that is, <math>\min _{\phi, f_q, \phi^{\dagger}}\left\|\phi^{\dagger}(\hat{Y}_{t + 1}) - X_{t + 1}\right\|</math>, so that the error falls below <math>\epsilon</math>, and obtains the optimal macroscopic dynamics <math>f_q^\ast</math>. The second stage searches over the hyperparameter <math>q</math> to maximize the effective information <math>\mathcal{J}</math>, that is, <math>\max_{q}\mathcal{J}(f_{q}^\ast)</math>. Practice has shown that this method can effectively find macroscopic dynamics and coarse-graining functions, but it does not truly maximize EI.

In addition to automatically identifying causal emergence from time series data, this framework has good theoretical properties. There are two important theorems:

'''Theorem 1''' (information bottleneck of the neural information squeezer): for any bijection <math>\mathrm{\psi}</math>, projection <math>\chi</math>, macroscopic dynamics <math>f</math> and Gaussian noise <math>z_{p - q}\sim\mathcal{N}\left (0,I_{p - q}\right )</math>,

<math>I\left(Y_t;Y_{t + 1}\right)=I\left(X_t;{\hat{X}}_{t + 1}\right)</math>

always holds. This means that all the information discarded by the encoder is noise that is irrelevant to prediction.

'''Theorem 2''': for a trained model, <math>I\left(X_t;{\hat{X}}_{t + 1}\right)\approx I\left(X_t;X_{t + 1}\right)</math>. Combining Theorem 1 and Theorem 2, we therefore obtain for a trained model:

<math>I\left(Y_t;Y_{t + 1}\right)\approx I\left(X_t;X_{t + 1}\right)</math>

======Comparison with Classical Theories======

The NIS framework has many similarities with the computational mechanics framework mentioned in the previous sections. NIS can be regarded as an <math>\epsilon</math>-machine. The set of all historical processes <math>\overleftarrow{S}</math> in computational mechanics can be regarded as the microscopic states, and all <math>R \in \mathcal{R}</math> represent macroscopic states. The function <math>\eta</math> can be understood as a coarse-graining function, <math>\epsilon</math> as an effective coarse-graining strategy, and <math>T</math> corresponds to the effective macroscopic dynamics. The property of minimal randomness characterizes the determinism of the macroscopic dynamics and can be replaced by effective information in causal emergence. When the entire framework is fully trained and can accurately predict the future microscopic state, the encoded macroscopic state converges to the effective state, which can be regarded as the causal state in computational mechanics.

The NIS framework also has similarities with the G-emergence theory mentioned earlier. For example, NIS also adopts the idea of Granger causality: it optimizes the effective macroscopic state by predicting the microscopic state at the next time step. However, there are several obvious differences between the two frameworks: a) in the G-emergence theory the macroscopic state must be selected manually, while in NIS it is obtained by automatically optimizing the coarse-graining strategy; b) NIS uses neural networks to predict future states, while G-emergence uses autoregressive techniques to fit the data.

======Computational Examples======

The authors of NIS conducted experiments on a spring oscillator model; the results are shown in the following figure. Figure a shows that the encoding of the next moment coincides linearly with the iteration of the macroscopic dynamics, verifying the effectiveness of the model. Figure b shows that the two learned dynamics also coincide with the real dynamics, further verifying the model. Figure c shows the multi-step prediction of the model; the predicted and real curves are very close. Figure d shows the magnitude of causal emergence at different scales. Causal emergence is most significant when the scale is 2, which corresponds to the real spring oscillator model, where only two states (position and velocity) are needed to describe the entire system.

[[文件:弹簧振子模型1.png|居左|600x600像素|替代=Spring oscillator model 1|Spring oscillator model]]

=====NIS+=====

Although NIS was the first to propose a scheme that optimizes EI to identify causal emergence in data, the method has a shortcoming: the authors divide the optimization process into two stages and do not truly maximize the effective information, that is, Equation {{EquationNote|2}}. Yang Mingzhe et al. [40] therefore improved the method and proposed the NIS+ scheme. By introducing reverse dynamics and a reweighting technique, and by means of a variational inequality, the original maximization of effective information is transformed into the maximization of its variational lower bound, so that the objective function can be optimized directly.

======Mathematical Principles======

Specifically, using the variational inequality and inverse probability weighting, the constrained optimization problem given by Equation {{EquationNote|2}} can be transformed into the following unconstrained minimization problem:

<math>\min_{\omega,\theta,\theta'} \sum_{t = 0}^{T - 1}w(\boldsymbol{x}_t)||\boldsymbol{y}_t-g_{\theta'}(\boldsymbol{y}_{t + 1})||+\lambda||\hat{\boldsymbol{x}}_{t + 1}-\boldsymbol{x}_{t + 1}||</math>

Here <math>g</math> is the reverse dynamics, which can be approximated by a neural network and trained on the macroscopic state pairs <math>y_{t + 1},y_{t}</math>. <math>w(x_t)</math> is the inverse probability weight, calculated as follows:

<math>w(\boldsymbol{x}_t)=\frac{\tilde{p}(\boldsymbol{y}_t)}{p(\boldsymbol{y}_t)}=\frac{\tilde{p}(\phi(\boldsymbol{x}_t))}{p(\phi(\boldsymbol{x}_t))}</math>

Here <math>\tilde{p}(\boldsymbol{y}_{t})</math> is the target distribution and <math>p(\boldsymbol{y}_{t})</math> is the original distribution of the data.
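
The weight formula can be illustrated on one-dimensional macroscopic states. The sketch below makes two assumptions that go beyond the text: the target distribution <math>\tilde{p}</math> is taken to be uniform over the observed range of <math>y</math>, and the original density <math>p</math> is estimated with a histogram. The function name and the toy data are hypothetical, and the estimator used in NIS+ may differ.

```python
import numpy as np

def inverse_probability_weights(y, bins=10):
    """Weights w = p_target(y) / p(y) for scalar macro states y.
    Assumes a uniform target density and a histogram estimate of p."""
    y = np.asarray(y, dtype=float)
    edges = np.linspace(y.min(), y.max(), bins + 1)
    idx = np.digitize(y, edges[1:-1])           # bin index of every sample, 0..bins-1
    counts = np.bincount(idx, minlength=bins)
    width = (edges[-1] - edges[0]) / bins
    p = counts / (y.size * width)               # histogram density per bin
    p_target = 1.0 / (edges[-1] - edges[0])     # uniform density on the observed range
    return p_target / p[idx]

# Toy data: the macro state sits near 0 ninety percent of the time.
y = np.array([0.0] * 90 + [1.0] * 10)
w = inverse_probability_weights(y, bins=2)
print(w[0], w[-1])
```

After reweighting, the rarely visited region carries the same total weight as the frequently visited one, which is the effect the weights are meant to have on the training loss. The sketch does not handle the degenerate case where all values of <math>y</math> are identical.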

======Workflow and Model Architecture======

The following figure shows the overall framework of NIS+. Figure a is the input of the model: time series data, such as trajectory sequences, continuous image sequences or EEG time series. Figure c is the output of the model, including the degree of causal emergence, the macroscopic dynamics, the emergent patterns and the coarse-graining strategy. Figure b is the model architecture; compared with NIS, two components are added: reverse dynamics and the reweighting technique.

[[文件:NIS+.png|居左|600x600像素|替代=NIS+ model framework diagram|NIS+ model framework diagram]]

======Case Analysis======

The article reports experiments on several time series data sets, including data generated by the SIR model of disease transmission dynamics, the bird flock model (Boids model) and the cellular automaton Game of Life, as well as fMRI signals from the brains of real human subjects. Here we describe the bird flock and brain signal experiments.

The following figure shows the experimental results of NIS+ learning the flocking behavior of the Boids model. Panels (a) and (e) show the actual and predicted trajectories of the flock under different conditions. Specifically, the authors divide the flock into two groups and compare multi-step predictions at two noise levels (<math>\alpha = 0.001</math> and <math>\alpha = 0.4</math>). The prediction is good when the noise is small, and the predicted curve diverges when the noise is large. Panel (b) shows that the mean absolute error (MAE) of multi-step prediction gradually increases as the radius r increases. Panel (c) shows how the causal emergence measure <math>\Delta J</math> and the prediction error (MAE) change over training epochs for different macroscopic dimensions q. The authors find that causal emergence is most significant when the macroscopic state dimension is q = 8. Panel (d) is an attribution analysis of macroscopic variables with respect to microscopic variables; the resulting saliency map gives an intuitive picture of the learned coarse-graining function. Each macroscopic dimension can be matched to the spatial coordinate (microscopic dimension) of an individual bird. The darker the color, the higher the correlation, and the microscopic coordinate with the maximum correlation for each macroscopic dimension is highlighted with an orange dot. The attribution values are obtained with the Integrated Gradients (IG) method. The horizontal axis represents the x and y coordinates of the 16 birds in the microscopic state, and the vertical axis represents the 8 macroscopic dimensions. The light blue dotted lines separate the coordinates of individual boids, and the solid blue line separates the two flocks. Panels (f) and (g) show how <math>\Delta J</math> and the normalized MAE change with the noise level: (f) shows the influence of external noise (observation noise added to the microscopic data), and (g) shows the influence of internal noise (<math>\alpha</math>, introduced by modifying the dynamics of the Boids model). In (f) and (g), the horizontal line marks the threshold of the error constraint in Equation {{EquationNote|1}}: when the normalized MAE exceeds 0.3, the constraint is violated and the result is unreliable.

This set of experiments shows that NIS+ can learn macroscopic states and coarse-graining strategies by maximizing EI, and that this maximization improves the model's ability to generalize beyond the range of the training data. The learned macroscopic state effectively captures the average group behavior and can be attributed to individual positions using the Integrated Gradients method. In addition, the degree of causal emergence increases with external noise and decreases with internal noise. This observation suggests that the model can eliminate external noise through coarse-graining but cannot reduce internal noise.
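
The attribution step can be illustrated with a minimal sketch of Integrated Gradients. This is not the NIS+ code: the toy macroscopic variable <code>macro</code>, its weights and the all-zero baseline are hypothetical stand-ins for a learned encoder, and the path integral is approximated by a midpoint Riemann sum.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=200):
    """Approximate Integrated Gradients with a midpoint Riemann sum.

    grad_fn(point) must return the gradient of the scalar output
    with respect to the input at `point`.
    """
    x = np.asarray(x, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    # Interpolation coefficients at the midpoints of `steps` equal intervals.
    alphas = (np.arange(steps) + 0.5) / steps
    path = baseline + alphas[:, None] * (x - baseline)  # shape (steps, dim)
    avg_grad = np.mean([grad_fn(point) for point in path], axis=0)
    return (x - baseline) * avg_grad

# Hypothetical macroscopic variable: a weighted sum of squared micro-coordinates.
weights = np.array([0.5, 2.0, 0.0, 1.0])

def macro(x):
    return float(np.sum(weights * x ** 2))

def macro_grad(x):
    return 2.0 * weights * x

x = np.array([1.0, 1.0, 3.0, -2.0])
baseline = np.zeros(4)
attributions = integrated_gradients(macro_grad, x, baseline)

print(attributions)  # one attribution value per micro-coordinate
# Completeness: attributions sum to macro(x) - macro(baseline).
print(attributions.sum(), macro(x) - macro(baseline))
```

The micro-coordinate with the largest attribution plays the role of the highlighted dot in a saliency map; the third coordinate has zero weight and therefore receives zero attribution, however large its value.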

[[File:NIS+ boids.png|left|700x700px|Causal emergence in bird flocks]]

The brain experiment is based on real fMRI data from 830 human subjects, recorded under two conditions. In the first, the subjects performed a visual task (watching a short movie clip); in the second, they were in a resting state. Because the original dimension is high, the authors first reduced the original 14000-dimensional data to 100 dimensions using the Schaefer atlas, so that each dimension corresponds to a brain region. They then trained NIS+ on these data and extracted the dynamics at six different macroscopic scales. Figure a shows the multi-step prediction errors at different scales. Figure b compares the EI of the NIS and NIS+ methods at different macroscopic dimensions in the resting state and in the movie-watching task. The authors found that in the visual task, causal emergence is most significant when the macroscopic state dimension is q = 1. Attribution analysis shows that the visual area plays the largest role (Figure c), which is consistent with the nature of the task. Figure d shows the brain region attributions from different viewing angles. In the resting state, one macroscopic dimension is not enough to predict the microscopic time series, and causal emergence is largest for dimensions between 3 and 7.

[[File:NIS+ 脑数据.png|left|700x700px|Causal emergence in the brain nervous system]]

These experiments show that NIS+ can identify causal emergence in data and discover emergent macroscopic dynamics and coarse-graining strategies. Other experiments show that NIS+ can also improve the out-of-distribution generalization ability of the model through EI maximization.

==Applications==

This section explains the potential applications of causal emergence in various complex systems, including biological systems, neural networks, the brain nervous system and artificial intelligence (causal representation learning, reinforcement learning based on world models, causal model abstraction), as well as some other potential applications (consciousness research and Chinese classical philosophy).

===Causal emergence in complex networks===

In 2020, Klein and Hoel extended the method of quantifying causal emergence on Markov chains to complex networks [47]. The authors defined a Markov chain on the network with the help of random walkers: placing a random walker on a node is equivalent to intervening on that node, and the transition probability matrix between nodes is defined by the random walk probabilities. The authors also established a connection between effective information and the connectivity of the network, which can be characterized by the uncertainty of the weights of the outgoing and incoming edges of nodes. Effective information in complex networks is defined on this basis. For detailed methods, refer to Causal emergence in complex networks.
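
The random-walk construction can be sketched in a few lines. The sketch assumes that effective information takes the form "entropy of the average out-weight distribution minus the average entropy of each node's out-weight distribution", which is our reading of the definition in [47]; the helper names <code>entropy_bits</code> and <code>network_ei</code> are illustrative and not taken from any published code.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a probability vector (zero entries skipped)."""
    p = np.asarray(p, dtype=float)
    nonzero = p[p > 0]
    return float(-np.sum(nonzero * np.log2(nonzero)))

def network_ei(adj):
    """Effective information of a random walk on a (weighted) adjacency matrix.

    Assumes every node has at least one outgoing edge.
    """
    adj = np.asarray(adj, dtype=float)
    # Row i is the probability distribution of a walker placed on node i.
    w_out = adj / adj.sum(axis=1, keepdims=True)
    # Placing walkers uniformly on all nodes (the intervention) yields the
    # average out-weight distribution after one step.
    avg_w = w_out.mean(axis=0)
    mean_row_entropy = float(np.mean([entropy_bits(row) for row in w_out]))
    return entropy_bits(avg_w) - mean_row_entropy

n = 8
ring = np.roll(np.eye(n), 1, axis=1)  # directed cycle: node i -> node i+1
complete = np.ones((n, n))            # every node links to every node

print(network_ei(ring))      # deterministic, non-degenerate: log2(8) bits
print(network_ei(complete))  # maximally noisy: 0 bits
```

The two extreme cases bracket the range: a directed cycle is fully deterministic and non-degenerate, so its effective information reaches the maximum <math>\log_2 N</math>, while a fully connected network with uniform weights carries no effective information.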

The authors compared artificial networks, such as random (ER) networks and the preferential attachment (PA) model, with four types of real networks. For ER networks, the magnitude of effective information depends only on the connection probability <math>p</math>, and as the network size increases it converges to <math>-\log_2p</math>. A key finding is that the EI value has a phase transition point, which appears approximately where the average degree <math>\langle k \rangle</math> of the network equals <math>\log_2N</math>. This corresponds to the point beyond which the random network structure no longer contains more information as its size grows with increasing connection probability. For preferential attachment networks, when the preferential attachment exponent <math>\alpha<1.0</math>, the magnitude of effective information increases with the network size; when <math>\alpha>1.0</math>, the opposite holds; <math>\alpha = 1.0</math> corresponds exactly to the scale-free network, which is the critical boundary between the two regimes. For real networks, the authors found that biological networks have the lowest effective information because they are very noisy. However, this noise can be removed by effective coarse-graining, so biological networks show more significant causal emergence than other types of networks. Technological networks are sparser and non-degenerate, so they have higher average efficiency, more specific node relationships and the highest effective information, but it is difficult to increase their causal emergence measure through coarse-graining.
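
The ER result can be checked numerically with a small sketch. It assumes a directed ER graph without self-loops and the same entropy-difference form of effective information as in [47]; the function names, the network size, the connection probability and the random seed are arbitrary illustrative choices.

```python
import numpy as np

def row_entropies(w):
    """Shannon entropy in bits along the last axis, treating 0*log(0) as 0."""
    log_w = np.log2(w, out=np.zeros_like(w), where=w > 0)
    return -(w * log_w).sum(axis=-1)

def network_ei(adj):
    """EI of the random walk on adj: H(mean out-weights) - mean H(out-weights)."""
    w_out = adj / adj.sum(axis=1, keepdims=True)
    return float(row_entropies(w_out.mean(axis=0)) - row_entropies(w_out).mean())

def er_network_ei(n, p, seed=0):
    """EI of one sampled directed ER graph with n nodes and edge probability p."""
    rng = np.random.default_rng(seed)
    adj = (rng.random((n, n)) < p).astype(float)
    np.fill_diagonal(adj, 0.0)
    # The sketch requires every node to have at least one outgoing edge.
    assert (adj.sum(axis=1) > 0).all()
    return network_ei(adj)

n, p = 500, 0.1
ei = er_network_ei(n, p)
print(ei, -np.log2(p))  # the two values should be close for large n
```

With these settings the average degree (about 50) is well above <math>\log_2 N \approx 9</math>, so the sampled network lies in the regime where EI is expected to be close to <math>-\log_2 p</math>.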

In this article, the authors coarse-grained the network with a greedy algorithm, which is very inefficient for large-scale networks. Subsequently, Griebenow et al. [48] proposed a method based on spectral clustering to identify causal emergence in preferential attachment networks. Compared with the greedy algorithm and the gradient descent algorithm, spectral clustering requires less computation time and finds macroscopic networks with more significant causal emergence.

===Application on biological networks===

Furthermore, Klein et al. extended the method of causal emergence in complex networks to more biological networks. As mentioned earlier, biological networks have more noise, which makes it difficult for us to understand their internal operating principles. This noise comes from the inherent noise of the system on the one hand, and is introduced by measurement or observation on the other hand. Klein et al. [49] further explored the relationship and specific meanings among noise, degeneracy and determinism in biological networks, and drew some interesting conclusions.

For example, high determinism in gene expression networks can be understood as one gene almost certainly leading to the expression of another gene. At the same time, high degeneracy is also widespread in biological systems as a result of evolution. Together, these two factors mean that it is currently unclear at what scale biological systems should be analyzed to best understand their functions. Klein et al. [50] analyzed protein interaction networks of more than 1800 species and found that networks at macroscopic scales have less noise and degeneracy. Moreover, nodes that are grouped into macroscopic scales are more resilient than nodes that are not. Therefore, to meet the requirements of evolution, biological networks need to evolve macroscopic scales that increase certainty, which enhances network resilience and improves the effectiveness of information transmission.

In [51], Hoel et al. further studied causal emergence in biological systems with the help of effective information theory. The authors applied effective information to gene regulatory networks to identify the most informative model of heart development for controlling mammalian heart development. By quantifying causal emergence in the largest connected component of the Saccharomyces cerevisiae gene network, the article shows that informative macroscopic scales are ubiquitous in biology and that life mechanisms themselves often operate on macroscopic scales. The article also provides biologists with a computable tool for identifying the most informative macroscopic scale, on the basis of which complex biological systems can be modeled, predicted, controlled and understood.

In [52], Swain et al. explored the influence of the interaction history of ant colonies on task allocation and task switching, and used effective information to study how noise spreads among ants. They found that the degree of historical interaction between ant groups affects task allocation, and that the type of ant group involved in a specific interaction determines the noise in that interaction. In addition, even when ants switch functional groups, the emergent cohesion of the colony ensures its stability, and different functional groups play different roles in maintaining that cohesion.

===Application on artificial neural networks===

In [53], Marrow et al. introduced effective information into neural networks to quantify and track changes in the causal structure of a network during training. Here, effective information is used to evaluate the degree of causal influence of nodes and edges on the downstream targets of each layer. The effective information EI of each layer of the neural network is defined as:

<math>
I(L_1;L_2 \mid do(L_1=H^{\max}))
</math>

Here, <math>L_1</math> and <math>L_2</math> represent a pair of connected layers of the neural network, serving as the input layer and the output layer respectively. The input layer as a whole is do-intervened to follow a uniform (maximum entropy) distribution, and the mutual information between cause and effect is then calculated. Effective information can be decomposed into sensitivity and degeneracy, where sensitivity is defined as:

<math>
\sum_{i \in L_1,\, j \in L_2}I(t_i;t_j \mid do(i=H^{\max}))
</math>

Here, i and j denote any pair of neurons in the input layer and the output layer, respectively. <math>t_i</math> and <math>t_j</math> denote the states of the neurons in the input and output layers after i is intervened to follow the maximum entropy distribution while the neural network mechanism remains unchanged. That is, when the input neuron i is set to a uniform distribution, the output neuron changes accordingly, and this value measures the mutual information between the two.

Sensitivity should be distinguished from effective information: here each neuron in the input layer is do-intervened separately, and the mutual information of each pair of input and output neurons is summed to give the sensitivity. Degeneracy is obtained from the difference between effective information and sensitivity and is defined as:

<math>
I(L_1;L_2 \mid do(L_1=H^{\max}))-\sum_{i \in L_1,\, j \in L_2}I(t_i;t_j \mid do(i=H^{\max}))
</math>

By tracking effective information during training, including changes in sensitivity and degeneracy, one can assess the generalization ability of the model, which helps researchers better understand and explain how neural networks work.

===Application on the brain nervous system===

The brain nervous system is an emergent multi-scale complex system. Based on integrated information decomposition, Luppi et al. [54] revealed the synergistic workspace of human consciousness. The authors constructed a three-layer architecture of brain cognition consisting of the external environment, specific modules and a synergistic global workspace. The brain works in three stages: the first collects information from multiple different modules into the workspace; the second integrates the collected information in the workspace; the third broadcasts the global information to other parts of the brain. The authors conducted experiments on three sets of resting-state fMRI data: 100 normal subjects, 15 subjects in an anesthesia experiment (covering three states: before anesthesia, during anesthesia and recovery), and 22 subjects with chronic disorders of consciousness (DOC). The article uses integrated information decomposition to obtain synergistic and redundant information, and uses the revised integrated information value <math>\Phi_R</math> to calculate the synergy and redundancy between each pair of brain regions, which determines whether synergy or redundancy plays the greater role in each region. By comparison with data from conscious subjects, they found that the regions where integrated information was significantly reduced in unconscious subjects all belonged to brain regions where synergistic information plays the greater role. They also found that these regions all belonged to functional networks such as the DMN (Default Mode Network), thus locating the brain regions that have a significant effect on the occurrence of consciousness.

===Application in artificial intelligence systems===

The causal emergence theory is also closely connected to the field of artificial intelligence, in two ways. First, the machine learning solution to the causal emergence identification problem is in fact an application of causal representation learning. Second, techniques such as maximizing effective information are expected to be applied to fields such as causal machine learning.

====Causal representation learning====

Causal representation learning is an emerging field in artificial intelligence. It attempts to combine two important fields of machine learning, representation learning and causal inference, and to draw on the advantages of both to automatically extract the important features and causal relationships behind the data [55]. Causal emergence identification based on effective information can be regarded as a causal representation learning task: identifying the emergence of causal relationships from data is equivalent to learning the latent causal relationships and causal mechanisms underlying the data. Specifically, the macroscopic state can be regarded as a causal variable, the macroscopic dynamics as a causal mechanism, the coarse-graining strategy as an encoding process or representation from the original data to the causal variable, and effective information as a measure of the causal effect strength of the mechanism.

Because the two have many similarities, their techniques and concepts can inform each other. For example, causal representation learning techniques can be applied to causal emergence identification. In turn, the learned abstract causal representation can be interpreted as a macroscopic state, which enhances the interpretability of causal representation learning. However, there are also two significant differences: 1) causal representation learning assumes that a real causal mechanism lies behind the data and generates it, whereas there may be no "true causal relationship" between the states and dynamics that emerge at the macroscopic level; 2) the macroscopic state obtained by coarse-graining in causal emergence is a low-dimensional description, whereas causal representation learning imposes no such requirement. From an epistemological perspective, however, there is no difference between the two, because both extract effective information from observational data to obtain representations with stronger causal effects.

To better compare causal representation learning and causal emergence identification tasks, we list the following table:

{| class="wikitable" style="text-align:center;"

|'''Goal'''||Finding the optimal representation of the original data to ensure that an independent causal mechanism can be achieved through the representation||Finding an effective coarse-graining strategy and macroscopic dynamics with strong causal effects

|}

====Application of effective information in causal machine learning====

Causal emergence can enhance the performance of machine learning in out-of-distribution scenarios. The do-intervention introduced in <math>EI</math> captures the causal dependence in the data generation process and suppresses spurious correlations, thus complementing association-based machine learning algorithms and establishing a connection between <math>EI</math> and out-of-distribution (OOD) generalization [56]. Because effective information is universal, causal emergence can be applied to supervised machine learning to evaluate the strength of the causal relationship between the feature space <math>X</math> and the target space <math>Y</math>, thereby improving the prediction accuracy from cause (feature) to effect (target). Note that directly fitting observations from <math>X</math> to <math>Y</math> is sufficient for common prediction tasks under the i.i.d. assumption, that is, when the training data and test data are independently and identically distributed. However, if samples are drawn from outside the training distribution, a representation space that generalizes from training to test environments must be learned. Since causality is generally believed to generalize better than statistical correlation [57], the causal emergence theory can serve as a standard for embedding causal relationships in the representation space. The occurrence of causal emergence reveals the potential causal factors of the target, thereby producing a robust representation space for out-of-distribution generalization. Causal emergence may provide a unified representation measure for out-of-distribution generalization based on causal theory. <math>EI</math> can also be regarded as an information-theoretic abstraction of the reweighting-based debiasing techniques used in out-of-distribution generalization. In addition, we conjecture that out-of-distribution generalization can be achieved while maximizing <math>EI</math>, and that <math>EI</math> may reach its peak at an intermediate stage of abstraction of the original features, which is consistent with the idea behind OOD generalization that less is more. Ideally, when causal emergence occurs at the peak of <math>EI</math>, all non-causal features are excluded and the causal features are revealed, resulting in the most informative representation.
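
The difference between an association measure and <math>EI</math> can be illustrated with a toy discrete example. The sketch assumes a hypothetical two-valued feature <math>X</math> with a deterministic mechanism <math>P(Y|X)</math> and a heavily skewed training distribution; the function names are illustrative only.

```python
import numpy as np

def mutual_information(p_x, p_y_given_x):
    """Mutual information I(X;Y) in bits for input distribution p_x
    and mechanism p_y_given_x (rows indexed by x, columns by y)."""
    p_x = np.asarray(p_x, dtype=float)
    p_y_given_x = np.asarray(p_y_given_x, dtype=float)
    joint = p_x[:, None] * p_y_given_x
    p_y = joint.sum(axis=0)
    independent = p_x[:, None] * p_y[None, :]
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / independent[mask])))

def effective_information(p_y_given_x):
    """EI: mutual information after do-intervening X to a uniform distribution."""
    n_states = np.asarray(p_y_given_x).shape[0]
    return mutual_information(np.full(n_states, 1.0 / n_states), p_y_given_x)

# Hypothetical mechanism: Y copies X exactly.
mechanism = np.array([[1.0, 0.0],
                      [0.0, 1.0]])
# Skewed observational data: X = 1 is almost never seen during training.
observed_p_x = np.array([0.99, 0.01])

mi_observed = mutual_information(observed_p_x, mechanism)
ei = effective_information(mechanism)
print(mi_observed, ei)
```

Under the skewed observational distribution the mutual information is small even though the mechanism is perfectly deterministic, whereas the uniform intervention recovers the full one bit. In this sense <math>EI</math> depends only on the mechanism and not on the distribution of the training data.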

=====Causal model abstraction=====

In complex systems, since microscopic states often have noise, people need to coarse-grain microscopic states to obtain macroscopic states with less noise, so that the causality of macroscopic dynamics is stronger. The same is true for causal models that explain various types of data. Due to the excessive complexity of the original model or limited computing resources, people often need to obtain a more abstract causal model and ensure that the abstract model maintains the causal mechanism of the original model as much as possible. This is the so-called causal model abstraction.

Causal model abstraction is a subfield of artificial intelligence and plays an important role in causal inference and model interpretability in particular. Such abstraction can help us better understand the hidden causal mechanisms in the data and the interactions between variables. It is achieved by optimizing a high-level model so that it simulates the causal effects of a low-level model as closely as possible [58]. If a high-level model can generalize the causal effects of a low-level model, we call it a causal abstraction of the low-level model.

Causal model abstraction also discusses the interaction between causal relationships and model abstraction (which can be regarded as a coarse-graining process) [59]. Causal emergence identification and causal model abstraction therefore have many similarities: the original causal mechanism can be understood as microscopic dynamics, and the abstracted mechanism as macroscopic dynamics. In the Neural Information Squeezer (NIS) framework, researchers place restrictions on coarse-graining strategies and macroscopic dynamics, requiring that the microscopic prediction error of the macroscopic dynamics be small enough to exclude trivial solutions. This requirement resembles causal model abstraction, which aims for an abstracted causal model that is as similar as possible to the original model. However, there are also some differences between the two: 1) causal emergence identification coarse-grains states or data, while causal model abstraction coarse-grains models; 2) causal model abstraction considers confounding factors, which are ignored in the discussion of causal emergence identification.

=====Reinforcement learning based on world models=====

Reinforcement learning based on world models assumes that the reinforcement learning agent contains an internal world model with which it can simulate the dynamics of the environment it faces [60]. The dynamics of the world model can be learned through the interaction between the agent and the environment, which helps the agent plan and make decisions in an uncertain environment. At the same time, in order to represent a complex environment, the world model must be a coarse-grained description of the environment. A typical world model architecture contains an encoder and a decoder.

Reinforcement learning based on world models also has many similarities with causal emergence identification. The world model can be regarded as a macroscopic dynamics, and its states as macroscopic states: compressed states that ignore irrelevant information and capture the most important causal features of the environment, so that the agent can make better decisions. In the planning process, the agent can also use the world model to simulate the dynamics of the real world.

The similarities between the two fields make it possible to borrow ideas and techniques from one for the other. For example, an agent with a world model can interact with a complex system as a whole and obtain emergent causal laws from the interaction, which helps with the task of causal emergence identification. In turn, the technique of maximizing effective information can be used in reinforcement learning to give the world model stronger causal characteristics.

===Other potential applications===

In addition to the above application fields, the causal emergence theory may have great potential value for other important issues. For example, it shows promise in the study of consciousness and in the modern scientific interpretation of Chinese classical philosophy.

====Consciousness research====

First, the proposal of the causal emergence theory is closely related to consciousness science. The core indicator of the theory, effective information, was first proposed by Tononi in integrated information theory, a quantitative theory of consciousness. Erik Hoel later modified it, applied it to Markov chains and proposed the concept of causal emergence. In this sense, effective information is actually a by-product of quantitative consciousness science.

Second, causal emergence, as an important concept in complex systems, also plays an important role in consciousness science. For example, a core question in the field is whether consciousness is a macroscopic or a microscopic phenomenon. So far, there is no direct evidence showing at what scale consciousness occurs. In-depth research on causal emergence, especially combined with experimental data from the brain nervous system, may answer the question of the scale at which consciousness occurs.

Third, causal emergence may shed light on the question of free will. Do people have free will? Are the decisions we make really free choices of our will, or are they just an illusion? If we accept the concept of causal emergence and admit that macroscopic variables have causal power over microscopic variables, then all our decisions are made spontaneously by the brain system, and consciousness is only an explanation of this complex decision-making process at a certain level. On this view, free will is an emergent downward causation. The answers to these questions await further research on the causal emergence theory.

====Chinese classical philosophy====

Different from Western science and philosophy, Chinese classical philosophy retains a complete and different theoretical framework for explaining the universe, including yin and yang, five elements, eight trigrams, as well as divination, feng shui, traditional Chinese medicine, etc., and can give completely independent explanations for various phenomena in the universe. For a long time, the two sets of philosophies in the East and the West have always been difficult to integrate. The idea of causal emergence may provide a new explanation to bridge the conflict between Eastern and Western philosophies.

According to the causal emergence theory, the quality of a theory depends on the strength of causality, that is, the size of <math>EI</math>, and different coarse-graining schemes yield completely different macroscopic theories (macroscopic dynamics). It is quite possible that, when facing the same complex system, the Western philosophical and scientific tradition gives a set of relatively specific, microscopic causal mechanisms (dynamics), while Eastern philosophy gives a set of more coarse-grained, macroscopic causal mechanisms. According to the causal emergence theory, or the Causal Equivalence Principle proposed by Yurchenko, the two may well be compatible. That is, for the same set of phenomena, East and West can each make correct predictions, and even devise interventions, based on two different sets of causal mechanisms. Of course, it is also possible that for certain types of problems or phenomena a more macroscopic causal mechanism is more explanatory or leads to a good solution, while for others a more microscopic causal mechanism is preferable.

For example, taking the concept of five elements in Eastern philosophy, we can completely understand the five elements as five macroscopic states of everything, and the relationship of mutual generation and restraint of the five elements can be understood as a macroscopic causal mechanism between these five macroscopic states. Then, the cognitive process of extracting these five states of the five elements from everything is a coarse-graining process, which depends on the observer's ability to analogize. Therefore, the theory of five elements can be regarded as an abstract causal emergence theory for everything. Similarly, we can also apply the concept of causal emergence to more fields, including traditional Chinese medicine, divination, feng shui, etc. The common point of these applications will be that its causal mechanism is simpler and possibly has stronger causality compared to Western science, but the process of obtaining such an abstract coarse-graining is more complex and more dependent on experienced abstractors. This explains why Eastern philosophies all emphasize the self-cultivation of practitioners. This is because these Eastern philosophical theories put huge complexity and computational burden on '''analogical thinking'''.

==Critique==

Throughout history, there has been a long-standing debate on the ontological and epistemological aspects of causality and emergence.

For example, Yurchenko [61] pointed out that the concept of "causation" is often ambiguous and should be separated into two distinct concepts, "cause" and "reason", which correspond to ontological and epistemological causality respectively. A cause is the real cause that fully brings about the result, whereas a reason is only the observer's explanation of the result. A reason may not be as strict as a real cause, but it does provide a certain degree of predictability. There is a similar debate about the nature of causal emergence.

Is causal emergence a real phenomenon that exists independently of any specific observer? It should be emphasized that, in Hoel's theory, different coarse-graining strategies lead to different macroscopic dynamical mechanisms and to different values of the causal effect measure (<math>EI</math>). In essence, different coarse-graining strategies represent different observers. Hoel's theory links emergence with causality through intervention and introduces the concept of causal emergence quantitatively. To eliminate the dependence on the choice of coarse-graining, it proposes maximizing <math>EI</math>: the coarse-graining scheme that maximizes <math>EI</math> is taken to be the only objective one. Therefore, for a given Markov dynamics, only the coarse-graining strategy that maximizes <math>EI</math>, together with the corresponding macroscopic dynamics, can be considered an objective result. However, when the maximizer of <math>EI</math> is not unique, that is, when several coarse-graining schemes attain the maximum <math>EI</math>, the theory runs into difficulty and a certain degree of subjectivity cannot be avoided.
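
The dependence of <math>EI</math> on the coarse-graining scheme can be illustrated with a minimal Python sketch. It assumes the uniform-intervention form of effective information for a transition probability matrix (the average KL divergence, in bits, between each row and the mean row). The four-state matrix, the two partitions, and the uniform-weight averaging in the helper <code>coarse_grain</code> are illustrative assumptions rather than anything prescribed by Hoel's theory.

```python
from math import log2

def effective_information(tpm):
    """EI in bits of a row-stochastic TPM, assuming a uniform intervention distribution."""
    n = len(tpm)
    # Effect distribution: the average of all rows.
    mean_row = [sum(row[j] for row in tpm) / n for j in range(n)]
    total = 0.0
    for row in tpm:
        # KL divergence between this row and the mean row (terms with p = 0 contribute 0).
        total += sum(p * log2(p / mean_row[j]) for j, p in enumerate(row) if p > 0)
    return total / n

def coarse_grain(tpm, groups):
    """Hypothetical helper: merge micro-states by group, averaging source rows uniformly."""
    return [[sum(tpm[s][t] for s in src for t in dst) / len(src) for dst in groups]
            for src in groups]

# Illustrative micro dynamics: states 0-2 mix uniformly among themselves, state 3 is fixed.
micro = [
    [1/3, 1/3, 1/3, 0.0],
    [1/3, 1/3, 1/3, 0.0],
    [1/3, 1/3, 1/3, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
scheme_a = [[0, 1, 2], [3]]   # groups the three mixing states together
scheme_b = [[0, 1], [2, 3]]   # cuts across the mixing block

ei_micro = effective_information(micro)
ei_a = effective_information(coarse_grain(micro, scheme_a))
ei_b = effective_information(coarse_grain(micro, scheme_b))
print(f"micro: {ei_micro:.3f}  scheme A: {ei_a:.3f}  scheme B: {ei_b:.3f}")
```

With these inputs the microscopic <math>EI</math> is about 0.81 bits, scheme A reaches 1 bit (so causal emergence occurs), and scheme B drops to about 0.08 bits, which shows how strongly the result depends on the chosen partition.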

Dewhurst [62] provides a philosophical clarification of Hoel's theory, arguing that it is epistemological rather than ontological. On this view, Hoel's macroscopic causality is only a causal explanation based on information theory and does not involve "true causality". This also calls into question the assumption of a uniform distribution (see the entry for effective information), since there is no evidence that it is superior to other distributions.

In addition, Hoel's calculation of <math>EI</math> and quantification of causal emergence rest on two prerequisites: (1) known microscopic dynamics and (2) a known coarse-graining scheme. In practice, however, both are rarely available at the same time; in observational studies in particular, both may be unknown. This limitation hinders the practical applicability of Hoel's theory.

It has also been pointed out that Hoel's theory ignores constraints on the coarse-graining method, and that some coarse-graining methods may lead to ambiguity [63]. In addition, some combinations of state coarse-graining and time coarse-graining operations do not commute. For example, let <math>A_{m\times n}</math> be a state coarse-graining operation (combining <math>n</math> states into <math>m</math> states), where the coarse-graining strategy is the one that maximizes the effective information of the macroscopic state transition matrix, and let <math>(\cdot) \times (\cdot)</math> be a time coarse-graining operation (combining two time steps into one). Then <math>A_{m\times n}(TPM_{n\times n})</math> denotes coarse-graining an <math>n\times n</math> TPM, with the coarse-graining process simplified as the product of the matrix <math>A</math> and the matrix <math>TPM</math>.

Then, the commutativity condition of spatial coarse-graining and temporal coarse-graining is the following equation:

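{{NumBlk|:|
<math>
A_{m\times n}(TPM_{n\times n}) \times A_{m\times n}(TPM_{n\times n}) = A_{m\times n}(TPM_{n\times n} \times TPM_{n\times n})
</math>
|{{EquationRef|3}}}}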

The left side represents first coarse-graining the states at two consecutive time steps and then multiplying the resulting TPMs of the two time steps to obtain a transition matrix for two-step evolution; the right side represents first multiplying the TPMs of the two time steps to obtain the two-step evolution of the microscopic state, and then coarse-graining with <math>A</math> to obtain the macroscopic TPM. When this equation fails to hold, some coarse-graining operations produce a discrepancy between the evolution of the macroscopic states and the coarse-grained states of the evolved microscopic system. This means that some kind of consistency constraint must be imposed on the coarse-graining strategy.
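
The following Python sketch checks this kind of commutativity numerically by comparing "coarse-grain, then take two steps" with "take two steps, then coarse-grain". It is only an illustration under stated assumptions: the partition is fixed by hand rather than chosen by maximizing effective information, the helper <code>coarse_grain</code> averages the rows of each group with uniform weights, and both example matrices are invented for the purpose.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def coarse_grain(tpm, groups):
    """Hypothetical state coarse-graining: uniform average over each source group."""
    return [[sum(tpm[s][t] for s in src for t in dst) / len(src) for dst in groups]
            for src in groups]

def commutes(tpm, groups, tol=1e-9):
    """True if state coarse-graining and two-step time coarse-graining commute."""
    macro = coarse_grain(tpm, groups)
    left = matmul(macro, macro)                      # coarse-grain, then two steps
    right = coarse_grain(matmul(tpm, tpm), groups)   # two steps, then coarse-grain
    return all(abs(x - y) <= tol
               for lrow, rrow in zip(left, right)
               for x, y in zip(lrow, rrow))

# States 0-2 mix uniformly, state 3 is fixed: the grouping respects the dynamics.
block_tpm = [
    [1/3, 1/3, 1/3, 0.0],
    [1/3, 1/3, 1/3, 0.0],
    [1/3, 1/3, 1/3, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
# Deterministic cycle 0 -> 1 -> 2 -> 0: merging states 0 and 1 hides which one is occupied.
cyclic_tpm = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]

print(commutes(block_tpm, [[0, 1, 2], [3]]))  # True
print(commutes(cyclic_tpm, [[0, 1], [2]]))    # False
```

In the cyclic example the two sides differ because states 0 and 1 send their probability to different groups, so the macroscopic two-step matrix depends on information that the merge has discarded.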

However, as pointed out in [40], this problem can be alleviated by taking the model's error into account while maximizing <math>EI</math> in a continuous variable space. Although machine learning techniques facilitate the learning of causal relationships and causal mechanisms and the identification of emergent properties, an important question remains: do the results obtained through machine learning reflect ontological causality and emergence, or are they merely an epistemological phenomenon? This is still undecided. The introduction of machine learning does not necessarily settle the debate over ontological versus epistemological causality and emergence, but it can provide a reference point that helps reduce subjectivity, because the machine learning agent can be regarded as an "objective" observer whose judgments about causality and emergence are independent of human observers. The problem of non-unique solutions, however, persists in this approach. Is the result of machine learning ontological or epistemological? The answer is that it is epistemological, with the machine learning algorithm as the epistemic subject. This does not mean that all results of machine learning are meaningless: if the learning subject is well trained and the defined mathematical objective is effectively optimized, the result can also be considered objective, because the algorithm itself is objective and transparent. Combining machine learning methods can help us establish a theoretical framework for observers and study the interaction between observers and the complex systems they observe.

==Related research fields==

Several research fields are closely related to causal emergence theory. Here we focus on its differences from and connections with three of them: reduction of dynamical models, dynamic mode decomposition, and simplification of Markov chains.

===Reduction of dynamical models===

An important aspect of causal emergence is the selection of coarse-graining strategies. When the microscopic model is known, coarse-graining the microscopic state is equivalent to performing '''model reduction''' on the microscopic model. Model reduction is an important subfield of control theory, and Antoulas has written a review of it [64].

Model reduction simplifies a high-dimensional dynamical model of a complex system, reducing its dimension so that the evolution of the original system is described by low-dimensional dynamics. This process is in fact the coarse-graining process studied in causal emergence. There are two main families of approximation methods for large-scale dynamical systems: methods based on singular value decomposition (SVD) [64][65] and Krylov-based methods [64][66][67], the latter relying on moment matching. Although SVD-based methods have many desirable properties, including error bounds, they cannot be applied to systems of high complexity. Krylov-based methods, by contrast, can be implemented iteratively and are therefore suitable for high-dimensional complex systems. Combining the advantages of the two gives rise to a third family, the SVD/Krylov methods [68][69]. All of these methods evaluate the quality of the reduction by an error loss function on the output function before and after coarse-graining. The goal of model reduction is therefore to find the reduced parameter matrix that minimizes this error.

In general, the error loss function on the output function before and after model reduction is used to judge the coarse-graining parameters. This procedure assumes by default that reduction loses information, so minimizing the error is the only criterion for judging the effectiveness of a reduction method. From the perspective of causal emergence, however, effective information can increase as a result of dimensionality reduction. This is the biggest difference between the coarse-graining strategies of causal emergence research and model reduction in control theory. When the dynamical system is stochastic [70], directly computing the loss function is unstable because of the randomness, so the effectiveness of the reduction cannot be measured accurately. Effective information and causal emergence indices defined for stochastic dynamical systems can improve the evaluation to some extent and make control research on stochastic dynamical systems more rigorous.

===Dynamic mode decomposition===

In addition to the reduction of dynamical models, dynamic mode decomposition is also closely related to coarse-graining. The basic idea of dynamic mode decomposition (DMD) [71][72] is to extract the dynamic information of a flow field directly from data and to find a mapping of the data according to flow-field changes at different frequencies. The method transforms nonlinear infinite-dimensional dynamics into finite-dimensional linear dynamics and uses ideas from the Arnoldi method and singular value decomposition for dimensionality reduction. It draws on many key features of time series models such as ARIMA, SARIMA, and seasonal models, and is widely used in fields such as mathematics, physics, and finance [73]. Dynamic mode decomposition orders the system by frequency and extracts its eigenfrequencies, so that the contribution of flow structures at different frequencies to the flow field can be observed. The eigenvalues of the DMD modes can also be used to predict the flow field. Because the algorithm is theoretically rigorous, stable, and simple, it has been applied widely and continuously improved. For example, it has been combined with the SPA test to verify the effectiveness of benchmark points for stock price prediction, and it has been connected with spectral analysis to simulate the vibration modes of the stock market in the circular economy. These applications can effectively collect and analyze data and obtain results.

Dynamic mode decomposition reduces the dimension of the variables, the dynamics, and the observation functions simultaneously by means of a linear transformation [74]. Like the coarse-graining strategy in causal emergence, it is a dimension-reduction method, but its main optimization goal is to minimize error. Although both model reduction and dynamic mode decomposition are very close to model coarse-graining, neither is optimized for effective information. In essence, both accept a certain degree of information loss by default and do not aim to enhance causal effects. In [75], the authors proved that the solution set of error minimization contains the optimal solution set of effective information maximization. Therefore, to optimize causal emergence, one can first minimize the error and then search for the best coarse-graining strategy within the error-minimizing solution set.
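
As a worked illustration of the error-minimizing objective, the standard DMD formulation can be written as follows. The notation is introduced here for illustration and is not taken from [74]: <math>x_t</math> are data snapshots, <math>K</math> is the best-fit linear operator, <math>X^{+}</math> is the Moore-Penrose pseudoinverse, and <math>U_r, \Sigma_r, V_r</math> come from a rank-<math>r</math> truncated singular value decomposition.

:<math>X = [x_1, \dots, x_{T-1}], \qquad X' = [x_2, \dots, x_T], \qquad K = \arg\min_{B} \| X' - B X \|_F = X' X^{+}</math>

:<math>X \approx U_r \Sigma_r V_r^{*}, \qquad \tilde{K} = U_r^{*} X' V_r \Sigma_r^{-1}</math>

The eigenvalues of the reduced operator <math>\tilde{K}</math> approximate the leading eigenvalues of <math>K</math>. The only quantity being optimized is the reconstruction error <math>\| X' - B X \|_F</math>; effective information does not enter the objective.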

===Simplification of Markov chains===

The simplification of Markov chains (also called coarse-graining of Markov chains) is also closely related to causal emergence: the coarse-graining process in causal emergence is essentially a simplification of Markov chains. Model simplification of Markov processes [76] is an important problem in the modeling of state transition systems. It reduces the complexity of a Markov chain by merging multiple states into one.

Simplification matters for three main reasons. First, when studying a very large system we do not track the changes of every microscopic state, so coarse-graining is used to filter out noise and heterogeneity that are not of interest and to extract mesoscale or macroscopic laws from the microscopic scale. Second, some states have very similar transition probabilities and can be regarded as the same kind of state; clustering such states (also called partitioning the state space) yields a new, smaller Markov chain and reduces redundancy in the system's representation. Third, in reinforcement learning with Markov decision processes, coarse-graining the Markov chain reduces the size of the state space and improves training efficiency. In much of the literature, coarse-graining and dimension reduction are treated as equivalent [77].

There are two types of coarse-graining of the state space: hard partitioning and soft partitioning. Soft partitioning can be regarded as breaking up the microscopic states and reconstructing macroscopic states from them, allowing macroscopic states to be superpositions of microscopic states. Hard partitioning is a strict grouping of microscopic states: several microscopic states are assigned to one group, with no overlap or superposition allowed (see coarse-graining of Markov chains).
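
One common way to make the distinction concrete, assumed here purely for illustration, is a membership matrix <code>V</code> whose entry <code>V[i][k]</code> is the weight with which microscopic state <code>i</code> belongs to macroscopic state <code>k</code>. Under this assumption a hard partition has exactly one entry equal to 1 in each row, while a soft partition allows fractional weights that sum to 1. All function names and matrices below are hypothetical.

```python
def is_soft_partition(V, tol=1e-9):
    """Each micro-state has non-negative memberships that sum to 1."""
    return all(all(v >= -tol for v in row) and abs(sum(row) - 1.0) <= tol for row in V)

def is_hard_partition(V, tol=1e-9):
    """A soft partition whose memberships are all 0 or 1 (no overlap)."""
    return is_soft_partition(V, tol) and all(
        abs(v) <= tol or abs(v - 1.0) <= tol for row in V for v in row)

def macro_distribution(p, V):
    """Coarse-grain a micro state vector p into a macro state vector."""
    return [sum(p[i] * V[i][k] for i in range(len(p))) for k in range(len(V[0]))]

hard_V = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # states 0 and 1 -> macro 0, state 2 -> macro 1
soft_V = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]   # state 1 is shared between both macro states
p = [0.2, 0.4, 0.4]

print(is_hard_partition(hard_V), is_hard_partition(soft_V))
print(macro_distribution(p, hard_V), macro_distribution(p, soft_V))
```

Both matrices map a normalized microscopic distribution to a normalized macroscopic one; they differ only in whether a microscopic state may contribute to more than one macroscopic state.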

Coarse-graining a Markov chain must be carried out not only on the state space but also on the transition matrix: the original transition matrix is simplified according to the state grouping to obtain a new, smaller transition matrix. The state vector must be reduced as well. A complete coarse-graining process therefore has to coarse-grain the states, the transition matrix, and the state vector together. This raises a new problem: how should the transition probabilities of the new Markov chain obtained by state grouping be calculated, and can the normalization condition be guaranteed?

In addition to these basic guarantees, we usually also require that the coarse-graining operation commute with the transition matrix. This condition ensures that evolving the coarse-grained state vector for one step under the coarse-grained transition matrix (the macroscopic dynamics) is equivalent to first evolving the state vector for one step under the original transition matrix (the microscopic dynamics) and then coarse-graining. It places requirements on both the state grouping (the coarse-graining of the states) and the coarse-graining of the transition matrix. This commutativity requirement leads to the requirement that the Markov chain be lumpable.

For any hard partition of the states, we can define the concept of lumpability, a measure of how well the states can be clustered. The concept first appeared in Kemeny and Snell's Finite Markov Chains in 1969 [78]. Lumpability is a mathematical condition for judging whether a given hard partition of the microscopic states is reducible with respect to the microscopic state transition matrix. Whatever hard partition is applied to the state space, there is a corresponding coarse-graining scheme for the transition matrix and the probability space [79].
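
A hard partition is lumpable in the Kemeny and Snell sense when all microscopic states in the same group have the same total transition probability into every group. The Python sketch below checks this condition and, for a lumpable partition, confirms that coarse-graining commutes with one step of evolution of a state vector. The three-state matrix, the partitions, and all function names are illustrative assumptions.

```python
def group_probs(tpm, groups, s):
    """Total transition probability from micro-state s into each group."""
    return [sum(tpm[s][t] for t in g) for g in groups]

def is_lumpable(tpm, groups, tol=1e-9):
    """True if every state in a group has the same probabilities into all groups."""
    for g in groups:
        ref = group_probs(tpm, groups, g[0])
        for s in g[1:]:
            cur = group_probs(tpm, groups, s)
            if any(abs(a - b) > tol for a, b in zip(ref, cur)):
                return False
    return True

def lump_tpm(tpm, groups):
    """Macro TPM of a lumpable partition: any representative of each group suffices."""
    return [group_probs(tpm, groups, g[0]) for g in groups]

def lump_vector(p, groups):
    """Coarse-grain a state vector by summing probabilities within each group."""
    return [sum(p[s] for s in g) for g in groups]

def step(p, tpm):
    """One step of evolution of a row state vector: p @ tpm."""
    return [sum(p[i] * tpm[i][j] for i in range(len(p))) for j in range(len(tpm[0]))]

tpm = [
    [0.1, 0.5, 0.4],
    [0.3, 0.3, 0.4],
    [0.2, 0.2, 0.6],
]
good = [[0, 1], [2]]   # states 0 and 1 both send 0.6 / 0.4 to the two groups
bad = [[0], [1, 2]]    # states 1 and 2 send 0.3 and 0.2 to group {0}

print(is_lumpable(tpm, good), is_lumpable(tpm, bad))

p = [0.5, 0.2, 0.3]
macro = lump_tpm(tpm, good)
evolve_then_lump = lump_vector(step(p, tpm), good)
lump_then_evolve = step(lump_vector(p, good), macro)
print(macro, evolve_then_lump, lump_then_evolve)
```

For the lumpable partition each row of the macroscopic matrix sums to 1, and the two orders of operation give the same macroscopic distribution; for a non-lumpable partition no single macroscopic matrix achieves this for every initial state vector.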

Suppose a grouping <math>A=\{A_1, A_2,...,A_r\}</math> of the Markov state space <math>S</math> is given. Here each <math>A_i</math> is a subset of <math>S</math>, and <math>A_i\cap A_j= \Phi</math> for <math>i \neq j</math>, where <math>\Phi</math> denotes the empty set. Let <math>s^{(t)}</math> denote the microscopic state of the system at time <math>t</math>. The microscopic state space is <math>S=\{s_1, s_2,...,s_n\}</math>, and the microscopic states <math>s_i\in S</math> are distinct elements of it. Let the transition probability from microscopic state <math>s_k</math> to <math>s_m</math> be <math>p_{s_k \rightarrow s_m} = p(s^{(t)} = s_m | s^{(t-1)} = s_k)</math>, and the transition probability from microscopic state <math>s_k</math> to macroscopic state <math>A_i</math> be <math>p_{s_k \rightarrow A_i} = p(s^{(t)} \in A_i | s^{(t-1)} = s_k)</math>. Then the necessary and sufficient condition for lumpability is that, for any pair <math>A_i, A_j</math>, the value of <math>p_{s_k \rightarrow A_j}</math> is the same for every state <math>s_k</math> belonging to <math>A_i</math>, that is {{NumBlk|:|

Complexivist Ran (150 edits)

Causal Emergence: revision as of 13:01, 1 November 2024 (Friday)