Causal Emergence

From 集智百科 (Swarma Encyclopedia) - Complex Systems | Artificial Intelligence | Complexity Science | Complex Networks | Self-organization

Causal emergence refers to a specific type of emergence phenomenon in dynamical systems, where stronger causal effects emerge at the system's macroscopic scale. In particular, for a class of Markov dynamical systems, after appropriate coarse-graining of the state space, the resulting macroscopic dynamics demonstrate stronger causal effects than their microscopic counterparts. This is referred to as causal emergence [1][2]. Furthermore, the theory of causal emergence uses measures of causal effects to quantify emergence phenomena in complex systems.


History

Development of related concepts

The theory of causal emergence seeks to answer, from a phenomenological perspective, the question of what emergence is, employing a quantitative, causality-based research methodology. Consequently, the development of causal emergence is closely tied to the refinement of our understanding of concepts such as emergence and causality.

Emergence

Emergence has long been recognized as a key characteristic of complex systems and a central concept in discussions about system complexity and the relationship between macroscopic and microscopic levels [3][4]. Emergence can be understood as the phenomenon where the whole is greater than the sum of its parts: the whole exhibits new characteristics that the individual components do not possess [5]. While scholars have identified emergent phenomena across various fields [4][6], such as the collective flocking behavior of birds [7], the formation of consciousness in the brain, and the emergent capabilities of large language models [8], there is no universally accepted or unified understanding of the phenomenon. Earlier research on emergence has largely remained qualitative. For instance, Bedau et al. [9][10] classified emergence into nominal emergence [11][12], weak emergence [9][13], and strong emergence [14][15].

  • Nominal emergence refers to attributes and patterns that can exist at the macroscopic level but not at the microscopic level. For example, the shape of a circle formed by pixels is an instance of nominal emergence [11][12].
  • Weak emergence refers to macroscopic attributes or processes that arise from complex interactions between individual components. It can also be understood as a characteristic that could, in principle, be simulated by a computer. Owing to the principle of computational irreducibility, even though weakly emergent traits can be simulated, they cannot easily be reduced to microscopic-level attributes. In the case of weak emergence, the causes of pattern formation may stem from both microscopic and macroscopic levels; the causal relationships underlying emergence may therefore coexist with microscopic causal relationships.
  • Strong emergence refers to macroscopic attributes that cannot, in principle, be reduced to microscopic attributes, including the interactions between individuals. Strong emergence also holds that macroscopic attributes exert irreducible causal power on microscopic attributes, i.e., downward causation [14][15].

These early studies reveal a deep and inherent connection between emergence and causality.

Causality and its measurement

Causality refers to the influence of one event on another. Causality differs from correlation: if A causes B, then B occurs when A occurs, and B does not occur when A does not. Only by intervening on event A and then examining the outcome for B can one determine whether there is a causal relationship between A and B.

With the further development of causal inference in recent years, it is now possible to quantify causality within a mathematical framework. The measures of causation considered here describe the causal effect of a dynamical process or a causal mechanism [19][20][21]. Judea Pearl [21] uses probabilistic graphical models to describe causal interactions, employing different models to distinguish and quantify three levels of causality. Here we are mainly concerned with the second rung of the causal ladder: intervening on the input distribution. Quantifying the causal effect between two variables is challenging due to inherent uncertainty and ambiguity, and many independent historical studies have addressed the problem of measuring causal relationships. These measures include Hume's concept of constant conjunction [22], value function-based methods [23], Eells and Suppes' probabilistic measures of causation [24][25], and Judea Pearl's measures of causation [19], among others.

Causal emergence

As mentioned above, emergence and causality are interconnected in several respects. Emergence arises as the causal effect of complex nonlinear interactions among the components of a complex system, and emergent properties in turn exert causal effects on individual elements within the system. Moreover, while it was once common to attribute macroscopic phenomena to microscopic causes, macroscopic emergent patterns often lack identifiable microscopic causes. Finally, although qualitative classifications of emergence exist, they cannot quantitatively characterize its occurrence; causality provides a framework for doing so.


In 2013, Erik Hoel, an American theoretical neurobiologist, introduced causality into the measurement of emergence. He proposed the concept of causal emergence and used effective information (EI for short) to quantify the degree of causality in system dynamics [1][2]. Causal emergence occurs when a system has a stronger causal effect at the macroscopic scale than at the microscopic scale. Causal emergence effectively captures the differences and connections between the macroscopic and microscopic states of a system, and combines two core concepts: causality from statistics and artificial intelligence, and emergence from complex systems. Causal emergence also provides scholars with a quantitative perspective on a series of philosophical questions. For example, top-down causation in living or social systems, i.e., downward causation from the macroscopic to the microscopic level, is often analyzed within the causal emergence framework. Consider the phenomenon of a gecko shedding its tail, which illustrates how a macroscopic system influences its microscopic components: when in danger, the gecko breaks off its tail regardless of the tail's condition. Here the whole is the cause and the tail is the effect, so a causal power points from the whole to the part.

Early work on quantifying emergence

Some early works attempted to analyze emergence quantitatively. The computational mechanics theory proposed by Crutchfield et al. [26] considers causal states; it explores similar ideas based on partitions of the state space and is closely related to Erik Hoel's causal emergence theory. Separately, Seth et al. proposed the G-emergence theory [27], which quantifies emergence using Granger causality.

Computational mechanics

The computational mechanics theory attempts to express the laws of causality in emergence with a quantitative framework, specifically by constructing a coarse-grained causal model from a random process so that this model can generate the time series of the observed random process [26].

Here, the random process can be represented by [math]\displaystyle{ \overleftrightarrow{s} }[/math]. Based on time [math]\displaystyle{ t }[/math], the random process can be divided into two parts: the process before time [math]\displaystyle{ t }[/math], [math]\displaystyle{ \overleftarrow{s_t} }[/math], and the process after time [math]\displaystyle{ t }[/math], [math]\displaystyle{ \overrightarrow{s_t} }[/math]. The computational mechanics framework denotes the set of all possible historical processes [math]\displaystyle{ \overleftarrow{s_t} }[/math] as [math]\displaystyle{ \overleftarrow{S} }[/math], and the set of all future processes [math]\displaystyle{ \overrightarrow{s_t} }[/math] as [math]\displaystyle{ \overrightarrow{S} }[/math].

The goal of computational mechanics is to establish a model that reconstructs and predicts the observed time series with high accuracy. However, the randomness of the sequence prevents perfect reconstruction. Therefore, we need a coarse-grained mapping to capture the ordered structure in the random sequence. This coarse-grained mapping can be characterized by a partitioning function [math]\displaystyle{ \eta: \overleftarrow{S}\rightarrow \mathcal{R} }[/math], which divides [math]\displaystyle{ \overleftarrow{S} }[/math] into several mutually exclusive subsets; the collection of all such subsets is denoted [math]\displaystyle{ \mathcal{R} }[/math].

Computational mechanics regards any subset [math]\displaystyle{ R \in \mathcal{R} }[/math] as a macroscopic state. For a set of macroscopic states [math]\displaystyle{ \mathcal{R} }[/math], computational mechanics uses Shannon entropy to define an index [math]\displaystyle{ C_\mu }[/math] to measure the statistical complexity of these states:

[math]\displaystyle{ C_\mu(\mathcal{R})\triangleq -\sum_{\rho\in \mathcal{R}} P(\mathcal{R}=\rho)\log_2 P(\mathcal{R}=\rho) }[/math]

It can be proved that when a set of states is used to build a predictive model, their statistical complexity is approximately equivalent to the size of the prediction model.
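The statistical complexity above is simply the Shannon entropy of the occupation distribution over macroscopic states. A minimal sketch in Python (the four state probabilities below are hypothetical, chosen only for illustration):

```python
import numpy as np

# Hypothetical occupation probabilities of four macroscopic states in R:
p = np.array([0.5, 0.25, 0.125, 0.125])

# C_mu: Shannon entropy (in bits) of the distribution over macroscopic states
c_mu = float(-(p * np.log2(p)).sum())
print(c_mu)  # 1.75
```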

Furthermore, to balance predictability and simplicity, computational mechanics defines causal equivalence. If [math]\displaystyle{ P\left ( \overrightarrow{s}|\overleftarrow{s}\right )=P\left ( \overrightarrow{s}|{\overleftarrow{s}}'\right ) }[/math], then [math]\displaystyle{ \overleftarrow{s} }[/math] and [math]\displaystyle{ {\overleftarrow{s}}' }[/math] are causally equivalent. This equivalence relation can divide all historical processes into equivalence classes and define them as causal states. All causal states of the historical process [math]\displaystyle{ \overleftarrow{s} }[/math] can be characterized by a map [math]\displaystyle{ \epsilon \left ( \overleftarrow{s} \right ) }[/math]. Here, [math]\displaystyle{ \epsilon: \overleftarrow{\mathcal{S}}\rightarrow 2^{\overleftarrow{\mathcal{S}}} }[/math] is a function that maps the historical process [math]\displaystyle{ \overleftarrow{s} }[/math] to the causal state [math]\displaystyle{ \epsilon(\overleftarrow{s})\in 2^{\overleftarrow{\mathcal{S}}} }[/math].
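The partition into causal states can be sketched directly: histories are grouped together whenever their conditional distributions over futures coincide. The toy predictive distributions below are hypothetical, for illustration only:

```python
from collections import defaultdict

# Hypothetical P(next symbol | history) for length-2 binary histories.
predictive_dist = {
    (0, 0): (0.5, 0.5),
    (0, 1): (0.9, 0.1),
    (1, 0): (0.5, 0.5),   # identical to (0, 0) -> causally equivalent
    (1, 1): (0.9, 0.1),   # identical to (0, 1) -> causally equivalent
}

# Group histories by their predictive distribution: each group is a causal state.
causal_states = defaultdict(list)
for history, dist in predictive_dist.items():
    causal_states[dist].append(history)

for dist, histories in causal_states.items():
    print(dist, histories)
# (0.5, 0.5) [(0, 0), (1, 0)]
# (0.9, 0.1) [(0, 1), (1, 1)]
```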

Further, we can denote the causal transition probability between two causal states [math]\displaystyle{ S_i }[/math] and [math]\displaystyle{ S_j }[/math] as [math]\displaystyle{ T_{ij}^{\left ( s \right )} }[/math], which plays the role of a coarse-grained macroscopic dynamics. The [math]\displaystyle{ \epsilon }[/math]-machine of a random process is defined as the ordered pair [math]\displaystyle{ \left \{ \epsilon,T \right \} }[/math]; it is a pattern discovery machine that achieves prediction by learning the [math]\displaystyle{ \epsilon }[/math] and [math]\displaystyle{ T }[/math] functions. This amounts to defining the problem of identifying emergent causality: the [math]\displaystyle{ \epsilon }[/math]-machine identifies emergent causality in data.

Computational mechanics demonstrates that the causal states derived from the ϵ-machine possess three key characteristics: maximum predictability, minimum statistical complexity, and minimum randomness, which are considered optimal. In addition, the authors introduced a hierarchical machine reconstruction algorithm that can calculate causal states and ϵ-machines from observational data. Although this algorithm may not be applicable to all scenarios, the authors demonstrated its application using examples such as chaotic dynamics, hidden Markov models, and cellular automata, providing numerical results and reconstruction paths [28].

Although the original computational mechanics does not give a clear definition and quantitative theory of emergence, some researchers have further advanced the theory. Shalizi et al. discussed the relationship between computational mechanics and emergence: if a process [math]\displaystyle{ {\overleftarrow{s}}' }[/math] has higher prediction efficiency than a process [math]\displaystyle{ \overleftarrow{s} }[/math], then emergence occurs in [math]\displaystyle{ {\overleftarrow{s}}' }[/math]. The prediction efficiency [math]\displaystyle{ e }[/math] of a process is defined as the ratio of its excess entropy (the mutual information between the states of two successive time steps) to its statistical complexity ([math]\displaystyle{ e=\frac{E}{C_{\mu}} }[/math]), where [math]\displaystyle{ e }[/math] is a real number between 0 and 1 that can be regarded as the fraction of the information about past states stored in the process. There are two cases in which [math]\displaystyle{ C_{\mu}=0 }[/math]: either the process is completely uniform and deterministic, or it is independently and identically distributed. In neither case can any interesting predictions be made, so we set [math]\displaystyle{ e=0 }[/math]. The authors also explained that emergence can be understood as a dynamical process in which a pattern gains the ability to adapt to different environments.

The causal emergence framework has many similarities with computational mechanics. All historical processes [math]\displaystyle{ \overleftarrow{s} }[/math] can be regarded as microscopic states; all [math]\displaystyle{ R \in \mathcal{R} }[/math] correspond to macroscopic states; and the function [math]\displaystyle{ \eta }[/math] can be understood as a possible coarse-graining function. The causal state [math]\displaystyle{ \epsilon \left ( \overleftarrow{s} \right ) }[/math] is a special state that has at least the same predictive power as the microscopic state [math]\displaystyle{ \overleftarrow{s} }[/math], so [math]\displaystyle{ \epsilon }[/math] can be understood as an effective coarse-graining strategy. The causal state transition [math]\displaystyle{ T }[/math] corresponds to an effective macroscopic dynamics. The characteristic of minimum randomness captures the determinism of the macroscopic dynamics and can be measured by EI in causal emergence.

G-emergence

The G-emergence theory was proposed by Seth in 2008 and is one of the earliest studies to quantify emergence from a causal perspective [27]. The basic idea is to use nonlinear Granger causality to quantify weak emergence in complex systems.

Specifically, prediction uses a bivariate autoregressive model involving two variables, A and B. The model comprises two equations, one per variable; the current value of each variable is determined by its own past values and the past values of the other variable within a certain time-lag range. The model also yields residuals. Residuals, or prediction errors, measure the degree of Granger causal effect (G-causality) for each equation. The extent of the Granger causality of B on A is the logarithm of the ratio of two residual variances: one from A's restricted model without B, and the other from the full model including both A and B. The concept of "G-autonomy" measures how well the past values of a time series predict its own future values; the strength of this autonomous predictive causal effect can be characterized analogously to G-causality.
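The residual-variance comparison can be sketched with ordinary least squares. The lag-1 model, coefficients, and synthetic data below are all illustrative assumptions, not taken from Seth's paper:

```python
import numpy as np

def g_causality(a, b):
    """Lag-1 Granger causal effect of B on A: the log ratio of the residual
    variance of A's restricted model (own past only) to that of the full
    model (own past plus B's past)."""
    y = a[1:]
    X_restricted = np.column_stack([np.ones(len(y)), a[:-1]])
    X_full = np.column_stack([np.ones(len(y)), a[:-1], b[:-1]])

    def resid_var(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return (y - X @ beta).var()

    return float(np.log(resid_var(X_restricted) / resid_var(X_full)))

# Synthetic data in which B drives A (coefficients are illustrative):
rng = np.random.default_rng(0)
n = 5000
b = rng.normal(size=n)
a = np.zeros(n)
for t in range(1, n):
    a[t] = 0.5 * a[t - 1] + 0.8 * b[t - 1] + 0.1 * rng.normal()

print(g_causality(a, b) > 0)  # True: B is a G-cause of A
```

Including B's past sharply reduces A's prediction error here, so the log ratio is large and positive.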

Figure: schematic diagram of the G-emergence theory

As shown in the figure, the occurrence of emergence can be judged using these two key concepts of G-causality. The resulting measure of emergence, derived from Granger causality, is called G-emergence. Here A can be understood as a macroscopic variable and B as a microscopic variable. The conditions for emergence are: 1) A is G-autonomous with respect to B; 2) B is a G-cause of A. The degree of G-emergence is calculated by multiplying A's degree of G-autonomy by B's average degree of G-cause.

The G-emergence theory proposed by Seth is the first attempt to use measures of causal effects to quantify emergence phenomena. However, the causality it uses is Granger causality, which is not strict causality. Moreover, the measurement results depend on the regression method used. In addition, the index is defined on variables rather than dynamics, which means the results depend on the choice of variables. These limitations highlight the drawbacks of the G-emergence theory.

The causal emergence framework also has similarities with the aforementioned G-emergence. The macroscopic states of both methods need to be manually selected. In addition, it should be noted that some of the above methods for quantifying emergence often do not consider true interventionist causality.

Other theories for quantitatively characterizing emergence

Several other quantitative theories of emergence have been proposed, of which two approaches are widely discussed. The first views emergence as the transition from disorder to order. Moez Mnif and Christian Müller-Schloer [29] use Shannon entropy to measure order and disorder: in a self-organization process, emergence occurs when order increases, and the increase in order is calculated as the difference in Shannon entropy between the initial and final states. However, this method depends on the observation level and the system's initial conditions, which limits its applicability. To overcome these two difficulties, the authors propose a measurement relative to the maximum entropy distribution. Inspired by the work of Mnif and Müller-Schloer, reference [30] suggests using the divergence between two probability distributions to quantify emergence, understanding emergence as an unexpected or unpredictable change of distribution based on the observed samples. However, this method suffers from high computational complexity and low accuracy. To address these problems, reference [31] proposes an approximate method for estimating the density using Gaussian mixture models and introduces the Mahalanobis distance to characterize the difference between the data and the Gaussian components, obtaining better results. In addition, Holzer, de Meer et al. [32][33] proposed another Shannon-entropy-based emergence measure. They regard a complex system as a self-organizing process in which different individuals interact through communication, and measure emergence as the ratio of the Shannon entropy of all inter-agent communications to the sum of the individual Shannon entropies.
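The entropy-difference idea of Mnif and Müller-Schloer can be sketched in a few lines. The two distributions below are hypothetical stand-ins for the observed initial and final states of a self-organizing system:

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # 0 * log(0) is taken as 0
    return float(-(p * np.log2(p)).sum())

# Hypothetical distributions over 8 observable system states:
initial = np.full(8, 1 / 8)                       # disordered start (max entropy)
final = np.array([0.7, 0.2, 0.1, 0, 0, 0, 0, 0])  # ordered end

# Emergence as the increase in order = decrease in entropy:
emergence = shannon_entropy(initial) - shannon_entropy(final)
print(round(emergence, 2))  # 1.84
```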

Another approach defines emergence as "the whole being greater than the sum of its parts" [34][35]. This approach focuses on interaction rules and the agent states rather than statistical measures of the system as a whole. Specifically, this measure consists of two subtracting terms. The first term represents the system's collective state, while the second term is the sum of its individual components. This measure emphasizes that emergence arises from the interactions and collective behavior of the system.

Causal emergence theory based on effective information

The first relatively comprehensive quantitative theory using causality to define emergence is the causal emergence theory by Erik Hoel, Larissa Albantakis and Giulio Tononi [1][2]. This theory defines causal emergence of Markov chains as the phenomenon that the coarsened Markov chain has a stronger causal effect than the original Markov chain. Here, the strength of causal effect is measured by EI. This indicator is a modification of the mutual information indicator. The key distinction is the use of do-intervention to transform the state variable at time [math]\displaystyle{ t }[/math] into a uniform distribution (or maximum entropy distribution). The EI indicator was proposed by Giulio Tononi as early as 2003 when studying integrated information theory. Building on Giulio Tononi's work, Erik Hoel applied EI to Markov chains, leading to the development of the causal emergence theory.

Causal emergence theory based on information decomposition

Rosas et al. (2020) [36] proposed a method based on information decomposition theory to define and quantify causal emergence from an information-theoretic perspective. Information decomposition is a new approach to analyzing the complex interrelationships among the variables of a complex system: each piece of partial information is represented as an information atom, and partial information is projected onto information atoms via an information lattice diagram, so that both synergistic and redundant information can be represented by their corresponding atoms. The method builds on the partial information decomposition (PID) theory proposed by Williams and Beer [37], which is used to decompose the mutual information between microstates and macrostates. However, the PID framework can only decompose the mutual information between multiple source variables and a single target variable. Rosas extended this framework to the integrated information decomposition [math]\displaystyle{ \Phi ID }[/math] [38] and redefined causal emergence via synergistic information, which can be further decomposed into causal decoupling and downward causation.

Recent works

Barnett et al. [39] proposed the concept of dynamical decoupling, defining emergence through the decoupling of macroscopic and microscopic dynamics as measured by transfer entropy. That is, emergence is characterized by the macroscopic variables being independent of, and having no causal relationship with, the microscopic variables, which can be seen as a causal emergence phenomenon.


Zhang et al. (2024) [40] introduced a new causal emergence theory based on singular value decomposition. The core idea is that the strength of causal effect is equivalent to the approximate dynamical reversibility of the dynamics, so that causal emergence is the emergence of reversible dynamics. The approximate dynamical reversibility of a Markov dynamics, [math]\displaystyle{ \Gamma_{\alpha}\equiv \sum_{i=1}^N\sigma_i^{\alpha} }[/math], is defined as the sum of the [math]\alpha[/math]-th powers of its singular values, where [math]\sigma_i[/math] is the [math]i[/math]-th singular value and [math]\alpha[/math] is a parameter balancing the relative weight of the capacity to preserve information in the forward and backward dynamics. This index is highly correlated with EI and can likewise characterize the causal effect strength of the dynamics. The method defines clear emergence and vague emergence directly from the singular value spectrum of the Markov dynamics, without requiring a coarse-graining operation.
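The reversibility index [math]\Gamma_{\alpha}[/math] is straightforward to compute from the singular values of a transition matrix. A minimal sketch (the two 4-state matrices are illustrative examples, not taken from the paper):

```python
import numpy as np

def gamma(tpm, alpha=1.0):
    """Approximate dynamical reversibility: the sum of the alpha-th
    powers of the singular values of the transition matrix."""
    s = np.linalg.svd(np.asarray(tpm, dtype=float), compute_uv=False)
    return float((s ** alpha).sum())

# A cyclic permutation is deterministic and fully reversible;
# a uniform matrix is maximally irreversible.
permutation = np.eye(4)[[1, 2, 3, 0]]
uniform = np.full((4, 4), 0.25)
print(round(gamma(permutation), 6), round(gamma(uniform), 6))  # 4.0 1.0
```

For the reversible permutation all singular values equal 1, so [math]\Gamma_1 = N = 4[/math]; the rank-1 uniform matrix has a single nonzero singular value of 1, so [math]\Gamma_1 = 1[/math].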

Quantification of causal emergence

This section introduces studies that use measures of causal effects to quantify emergence phenomena.

Theories of Emergence

Defining causal emergence is a key issue. Representative works include the EI-based method of Hoel et al. [1][2] and the information decomposition method of Rosas et al. [36]. Zhang et al. [40] also proposed a new causal emergence theory based on singular value decomposition.

Erik Hoel's theory of causal emergence

Hoel et al. (2013) [1][2] proposed the theory of causal emergence. The following figure gives an abstract representation of the whole framework: the horizontal axis represents time and the vertical axis represents scale. The framework describes the same dynamical system at both the microscopic and the macroscopic scale, where [math]f_m[/math] denotes the microscopic dynamics, [math]f_M[/math] the macroscopic dynamics, and the two are connected by a coarse-graining function [math]\phi[/math]. In a discrete-state Markov dynamical system, both [math]f_m[/math] and [math]f_M[/math] are transition probability matrices (TPMs) of Markov chains, and [math]f_M[/math] is obtained by coarse-graining [math]f_m[/math]. [math]\displaystyle{ EI }[/math] denotes effective information. Since the microscopic state may carry greater randomness, leading to a relatively weak causal effect in the microscopic dynamics, a reasonable coarse-graining of the microscopic state at each moment can yield a macroscopic state with stronger causality. Causal emergence refers to the phenomenon in which coarse-graining the microscopic state yields a macroscopic dynamics with higher EI. The difference in EI between the macroscopic and the microscopic dynamics is defined as the strength of causal emergence.

Figure: schematic diagram of causal emergence

Effective information

Tononi et al. first introduced effective information ([math]\displaystyle{ EI }[/math]) in their study of integrated information theory [41]. In the study of causal emergence, Erik Hoel and colleagues use this causal effect measure to quantify the strength of the causal effect of a causal mechanism.

Specifically, [math]\displaystyle{ EI }[/math] is calculated by intervening on the cause variable and measuring the mutual information between the cause and effect variables. This mutual information represents EI, quantifying the causal effect of the mechanism.

In a Markov chain, the state variable [math]X_{t}[/math] represents the cause and [math]X_{t + 1}[/math] the effect, so the state transition matrix of the Markov chain represents its causal mechanism. The [math]\displaystyle{ EI }[/math] of a Markov chain is therefore calculated as follows:

[math]\displaystyle{ \begin{aligned} EI(f) \equiv& I(X_t,X_{t+1}|do(X_t)\sim U(\mathcal{X}))\equiv I(\tilde{X}_t,\tilde{X}_{t+1}) \\ &= \frac{1}{N}\sum^N_{i=1}\sum^N_{j=1}p_{ij}\log\frac{N\cdot p_{ij}}{\sum_{k=1}^N p_{kj}} \end{aligned} }[/math]

Here [math]\displaystyle{ f }[/math] is the state transition matrix of the Markov chain and [math]U(\mathcal{X})[/math] is the uniform distribution over the value space [math]\mathcal{X}[/math] of the state variable [math]X_{t}[/math]. [math]\tilde{X}_{t}[/math] and [math]\tilde{X}_{t+1}[/math] are the states at two consecutive moments after [math]X_{t}[/math] has been intervened on at time [math]\displaystyle{ t }[/math] to follow a uniform distribution, and [math]\displaystyle{ p_{ij} }[/math] denotes the transition probability from state [math]\displaystyle{ i }[/math] to state [math]\displaystyle{ j }[/math]. The formula shows that [math]\displaystyle{ EI }[/math] depends solely on the probability transition matrix [math]f[/math]. The intervention ensures that EI measures the causal effect of the dynamics objectively, avoiding any influence from the original data distribution.
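Since the formula depends only on the transition matrix, EI can be computed in a few lines. A minimal sketch (the two 2-state matrices at the end are hypothetical illustrations, not examples from the text):

```python
import numpy as np

def effective_information(tpm):
    """EI of a Markov transition matrix: the mutual information between
    X_t and X_{t+1} after intervening do(X_t) ~ uniform distribution."""
    tpm = np.asarray(tpm, dtype=float)
    n = tpm.shape[0]
    # Distribution of X_{t+1} under the uniform intervention = column means.
    effect = tpm.mean(axis=0)
    ei = 0.0
    for i in range(n):
        for j in range(n):
            if tpm[i, j] > 0:
                ei += tpm[i, j] * np.log2(tpm[i, j] / effect[j]) / n
    return ei

identity = np.eye(2)          # deterministic and non-degenerate
noisy = np.full((2, 2), 0.5)  # completely random
print(effective_information(identity))  # 1.0
print(effective_information(noisy))     # 0.0
```

A fully deterministic, non-degenerate mechanism attains the maximum EI of [math]\log_2 N[/math] bits, while a completely random one has zero EI.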

EI can be decomposed into two parts: determinism and degeneracy. It is also possible to eliminate the influence of the size of the state space by introducing normalization. For more detailed information about EI, please refer to the entry: Effective Information.

The effective coefficient (or the effectiveness) is introduced to eliminate the influence of the number of states on EI. The formula is as follows:

[math]\displaystyle{ \begin{aligned} Eff(f) = \frac{EI}{\log_2 N} \end{aligned} }[/math]

where [math]\displaystyle{ N }[/math] represents the number of states.

The degree of causal emergence

The degree of causal emergence can be defined by comparing the EI of macroscopic and microscopic dynamics in the system:

[math]\displaystyle{ CE = EI\left ( f_M \right ) - EI\left (f_m \right ) }[/math]

Here [math]\displaystyle{ CE }[/math] is the degree of causal emergence. Macroscopic dynamics exhibit causal emergence if their EI exceeds that of microscopic dynamics, i.e., [math]\displaystyle{ CE\gt 0 }[/math].

Furthermore, [math]\displaystyle{ CE }[/math] can be decomposed into the sum of two terms, [math]\displaystyle{ CE = \Delta I_{Eff} + \Delta I_{Size} }[/math], where [math]\displaystyle{ \Delta I_{Eff} = (Eff(f_M) - Eff(f_m))\cdot \log_2(M) }[/math] is the causal emergence induced by the increase in effectiveness, and [math]\displaystyle{ \Delta I_{Size} = Eff(f_m)\cdot(\log_2(M) - \log_2(m)) }[/math] is the causal emergence induced by the shrinking of the state space. Here [math]\displaystyle{ M }[/math] and [math]\displaystyle{ m }[/math] are the sizes of the macroscopic and microscopic state spaces, respectively. Because coarse-graining reduces the number of states ([math]\displaystyle{ M \lt m }[/math]), [math]\displaystyle{ \Delta I_{Size} }[/math] is necessarily negative, and causal emergence occurs only if the gain in [math]\displaystyle{ \Delta I_{Eff} }[/math] exceeds the loss from [math]\displaystyle{ \Delta I_{Size} }[/math].

An Example of Markov chain

In the literature [1], Hoel gives an example of the state transition matrix ([math]f_m[/math]) of a Markov chain with 8 states, as shown in the left figure below. The first 7 states transition among one another with equal probability, while the last state is independent and can only transition to itself.

First, we can use a coarse-graining strategy that merges the first 7 states into one macroscopic state, tentatively called A. We sum the probability values in the first 7 rows and first 7 columns of [math]f_m[/math] and divide by 7, obtaining the transition probability from macroscopic state A to itself, while leaving the other entries of the [math]f_m[/math] matrix unchanged. The resulting coarse-grained TPM, shown in the right figure, is denoted [math]f_M[/math]. This deterministic macroscopic Markov transition matrix ensures that the future state is entirely determined by the current state. In this case [math]\displaystyle{ EI(f_M)\gt EI(f_m) }[/math], and causal emergence occurs in the system.
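The numbers in this example can be checked directly against the EI formula from the "Effective information" section. A self-contained sketch (the ~0.54-bit microscopic value is what the formula yields for this matrix, not a figure quoted from the paper):

```python
import numpy as np

def ei(tpm):
    """Effective information of a transition probability matrix."""
    tpm = np.asarray(tpm, dtype=float)
    n = tpm.shape[0]
    effect = tpm.mean(axis=0)  # X_{t+1} distribution under uniform intervention
    return sum(tpm[i, j] * np.log2(tpm[i, j] / effect[j]) / n
               for i in range(n) for j in range(n) if tpm[i, j] > 0)

# Microscopic TPM: the first 7 states transition among one another with
# equal probability; the 8th state only transitions to itself.
f_m = np.zeros((8, 8))
f_m[:7, :7] = 1 / 7
f_m[7, 7] = 1.0

# Macroscopic TPM after merging the first 7 states into one state A:
f_M = np.eye(2)

print(round(ei(f_m), 2))  # 0.54
print(ei(f_M))            # 1.0 -> EI(f_M) > EI(f_m): causal emergence
```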


However, for more general Markov chains and more general state groupings, this simple probability-averaging operation is not always feasible: the merged transition matrix may fail to satisfy the conditions of a Markov chain, such as row normalization or entries lying within [0, 1]. Refer to the "Reduction of Markov Chains" section or the "Coarse-graining of Markov Chains" entry for details on obtaining feasible macroscopic Markov chains.

An Example of Boolean network

Another instance of causal emergence in the literature [1] involves a Boolean network. The figure illustrates a Boolean network with 4 nodes, each of which can be in one of two states, 0 or 1. Each node is connected to two other nodes and operates according to the same microscopic node mechanism (figure a). The system therefore has sixteen microscopic states in total, and its dynamics can be represented by a TPM (figure c).

The coarse-graining operation of this system is divided into two steps. The first step is to cluster the nodes in the Boolean network: as shown in figure b below, nodes A and B are merged to form the macroscopic node [math]\alpha[/math], and nodes C and D are merged to form the macroscopic node [math]\beta[/math]. The second step is to map the microscopic node states in each group to the merged macroscopic node states. This mapping function is shown in figure d below: all microscopic node states containing a 0 are mapped to the "off" state of the macroscopic node, while the microscopic state 11 is mapped to the "on" state of the macroscopic node. In this way, we obtain a new macroscopic Boolean network and derive its node mechanism from the mechanism of the microscopic nodes. From this mechanism, the state transition matrix of the macroscopic network can be obtained (as shown in figure e).

A comparison reveals that the EI of macroscopic dynamics exceeds that of microscopic dynamics [math]\displaystyle{ EI(f_M\ )\gt EI(f_m\ ) }[/math]. This demonstrates the occurrence of causal emergence in this system.

Figure: causal emergence in a Boolean network with 4 nodes

Causal emergence in continuous variables

Furthermore, in the paper [42], Hoel et al. proposed the theoretical framework of causal geometry, aiming to generalize causal emergence theory to function mappings and dynamical systems with continuous states. The paper defines [math]\displaystyle{ EI }[/math] for random function mappings, introduces the concepts of intervention noise and causal geometry, and compares the latter with information geometry. Liu et al. [43] further developed an exact analytical framework for causal emergence in random iterative dynamical systems.

Rosas's causal emergence theory

Rosas et al. [36] proposed, from the perspective of information decomposition theory, a method for defining causal emergence based on integrated information decomposition, and further divided causal emergence into two parts: causal decoupling and downward causation. Causal decoupling is the causal effect of the macroscopic state at the current moment on the macroscopic state at the next moment, while downward causation is the causal effect of the macroscopic state at the current moment on the microscopic state at the next moment. Schematic diagrams of causal decoupling and downward causation are shown in the figure below. The microscopic state is [math]\displaystyle{ X_t\ (X_t^1,X_t^2,\ldots,X_t^n ) }[/math], while the macroscopic state, represented as [math]\displaystyle{ V_t }[/math], is derived through the coarse-graining of the microscopic state variable [math]\displaystyle{ X_t }[/math]; consequently, it is a supervenient feature of [math]\displaystyle{ X_t }[/math]. Additionally, [math]\displaystyle{ X_{t + 1} }[/math] and [math]\displaystyle{ V_{t + 1} }[/math] denote the microscopic and macroscopic states at the next moment, respectively.

Partial information decomposition

This method is based on the nonnegative multivariate information decomposition proposed by Williams and Beer [37]. Rosas's paper uses partial information decomposition (PID) to decompose the mutual information between the microstates and the macrostate.

Without loss of generality, assume that the microstate is [math]\displaystyle{ X(X^1,X^2) }[/math], a two-dimensional variable, and that the macrostate is [math]\displaystyle{ V }[/math]. The mutual information between them can then be decomposed into four parts:

[math]\displaystyle{ I(X^1,X^2;V)=Red(X^1,X^2;V)+Un(X^1;V│X^2)+Un(X^2;V│X^1)+Syn(X^1,X^2;V) }[/math]

Here, [math]\displaystyle{ Red(X^1,X^2;V) }[/math] is the redundant information, i.e., the information provided repeatedly by both microstates [math]\displaystyle{ X^1 }[/math] and [math]\displaystyle{ X^2 }[/math] about the macrostate [math]\displaystyle{ V }[/math]; [math]\displaystyle{ Un(X^1;V│X^2) }[/math] and [math]\displaystyle{ Un(X^2;V│X^1) }[/math] are the unique information, i.e., the information that each microstate variable alone provides about the macrostate; and [math]\displaystyle{ Syn(X^1,X^2;V) }[/math] is the synergistic information, i.e., the information that all microstates [math]\displaystyle{ X }[/math] jointly provide about the macrostate [math]\displaystyle{ V }[/math].
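As a concrete illustration of this decomposition, the sketch below computes the four PID terms for an assumed XOR toy example ([math]V = X^1 \oplus X^2[/math] with uniform inputs), using the Williams–Beer [math]I_{min}[/math] measure of redundancy; for XOR, all the information should turn out to be synergistic:

```python
import numpy as np
from itertools import product

# Joint distribution p(x1, x2, v) for XOR: v = x1 ^ x2 with uniform inputs
# (an assumed toy example; pure synergy is the expected outcome).
p = {(x1, x2, x1 ^ x2): 0.25 for x1, x2 in product([0, 1], repeat=2)}

def marg(idxs):
    """Marginal distribution over the given coordinate indices."""
    out = {}
    for k, pr in p.items():
        key = tuple(k[i] for i in idxs)
        out[key] = out.get(key, 0.0) + pr
    return out

def specific_info(v, i):
    """Williams-Beer specific information I(V=v; X^i)."""
    pv, ps, psv = marg((2,)), marg((i,)), marg((i, 2))
    return sum((psv.get((s, v), 0) / pv[(v,)]) *
               np.log2(psv[(s, v)] / (ps[(s,)] * pv[(v,)]))
               for (s,) in ps if psv.get((s, v), 0) > 0)

# Redundancy: expected minimum specific information over the two sources.
red = sum(pv * min(specific_info(v, 0), specific_info(v, 1))
          for (v,), pv in marg((2,)).items())
# Unique information: I(X^i; V) minus redundancy.
mi = lambda i: sum(pv * specific_info(v, i) for (v,), pv in marg((2,)).items())
un1, un2 = mi(0) - red, mi(1) - red
# Joint mutual information I(X^1, X^2; V); synergy is the remainder.
pv, pxx = marg((2,)), marg((0, 1))
mi_joint = sum(pr * np.log2(pr / (pxx[k[:2]] * pv[(k[2],)])) for k, pr in p.items())
syn = mi_joint - red - un1 - un2
print(red, un1, un2, syn)  # for XOR the full 1 bit is synergistic
```

Note that [math]I_{min}[/math] is only one of several proposed redundancy measures; other choices can give different values on less extreme examples.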

Definition of causal emergence

However, the PID framework can only decompose the mutual information between multiple source variables and a single target variable. Rosas extended this framework and proposed the integrated information decomposition method [math]\displaystyle{ \Phi ID }[/math] [38] to handle the mutual information between multiple source variables and multiple target variables; it can also be used to decompose the mutual information between different moments. Based on the decomposed information, the authors proposed two methods for defining causal emergence:

1) When the unique information [math]\displaystyle{ Un(V_t;X_{t+1}| X_t^1,\ldots,X_t^n\ )\gt 0 }[/math], it means that the macroscopic state [math]\displaystyle{ V_{t} }[/math] at the current moment contains more predictive information about the overall system [math]\displaystyle{ X_{t + 1} }[/math] at the next moment than the microscopic state [math]\displaystyle{ X_t }[/math] at the current moment. In this scenario, there exists causal emergence in the system;

2) The second method bypasses the selection of a specific macroscopic state [math]\displaystyle{ V_t }[/math]. It defines causal emergence solely through the synergistic information between the microscopic state [math]\displaystyle{ X_t }[/math] and the microscopic state [math]\displaystyle{ X_{t + 1} }[/math] at the next moment. When the synergistic information [math]\displaystyle{ Syn(X_t^1,\ldots,X_t^n;X_{t + 1}^1,\ldots,X_{t + 1}^n)\gt 0 }[/math], causal emergence occurs in the system.

It should be noted that the accuracy of the first criterion depends on the choice of the macroscopic state [math]\displaystyle{ V_t }[/math]. The first method gives a lower bound of the second, because [math]\displaystyle{ Syn(X_t;X_{t+1}\ ) \geq Un(V_t;X_{t+1}| X_t\ ) }[/math] always holds. Thus, if [math]\displaystyle{ Un(V_t;X_{t + 1}|X_t) }[/math] is greater than 0, causal emergence occurs in the system. However, selecting [math]V_t[/math] requires a pre-defined coarse-graining function, so the limitation of Erik Hoel's causal emergence theory cannot be avoided.

A natural alternative is to use the second method, judging causal emergence via synergistic information. However, computing synergistic information is very difficult and suffers from a combinatorial explosion, so this calculation is often infeasible in practice. In short, both quantitative characterizations of causal emergence have weaknesses, and a more reasonable quantification method is still needed.

Specific example

The authors of the paper [36] give a specific example to demonstrate the conditions under which causal decoupling, downward causation and causal emergence occur. The example is a special Markov process, shown in the following figure:


Here, [math]\displaystyle{ p_{X_{t + 1}|X_t}(x_{t + 1}|x_t) }[/math] is the microscopic dynamics, and [math]\displaystyle{ X_t=(x_t^1,\ldots,x_t^n)\in\{0,1\}^n }[/math] is the microscopic state. The process assigns the probability of each value of the next state [math]\displaystyle{ x_{t + 1} }[/math] by checking the values of [math]\displaystyle{ x_t }[/math] and [math]\displaystyle{ x_{t + 1} }[/math] at two consecutive moments. First, it checks whether the sum modulo 2 of all dimensions of [math]\displaystyle{ x_t }[/math] equals the first dimension of [math]\displaystyle{ x_{t + 1} }[/math]: if not, the probability is 0. Otherwise, it checks whether [math]\displaystyle{ x_t }[/math] and [math]\displaystyle{ x_{t + 1} }[/math] have the same sum modulo 2 over all dimensions: if so, the probability is [math]\displaystyle{ \gamma/2^{n - 2} }[/math]; otherwise it is [math]\displaystyle{ (1-\gamma)/2^{n - 2} }[/math]. Here [math]\displaystyle{ \gamma }[/math] is a parameter and [math]\displaystyle{ n }[/math] is the total dimension of [math]x[/math].

In fact, if [math]\displaystyle{ \sum_{j = 1}^n x^j_t }[/math] is even (including 0), then [math]\displaystyle{ \oplus^n_{j = 1} x^j_t:=1 }[/math]; otherwise [math]\displaystyle{ \oplus^n_{j = 1} x^j_t:=0 }[/math]. Thus [math]\displaystyle{ \oplus^n_{j = 1} x^j_t }[/math] encodes the parity of the entire sequence [math]X[/math], and the first dimension can be regarded as a parity-check bit. [math]\displaystyle{ \gamma }[/math] is effectively the probability that a mutation occurs in two bits of the sequence [math]X[/math]; such a two-bit mutation preserves the parity of the whole sequence, so the parity-check bit still agrees with the actual parity of the entire sequence.

Therefore, the macroscopic state of this process can be taken to be the overall parity of the sequence, whose probability distribution follows from the XOR calculation over the microscopic state. [math]\displaystyle{ x_{t + 1}^1 }[/math] is a special microscopic variable that always stays consistent with the macroscopic state of the sequence at the previous moment. When only the first condition of the second check is satisfied, downward causation occurs in the system; when only the second is satisfied, causal decoupling occurs; when both are satisfied simultaneously, causal emergence is said to occur in the system.
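The transition rule above can be sketched as code. This is an illustrative construction (assuming the matching condition is that the first bit of [math]x_{t+1}[/math] equals the parity of [math]x_t[/math]); every row of the resulting matrix should sum to 1, and the macro-level parity should be preserved with probability [math]\gamma[/math]:

```python
import numpy as np
from itertools import product

def build_tpm(n=4, gamma=0.8):
    """TPM of the parity-preserving process described above (illustrative)."""
    states = list(product([0, 1], repeat=n))
    tpm = np.zeros((2 ** n, 2 ** n))
    for i, x in enumerate(states):
        for j, y in enumerate(states):
            if y[0] != sum(x) % 2:
                continue                      # parity-check bit must match x_t's parity
            if sum(y) % 2 == sum(x) % 2:      # overall parity preserved
                tpm[i, j] = gamma / 2 ** (n - 2)
            else:
                tpm[i, j] = (1 - gamma) / 2 ** (n - 2)
    return tpm, states

tpm, states = build_tpm()
par = [sum(s) % 2 for s in states]
# Macro-level check: the parity is preserved with probability gamma from every state.
same_parity = [sum(tpm[i, j] for j in range(len(states)) if par[j] == par[i])
               for i in range(len(states))]
print(np.allclose(tpm.sum(axis=1), 1.0), np.allclose(same_parity, 0.8))  # True True
```

The normalization works out because, among the [math]2^{n-1}[/math] successors with a matching parity-check bit, exactly [math]2^{n-2}[/math] preserve the overall parity and [math]2^{n-2}[/math] flip it.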

Causal emergence theory based on singular value decomposition

Erik Hoel's causal emergence theory requires a coarse-graining strategy to be specified in advance, and Rosas's information decomposition theory does not completely resolve this problem. Therefore, Zhang et al. [40] further proposed a causal emergence theory based on singular value decomposition.

Singular value decomposition of Markov chain

Given the Markov transition matrix [math]\displaystyle{ P }[/math] of a system, we can perform singular value decomposition to obtain two orthonormal matrices [math]\displaystyle{ U }[/math] and [math]\displaystyle{ V }[/math] and a diagonal matrix [math]\displaystyle{ \Sigma }[/math]: [math]\displaystyle{ P = U\Sigma V^T }[/math]. Here, [math]\Sigma = \mathrm{diag}(\sigma_1,\sigma_2,\cdots,\sigma_N)[/math], where [math]\sigma_1\geq\sigma_2\geq\cdots\geq\sigma_N[/math] are the singular values of [math]\displaystyle{ P }[/math] arranged in descending order, and [math]\displaystyle{ N }[/math] is the number of states of [math]\displaystyle{ P }[/math].

Approximate dynamical reversibility and effective information

We can define the sum of the [math]\displaystyle{ \alpha }[/math] powers of the singular values (also known as the [math]\alpha[/math]-order Schatten norm of the matrix) as a measure of the approximate dynamical reversibility of the Markov chain, that is:

[math]\displaystyle{ \Gamma_{\alpha}\equiv \sum_{i = 1}^N\sigma_i^{\alpha} }[/math]

Here, [math]\alpha\in(0,2)[/math] is a specified parameter that weights [math]\Gamma_{\alpha}[/math] toward reflecting either determinism or degeneracy. Under normal circumstances we take [math]\alpha = 1[/math], which lets [math]\Gamma_{\alpha}[/math] strike a balance between determinism and degeneracy.

In addition, the authors prove that there is an approximate relationship between [math]\displaystyle{ EI }[/math] and [math]\Gamma_{\alpha}[/math]:

[math]\displaystyle{ EI\sim \log\Gamma_{\alpha} }[/math]

Moreover, to a certain extent, [math]\Gamma_{\alpha}[/math] can be used instead of [math]\displaystyle{ EI }[/math] to measure the degree of causal effect of Markov chains. Therefore, the so-called causal emergence can also be understood as an emergence of dynamical reversibility.

Quantification of causal emergence without coarse-graining

However, the greatest value of this theory is that emergence can be quantified directly, without a coarse-graining strategy. If the rank of [math]\displaystyle{ P }[/math] is [math]\displaystyle{ r }[/math], that is, all singular values from the [math]\displaystyle{ (r + 1) }[/math]-th onward are 0, then we say that the dynamics [math]\displaystyle{ P }[/math] exhibits clear causal emergence, with magnitude:

[math]\displaystyle{ \Delta \Gamma_{\alpha} = \Gamma_{\alpha}(1/r - 1/N) }[/math]

If the matrix [math]\displaystyle{ P }[/math] is full rank, but for a given small number [math]\displaystyle{ \epsilon }[/math] there exists [math]\displaystyle{ r_{\epsilon} }[/math] such that all singular values from the [math]\displaystyle{ (r_{\epsilon}+1) }[/math]-th onward are less than [math]\displaystyle{ \epsilon }[/math], then the system is said to exhibit a degree of vague causal emergence, with magnitude:

[math]\displaystyle{ \Delta \Gamma_{\alpha}(\epsilon) = \frac{\sum_{i = 1}^{r} \sigma_{i}^{\alpha}}{r} - \frac{\sum_{i = 1}^{N} \sigma_{i}^{\alpha}}{N} }[/math]

In summary, the advantage of this method for quantifying causal emergence is that it is more objective and does not rely on a specific coarse-graining strategy. Its disadvantage is that calculating [math]\Gamma_{\alpha}[/math] requires performing SVD on [math]\displaystyle{ P }[/math] in advance, with computational complexity [math]O(N^3)[/math], higher than that of computing [math]\displaystyle{ EI }[/math]. Moreover, [math]\Gamma_{\alpha}[/math] cannot be explicitly decomposed into determinism and degeneracy components.
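These quantities can be sketched in a few lines. The example below uses an assumed rank-deficient 4-state TPM (two pairs of states with identical deterministic rows), so the clear causal emergence formula applies:

```python
import numpy as np

def gamma_alpha(P, alpha=1.0):
    """Approximate dynamical reversibility: sum of singular values to the alpha power."""
    return (np.linalg.svd(P, compute_uv=False) ** alpha).sum()

# Illustrative 4-state TPM of rank 2: states {0,1} both jump to 0, states {2,3} to 2.
P = np.array([[1., 0., 0., 0.],
              [1., 0., 0., 0.],
              [0., 0., 1., 0.],
              [0., 0., 1., 0.]])
N = P.shape[0]
r = np.linalg.matrix_rank(P)           # r = 2
gamma = gamma_alpha(P)                 # singular values: sqrt(2), sqrt(2), 0, 0
delta_gamma = gamma * (1 / r - 1 / N)  # clear causal emergence magnitude
print(gamma, r, delta_gamma)
```

Merging each pair of identical rows into one macro state would yield a deterministic, non-degenerate 2-state dynamics, which is exactly what a positive [math]\Delta\Gamma_{\alpha}[/math] signals.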

Specific example

Comparison of [math]\displaystyle{ EI }[/math] and [math]\displaystyle{ \Gamma }[/math]

The authors give four specific examples of Markov chains, whose state transition matrices are shown in the figure. We can compare the [math]\displaystyle{ EI }[/math] and the approximate dynamical reversibility ([math]\displaystyle{ \Gamma }[/math] in the figure, i.e., [math]\displaystyle{ \Gamma_{\alpha = 1} }[/math]) of these Markov chains. Comparing figures a and b, we find that across different state transition matrices, when [math]\displaystyle{ EI }[/math] decreases, [math]\displaystyle{ \Gamma }[/math] decreases as well. Figures c and d compare the dynamics before and after coarse-graining: in figure d, the state transition matrix of figure c is coarse-grained by merging its first three states into one macroscopic state. Since the macroscopic transition matrix in figure d is deterministic, the normalized [math]\displaystyle{ EI }[/math], [math]\displaystyle{ eff\equiv EI/\log N }[/math], and the normalized [math]\Gamma[/math], [math]\displaystyle{ \gamma\equiv \Gamma/N }[/math], each reach the maximum value of 1.

Dynamic independence

Dynamic independence characterizes a coarse-grained macroscopic state whose dynamics are independent of the underlying microscopic dynamics [39]. The core idea is that although macroscopic variables are composed of microscopic variables, predicting the future of the macroscopic variables requires only their own history; no additional information from the microscopic history is needed. The author calls this phenomenon dynamic independence; it is another means of quantifying emergence, and the macroscopic dynamics in this case are called emergent dynamics. The independence and causal dependence in this framework can be quantified by transfer entropy.

Quantification of dynamic independence

Transfer entropy is a non-parametric statistic that measures the amount of directed (time-asymmetric) information transfer between two stochastic processes. The transfer entropy from process [math]\displaystyle{ X }[/math] to another process [math]\displaystyle{ Y }[/math] can be defined as the degree to which knowing the past values of [math]\displaystyle{ X }[/math] can reduce the uncertainty about the future value of [math]\displaystyle{ Y }[/math] given the past values of [math]\displaystyle{ Y }[/math]. The formula is as follows:

[math]\displaystyle{ T_t(X \to Y) = I(Y_t : X^-_t | Y^-_t) = H(Y_t | Y^-_t) - H(Y_t | Y^-_t, X^-_t) }[/math]

Here, [math]\displaystyle{ Y_t }[/math] is the macroscopic variable at time [math]\displaystyle{ t }[/math], and [math]\displaystyle{ X^-_t }[/math] and [math]\displaystyle{ Y^-_t }[/math] are the microscopic and macroscopic variables before time [math]\displaystyle{ t }[/math], respectively. [math]I[/math] is mutual information and [math]H[/math] is Shannon entropy. [math]\displaystyle{ Y }[/math] is dynamically independent of [math]\displaystyle{ X }[/math] at time [math]\displaystyle{ t }[/math] if and only if the transfer entropy vanishes: [math]\displaystyle{ T_t(X \to Y)=0 }[/math].
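For discrete series, this transfer entropy can be estimated with a simple plug-in estimator. The sketch below (history length 1, binary variables, assumed toy data rather than anything from the paper) contrasts a fully coupled case with an independent one:

```python
import numpy as np
from collections import Counter

def transfer_entropy(x, y):
    """Plug-in estimate of T(X -> Y) with history length 1, for discrete series."""
    triples = Counter(zip(y[1:], y[:-1], x[:-1]))   # (y_t, y_{t-1}, x_{t-1})
    pairs = Counter(zip(y[1:], y[:-1]))             # (y_t, y_{t-1})
    n = len(y) - 1
    te = 0.0
    for (yt, yp, xp), c in triples.items():
        p_joint = c / n
        p_y_yx = c / sum(v for k, v in triples.items() if k[1:] == (yp, xp))
        p_y_y = pairs[(yt, yp)] / sum(v for k, v in pairs.items() if k[1] == yp)
        te += p_joint * np.log2(p_y_yx / p_y_y)
    return te

rng = np.random.default_rng(0)
x = rng.integers(0, 2, 10000)
y_coupled = np.concatenate([[0], x[:-1]])   # y_t = x_{t-1}: strong transfer
y_null = rng.integers(0, 2, 10000)          # independent of x: near-zero transfer
print(transfer_entropy(x, y_coupled))       # close to 1 bit
print(transfer_entropy(x, y_null))          # close to 0
```

Dynamic independence corresponds to the second case: the past of [math]X[/math] adds nothing to the prediction of [math]Y[/math] beyond [math]Y[/math]'s own past.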

The concept of dynamic independence can be widely applied to a variety of complex dynamical systems, including neural systems, economic processes, and evolutionary processes. Through the coarse-graining method, the high-dimensional microscopic system can be simplified into a low-dimensional macroscopic system, thereby revealing the emergent structure in complex systems.

In the paper, the author verifies the theory experimentally in a linear system. The procedure is: 1) use the linear system to generate data from known parameters and laws; 2) set the coarse-graining function; 3) obtain the expression for the transfer entropy; 4) optimize for the coarse-graining with maximum decoupling (corresponding to minimum transfer entropy). The optimization can use transfer entropy as the objective and solve for the coarse-graining function by gradient descent, or use a genetic algorithm.

Example

The paper gives the example of a linear dynamical system whose dynamics follow a vector autoregressive model. By using genetic algorithms to iteratively evolve populations from different initial conditions, the degree of dynamic independence of the system gradually increases. The experiments also show that the coarse-graining scale affects how well dynamic independence can be optimized: it can be achieved at certain scales but not at others, so the choice of scale is important.

Comparison of several causal emergence theories

We can compare the above four different quantitative causal emergence theories from several different dimensions such as whether causality is considered, whether a coarse-graining function needs to be specified, the applicable dynamical systems, and quantitative indicators. The comparison is summarized in the following table:

Comparison of Different Quantitative Emergence Theories

| Method | Considers causality? | Involves coarse-graining? | Applicable dynamical systems | Measurement index |
|---|---|---|---|---|
| Hoel's causal emergence theory [1] | Dynamic causality; the definition of EI introduces the do-intervention | Requires specifying a coarse-graining method | Discrete Markov dynamics | Effective information |
| Rosas's causal emergence theory [36] | Approximates causality by correlation, characterized by mutual information | Not needed when judged via synergistic information; a coarse-graining method must be specified when calculated via unique information | Arbitrary dynamics | Information decomposition: synergistic information or unique information |
| Causal emergence theory based on reversibility [40] | Dynamic causality; EI is equivalent to approximate dynamical reversibility | Does not depend on a specific coarse-graining strategy | Discrete Markov dynamics | Approximate dynamical reversibility: [math]\displaystyle{ \Gamma_{\alpha} }[/math] |
| Dynamic independence [39] | Granger causality | Requires specifying a coarse-graining method | Arbitrary dynamics | Dynamic independence: transfer entropy |

Identification of causal emergence

Some works on quantifying emergence through measures of causal effects and other information-theoretic indicators have been introduced above. However, in practical applications we can often only collect observational data and cannot obtain the true dynamics of the system, so identifying whether causal emergence has occurred from observable data is the more important problem. The following sections describe two kinds of identification methods: approximate methods based on Rosas's causal emergence theory (one based on a mutual-information approximation and one based on machine learning), and the neural information squeezer (NIS, NIS+) method proposed by Zhang et al.

Approximate method based on Rosas's causal emergence theory

Rosas's causal emergence theory offers quantification based on synergistic information and on unique information. Of the two, the criterion based on unique information bypasses the combinatorial explosion over many variables, but it depends on the selection of the macroscopic variable [math]\displaystyle{ V }[/math], i.e., the coarse-graining method. To address this, the authors give two solutions: one is to simply specify a macroscopic variable [math]\displaystyle{ V }[/math]; the other is a machine learning-based method that lets the system automatically learn the macroscopic variable [math]\displaystyle{ V }[/math] by maximizing [math]\displaystyle{ \mathrm{\Psi} }[/math]. The two methods are introduced in detail below:

Method based on mutual information approximation

Although Rosas's causal emergence theory gives a strict definition of causal emergence, the calculation involves a combinatorial explosion over many variables, making the method difficult to apply to real systems. To solve this problem, Rosas et al. bypassed the exact calculation of unique and synergistic information [36] and proposed an approximate formula that requires only mutual information, together with a sufficient condition for the occurrence of causal emergence.

The authors proposed three new indicators based on mutual information, [math]\displaystyle{ \mathrm{\Psi} }[/math], [math]\displaystyle{ \mathrm{\Delta} }[/math] and [math]\displaystyle{ \mathrm{\Gamma} }[/math], which can be used to identify causal emergence, causal decoupling and downward causation in the system respectively. The specific calculation formulas of the three indicators are as follows:

  • Indicator for the judgement of causal emergence:

[math]\displaystyle{ \Psi_{t, t + 1}(V):=I\left(V_t ; V_{t + 1}\right)-\sum_j I\left(X_t^j ; V_{t + 1}\right) }[/math]  (1)

Here [math]\displaystyle{ X_t^j }[/math] is the j-th dimension of the microscopic variable at time t, and [math]\displaystyle{ V_t }[/math] and [math]\displaystyle{ V_{t + 1} }[/math] are the macroscopic variables at two consecutive moments. Rosas et al. stipulated that when [math]\displaystyle{ \mathrm{\Psi}\gt 0 }[/math], emergence occurs in the system; when [math]\displaystyle{ \mathrm{\Psi}\lt 0 }[/math], however, we cannot determine whether emergence occurs, because this condition is only a sufficient condition for causal emergence.

[math]\displaystyle{ \Delta_{t, t + 1}(V):=\max _j\left(I\left(V_t ; X_{t + 1}^j\right)-\sum_i I\left(X_t^i ; X_{t + 1}^j\right)\right) }[/math]

When [math]\displaystyle{ \mathrm{\Delta}\gt 0 }[/math], there exists downward causation from macroscopic variable [math]\displaystyle{ V }[/math] to microscopic variable [math]\displaystyle{ X }[/math].

[math]\displaystyle{ \Gamma_{t, t + 1}(V):=\max _j I\left(V_t ; X_{t + 1}^j\right) }[/math]

When [math]\displaystyle{ \mathrm{\Psi}\gt 0 }[/math] and [math]\displaystyle{ \mathrm{\Gamma}=0 }[/math], causal emergence occurs in the system in the form of causal decoupling.

[math]\displaystyle{ \mathrm{\Psi} }[/math] can be used to identify causal emergence because it is a lower bound of the unique information. We have the following relationship:

[math]\displaystyle{ Un(V_t;X_{t + 1}|X_t)\geq I\left(V_t ; V_{t + 1}\right)-\sum_j I\left(X_t^j ; V_{t + 1}\right)+Red(V_t, V_{t + 1};X_t) }[/math]

Since [math]\displaystyle{ Red(V_t, V_{t + 1};X_t) }[/math] is non-negative, we obtain a sufficient but not necessary condition for causal emergence: [math]\displaystyle{ \Psi_{t, t + 1}(V)\gt 0 }[/math].
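As a small numerical check of the [math]\Psi[/math] criterion, the sketch below builds an assumed toy system in which the macroscopic variable (the parity of two micro bits) is preserved over time while the micro bits are randomized, and estimates [math]\Psi[/math] with plug-in mutual information:

```python
import numpy as np
from collections import Counter

def mi(a, b):
    """Plug-in mutual information (bits) between two discrete series."""
    n = len(a)
    pab, pa, pb = Counter(zip(a, b)), Counter(a), Counter(b)
    return sum((c / n) * np.log2(c * n / (pa[i] * pb[j]))
               for (i, j), c in pab.items())

rng = np.random.default_rng(0)
T = 20000
# Assumed toy system: 2 binary micro variables; the macro variable V is their
# parity, and the dynamics preserve V while randomizing the micro bits.
x_t = rng.integers(0, 2, (T, 2))
v_t = x_t.sum(axis=1) % 2
x_next = rng.integers(0, 2, (T, 2))
flip = (x_next.sum(axis=1) % 2) != v_t     # fix the parity so that V_{t+1} = V_t
x_next[flip, 0] ^= 1
v_next = x_next.sum(axis=1) % 2

psi = mi(v_t, v_next) - mi(x_t[:, 0], v_next) - mi(x_t[:, 1], v_next)
print(psi)   # close to 1 bit: Psi > 0 flags causal emergence
```

Each micro bit alone carries no information about the parity, so the sum of single-variable terms is near zero while [math]I(V_t;V_{t+1})[/math] is near 1 bit, giving [math]\Psi \gt 0[/math].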

In summary, this method is relatively convenient to compute because it is based on mutual information, and it makes no Markov assumption about the dynamics of the system. However, the theory also has shortcomings: 1) the indicators [math]\displaystyle{ \mathrm{\Psi} }[/math], [math]\displaystyle{ \mathrm{\Delta} }[/math] and [math]\displaystyle{ \mathrm{\Gamma} }[/math] are based only on mutual information and do not account for causality; 2) the method provides only a sufficient condition for causal emergence; 3) the results depend heavily on the choice of macroscopic variable, with different choices leading to significantly different outcomes; 4) for systems with large amounts of redundant information or many variables, the computational complexity becomes very high. Moreover, since [math]\displaystyle{ \Psi }[/math] is an approximation, it incurs very large errors in high-dimensional systems and easily takes negative values, making it difficult to reliably determine whether causal emergence occurs.

To verify that the information related to macaque movement is an emergent feature of its cortical activity, Rosas et al. conducted an experiment using the electrocorticogram (ECoG) of macaques as the observational data for the microscopic dynamics. For the macroscopic state variable [math]\displaystyle{ V }[/math], they chose the time series of macaque limb movement trajectories obtained by motion capture (MoCap). The ECoG consisted of 64 channels and the MoCap data of 3 channels. Since the raw MoCap data do not satisfy the conditional independence assumption required of a supervenient feature, the authors used partial least squares and support vector machine algorithms to infer the part of the neural activity encoded in the ECoG signal that predicts macaque behavior, and took this information as an emergent feature of the underlying neural activity. Finally, based on the microscopic state and the computed macroscopic features, they verified the existence of causal emergence.

Machine learning-based method

Kaplanis et al. [40] use a representation learning method to learn the macroscopic state variable [math]\displaystyle{ V }[/math] by maximizing [math]\displaystyle{ \mathrm{\Psi} }[/math] (i.e., Equation 1). Specifically, a neural network [math]\displaystyle{ f_{\theta} }[/math] learns the representation function that coarse-grains the microscopic input [math]\displaystyle{ X_t }[/math] into the macroscopic output [math]\displaystyle{ V_t }[/math], while neural networks [math]\displaystyle{ g_{\phi} }[/math] and [math]\displaystyle{ h_{\xi} }[/math] learn to estimate the mutual information terms [math]\displaystyle{ I(V_t;V_{t + 1}) }[/math] and [math]\displaystyle{ \sum_i I(V_{t + 1};X_{t}^i) }[/math], respectively. Finally, the method trains the networks by maximizing the difference between the two terms (i.e., [math]\displaystyle{ \mathrm{\Psi} }[/math]). The architecture of this neural network system is shown in figure a below.

Figure b shows a toy model example. The microscopic input [math]\displaystyle{ X_t(X_t^1,...,X_t^6) \in \{0,1\}^6 }[/math] has six binary dimensions, and [math]\displaystyle{ X_{t + 1} }[/math] is the state at the next moment. The macroscopic variable is [math]\displaystyle{ V_{t}=\oplus_{i = 1}^{5}X_t^i }[/math], the sum of the first five dimensions of the microscopic input [math]\displaystyle{ X_t }[/math] modulo 2. The macroscopic states at two consecutive moments are equal with probability [math]\displaystyle{ \gamma }[/math] ([math]\displaystyle{ p(\oplus_{j = 1}^{5}X_{t + 1}^j=\oplus_{j = 1}^{5}X_t^j)= \gamma }[/math]), and the sixth dimension of the microscopic input is equal at two consecutive moments with probability [math]\displaystyle{ \gamma_{extra} }[/math] ([math]\displaystyle{ p(X_{t + 1}^6=X_t^6)= \gamma_{extra} }[/math]).

The results show that, in the simple example of figure b, maximizing [math]\displaystyle{ \mathrm{\Psi} }[/math] with the model of figure a yields a learned [math]\displaystyle{ \mathrm{\Psi} }[/math] approximately equal to the ground truth, verifying the effectiveness of model learning and the system's ability to identify causal emergence. However, this method struggles with complex multivariate situations, because the number of neural networks on the right side of the figure is proportional to the number of macroscopic-microscopic variable pairs: the more microscopic variables (dimensions) there are, the more networks are needed, which increases the computational complexity. In addition, the method has been tested on only a few cases and has not yet been shown to scale. Moreover, since the networks compute an approximate index of causal emergence, the method inherits the limitations of the approximate algorithm above.

Neural information squeezer

In recent years, emerging artificial intelligence technologies have overcome a series of major problems. Machine learning methods, built on carefully designed neural network architectures and automatic differentiation, can approximate any function in a vast function space. Zhang et al. therefore proposed a data-driven method that uses neural networks to identify causal emergence from time series data [44][39]. This method can automatically extract effective coarse-graining strategies and macroscopic dynamics, overcoming various deficiencies of Rosas's method [36]. The framework is based on Erik Hoel's theory of causal emergence.

In this work, the input consists of time series data [math]\displaystyle{ (X_1,X_2,...,X_T ) }[/math], where [math]\displaystyle{ X_t\equiv (X_t^1,X_t^2,…,X_t^p ) }[/math] and [math]\displaystyle{ p }[/math] denotes the dimensionality of the input data. The author assumes that this set of data is generated by a general stochastic dynamical system:

[math]\displaystyle{ \frac{d X}{d t}=f(X(t), \xi) }[/math]

[math]X(t)[/math] represents the microscopic state variable, [math]f[/math] denotes the microscopic dynamics, and [math]\displaystyle{ \xi }[/math] represents the noise in the system. However, [math]\displaystyle{ f }[/math] is unknown.

The causal emergence identification problem is defined as the following functional optimization problem:

[math]\displaystyle{ \begin{aligned}&\max_{\phi,f_{q},\phi^{\dagger}}\mathcal{J}(f_{q}),\\&s.t.\begin{cases}\parallel\hat{X}_{t + 1}-X_{t + 1}\parallel\lt \epsilon,\\\hat{X}_{t + 1}=\phi^{\dagger}\bigl(f_{q}(\phi(X_{t}))\bigr).\end{cases}\end{aligned} }[/math]  (2)

[math]\mathcal{J}[/math] represents the dimension-averaged [math]\displaystyle{ EI }[/math] (see the entry effective information), and [math]\displaystyle{ \mathrm{\phi} }[/math] is the coarse-graining strategy function. [math]\displaystyle{ f_{q} }[/math] denotes the macroscopic dynamics, and [math]\displaystyle{ q }[/math] represents the dimension of the coarsened macroscopic state. [math]\hat{X}_{t + 1}[/math] is the prediction of the microscopic state at time [math]\displaystyle{ t + 1 }[/math] by the entire framework. This prediction is obtained by applying the inverse coarse-graining function ([math]\phi^{\dagger}[/math]) on the macroscopic state [math]\displaystyle{ \hat{Y}_{t + 1} }[/math] at time [math]\displaystyle{ t + 1 }[/math]. Here [math]\hat{Y}_{t + 1}\equiv f_q(Y_t)[/math] represents the prediction of the macroscopic state at time [math]\displaystyle{ t + 1 }[/math] by the dynamics learner, based on the macroscopic state [math]Y_t[/math] at time [math]\displaystyle{ t }[/math]. The macroscopic state [math]Y_t\equiv \phi(X_t)[/math] at time [math]\displaystyle{ t }[/math] is obtained by coarse-graining the microscopic state [math]X_t[/math] using the function [math]\phi[/math]. Finally, the difference between [math]\hat{X}_{t + 1}[/math] and the real microscopic state data [math]X_{t + 1}[/math] is compared to obtain the microscopic prediction error.

The entire optimization framework is shown below:

The NIS optimization framework

The objective function [math]\displaystyle{ EI }[/math] is a functional of [math]\phi,\hat{f}_q,\phi^{\dagger}[/math], and [math]q[/math] is a hyperparameter, which makes direct optimization difficult. Machine learning methods are therefore needed to solve it.

NIS

To identify causal emergence in the system, the authors propose a neural network architecture called the neural information squeezer (NIS) [44]. The architecture is based on an encoder-dynamics learner-decoder framework, consisting of three components: coarse-graining the raw data into macrostates, fitting the macrodynamics, and performing a reverse coarse-graining operation (decoding the macrostates, combined with random noise, back into microstates). The authors use invertible neural networks (INNs) to construct the encoder and decoder, which approximately correspond to the coarse-graining function [math]\phi[/math] and the inverse coarse-graining function [math]\phi^{\dagger}[/math], respectively. Invertible neural networks are used because they can be inverted to approximate the inverse coarse-graining function (i.e., [math]\phi^{\dagger}\approx \phi^{-1}[/math]). This framework can be regarded as a neural information squeezer: it compresses noisy microscopic state data into a macroscopic state through a narrow information channel, discarding irrelevant information. This strengthens the causality of the macroscopic dynamics while still allowing the model to recover predictions of the microscopic state via decoding. The model framework of the NIS method is shown in the following figure:

Specifically, the encoder function [math]\phi[/math] consists of two parts:

[math]\displaystyle{ \phi\equiv \chi\circ\psi }[/math]

Here [math]\psi[/math] is an invertible function implemented via an invertible neural network, and [math]\chi[/math] is a projection function that removes the last [math]\displaystyle{ p - q }[/math] dimensional components from the [math]\displaystyle{ p }[/math]-dimensional vector. Here, [math]\displaystyle{ p,q }[/math] represent the dimensions of the microscopic and macroscopic variables, respectively. [math]\displaystyle{ \circ }[/math] is the composition operation of functions.

The decoder is the function [math]\phi^{\dagger}[/math], which is defined as:

[math]\displaystyle{ \phi^{\dagger}(y)\equiv \psi^{-1}(y\oplus z) }[/math]

Here [math]z\sim\mathcal{N}\left(0,I_{p - q}\right)[/math] is a [math]p-q[/math]-dimensional Gaussian random vector.
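As a minimal numerical sketch of this encoder/decoder pair, the snippet below uses a fixed orthogonal matrix as a stand-in for the trained invertible network [math]\psi[/math]; all names and dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 4, 2  # microscopic and macroscopic dimensions (illustrative)

# An orthogonal matrix as a stand-in for the invertible network psi
A = np.linalg.qr(rng.normal(size=(p, p)))[0]
psi = lambda x: A @ x
psi_inv = lambda x: A.T @ x  # orthogonal, so the inverse is the transpose

def phi(x):
    """Encoder: invertible map psi followed by projection chi (keep first q dims)."""
    return psi(x)[:q]

def phi_dagger(y, z=None):
    """Decoder: pad y with Gaussian noise z and invert psi."""
    if z is None:
        z = rng.normal(size=p - q)
    return psi_inv(np.concatenate([y, z]))

x = rng.normal(size=p)
y = phi(x)             # macroscopic state, shape (q,)
x_hat = phi_dagger(y)  # stochastic microscopic reconstruction

# If the discarded components are reused as z, reconstruction is exact:
z_true = psi(x)[q:]
assert np.allclose(phi_dagger(y, z_true), x)
```

The last assertion illustrates why invertibility matters: the only information lost by the encoder is exactly the [math]p-q[/math] projected-away components, which the decoder replaces with noise.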

However, directly optimizing the dimension-averaged EI is difficult. The article [44] therefore does not optimize Equation 1 directly but adopts a trick: the authors separate the optimization process into two stages. The first stage minimizes the microscopic state prediction error for a given macroscopic scale [math]\displaystyle{ q }[/math], that is, [math]\displaystyle{ \min _{\phi, f_q, \phi^{\dagger}}\left\|\phi^{\dagger}(Y(t + 1)) - X_{t + 1}\right\|\lt \epsilon }[/math], yielding the optimal macroscopic dynamics [math]f_q^\ast[/math]; the second stage searches over the hyperparameter [math]\displaystyle{ q }[/math] to maximize the EI ([math]\mathcal{J}[/math]), that is, [math]\displaystyle{ \max_{q}\mathcal{J}(f_{q}^\ast) }[/math]. It has been proved that this method can effectively find macroscopic dynamics and coarse-graining functions, but it cannot guarantee that EI is truly maximized.
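The two-stage scheme can be illustrated on a toy system: a two-dimensional rotation (a stand-in for the spring oscillator) embedded in four microscopic dimensions, with the remaining dimensions pure noise. Here the coarse-graining is fixed to keeping the first q coordinates, the macroscopic dynamics is fitted by least squares, and a crude log-determinant proxy replaces the dimension-averaged EI; this is a schematic sketch of the two stages, not the actual NIS training procedure:

```python
import numpy as np

rng = np.random.default_rng(1)
p, T, theta = 4, 2000, 0.1
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Microscopic data: dims 0-1 follow a rotation (the "oscillator"),
# dims 2-3 are pure noise with no dynamics.
X = np.zeros((T, p))
X[0, :2] = [1.0, 0.0]
for t in range(T - 1):
    X[t + 1, :2] = R @ X[t, :2]
    X[t + 1, 2:] = rng.normal(size=2)

def stage_one(q):
    """Stage 1: with the scale q fixed, fit a linear macro dynamics f_q
    (the coarse-graining here is simply keeping the first q coordinates)."""
    Y, Y_next = X[:-1, :q], X[1:, :q]
    W, *_ = np.linalg.lstsq(Y, Y_next, rcond=None)  # Y_next ≈ Y @ W
    err = np.mean((Y @ W - Y_next) ** 2)
    return W, err

def ei_proxy(W):
    """Stage 2 objective: a crude determinism proxy for dimension-averaged EI
    of a linear map (log |det W| per dimension; illustrative only)."""
    return np.log(abs(np.linalg.det(W)) + 1e-12) / W.shape[0]

# Stage 2: search over the hyperparameter q
scores = {q: ei_proxy(stage_one(q)[0]) for q in range(1, p + 1)}
best_q = max(scores, key=scores.get)
print(best_q)  # the rotation needs exactly two macroscopic dimensions
```

Including the noise dimensions (q > 2) makes the fitted dynamics nearly singular and drives the proxy down, while q = 1 loses part of the deterministic rotation, so the search settles on q = 2.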

In addition to being able to automatically identify causal emergence based on time series data, this framework also has good theoretical properties. There are two important theorems:

Theorem 1 (the information bottleneck of the neural information squeezer): for any bijection [math]\displaystyle{ \mathrm{\psi} }[/math], projection [math]\displaystyle{ \chi }[/math], macroscopic dynamics [math]\displaystyle{ f }[/math] and Gaussian noise [math]\displaystyle{ z_{p - q}\sim\mathcal{N}\left(0,I_{p - q}\right) }[/math],

[math]\displaystyle{ I\left(Y_t;Y_{t + 1}\right)=I\left(X_t;{\hat{X}}_{t + 1}\right) }[/math]

always holds. This means that all the information discarded by the encoder is actually noise information unrelated to prediction.

Theorem 2: For a trained model, [math]\displaystyle{ I\left(X_t;{\hat{X}}_{t + 1}\right)\approx I\left(X_t;X_{t + 1}\right) }[/math].

Therefore, combining Theorem 1 and Theorem 2, we can obtain for a trained model:

[math]\displaystyle{ I\left(Y_t;Y_{t + 1}\right)\approx I\left(X_t;X_{t + 1}\right) }[/math]

Comparison with classical theories

The NIS framework has many similarities with the computational mechanics framework mentioned in the previous sections, and NIS can be regarded as a kind of [math]\displaystyle{ \epsilon }[/math]-machine. The set of all historical processes [math]\displaystyle{ \overleftarrow{S} }[/math] in computational mechanics can be regarded as the set of microscopic states, and the states [math]\displaystyle{ R \in \mathcal{R} }[/math] represent macroscopic states. The function [math]\displaystyle{ \eta }[/math] acts as a coarse-graining function, [math]\displaystyle{ \epsilon }[/math] represents an effective coarse-graining strategy, and [math]\displaystyle{ T }[/math] corresponds to an effective macroscopic dynamics. The minimum-randomness criterion characterizes the determinism of the macroscopic dynamics and can be replaced by EI in causal emergence. At the optimum, the encoded macroscopic state converges to the effective state, which can be regarded as the causal state in computational mechanics.

At the same time, the NIS framework also has similarities with the G-emergence theory mentioned earlier. For example, NIS also adopts the idea of Granger causality: optimizing the effective macroscopic state by predicting the microscopic state at the next time step. However, there are several obvious differences between these two frameworks: a) In the G-emergence theory, the macroscopic state needs to be manually selected, while in NIS, the macroscopic state is obtained by automatically optimizing the coarse-graining strategy; b) NIS uses neural networks to predict future states, whereas G-emergence relies on autoregressive techniques to fit the data.

Computational examples

The authors of NIS conducted experiments on the spring oscillator model, and the results are shown in the following figure. Figure a demonstrates that the results for the next moment align linearly with the iterated macroscopic dynamics, validating the model's effectiveness. Figure b illustrates that the learned dynamics and the real dynamics align, further validating the model. Figure c illustrates the model's accuracy in multi-step prediction, as the predicted and real curves are closely aligned. Figure d shows the degree of causal emergence at different scales. Causal emergence is most significant when the scale is 2, consistent with the real spring oscillator model, where only two state variables (position and velocity) are needed to describe the system.

The spring oscillator model

NIS+

Although NIS was the first framework for identifying causal emergence in data by optimizing EI, it has a shortcoming: the authors divide the optimization process into two stages but do not truly maximize the EI (that is, Equation 1) directly. Therefore, Yang et al. [39] further improved this method and proposed the NIS+ scheme. By introducing reverse dynamics and a reweighting technique, and applying a variational inequality, the original maximization of EI is transformed into maximizing its variational lower bound, enabling direct optimization of the objective function.

Mathematical principles

Specifically, according to variational inequality and inverse probability weighting method, the constrained optimization problem given by Equation 2 can be transformed into the following unconstrained minimization problem:

[math]\displaystyle{ \min_{\omega,\theta,\theta'} \sum_{t = 0}^{T - 1}w(\boldsymbol{x}_t)||\boldsymbol{y}_t-g_{\theta'}(\boldsymbol{y}_{t + 1})||+\lambda||\hat{\boldsymbol{x}}_{t + 1}-\boldsymbol{x}_{t + 1}|| }[/math]

Here [math]\displaystyle{ g }[/math] represents the reverse dynamics, which can be approximated by a neural network trained on pairs of macroscopic states [math]y_{t + 1},y_{t}[/math]. [math]\displaystyle{ w(x_t) }[/math] is the inverse probability weight, calculated as follows:

[math]\displaystyle{ w(\boldsymbol{x}_t)=\frac{\tilde{p}(\boldsymbol{y}_t)}{p(\boldsymbol{y}_t)}=\frac{\tilde{p}(\phi(\boldsymbol{x}_t))}{p(\phi(\boldsymbol{x}_t))} }[/math]

Here [math]\displaystyle{ \tilde{p}(\boldsymbol{y}_{t}) }[/math] is the target distribution and [math]\displaystyle{ p(\boldsymbol{y}_{t}) }[/math] is the original distribution of the data, [math]\phi[/math] is the coarse-graining function.
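A minimal sketch of this inverse probability weighting, assuming a uniform target distribution [math]\tilde{p}[/math] and a histogram estimate of the data distribution [math]p(y_t)[/math] (both are illustrative choices, not the exact estimators used in NIS+):

```python
import numpy as np

rng = np.random.default_rng(0)

# Macroscopic states y_t = phi(x_t); here a skewed 1-D sample (illustrative)
y = rng.exponential(scale=1.0, size=5000)

# Estimate the data distribution p(y) with a histogram
edges = np.linspace(0, y.max(), 50)
hist, edges = np.histogram(y, bins=edges, density=True)
idx = np.clip(np.digitize(y, edges) - 1, 0, len(hist) - 1)
p_y = np.maximum(hist[idx], 1e-12)

# Target distribution: uniform over the observed range
p_target = 1.0 / (y.max() - y.min())

w = p_target / p_y  # inverse probability weights w(x_t)
w /= w.mean()       # normalize so the effective sample size is unchanged

# Rare (large) y values get up-weighted, common (small) ones down-weighted
assert w[np.argmax(y)] > w[np.argmin(y)]
```

The effect is to make under-represented regions of the macroscopic state space count more in the loss, so the learned dynamics is not dominated by the most frequently visited states.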

Workflow and model architecture

The following figure shows the entire model framework of NIS+. Figure a shows the input of the model: time series data, such as trajectory sequences, continuous image sequences, and EEG time series; Figure c shows the output of the model, which includes the degree of causal emergence, the macroscopic dynamics, emergent patterns, and the coarse-graining strategy; Figure b shows the specific model architecture. Compared with the NIS method, reverse dynamics training and the reweighting technique are added.

The NIS+ model framework

Case analysis

The article conducts experiments on datasets including SIR dynamics, the Boids model, Game of Life, and fMRI signals from the brain nervous system of human subjects. This section focuses on the bird flock and brain signal experiments for detailed analysis.

The following figure shows the experimental results of NIS+ learning the flocking behavior of the Boids model. (a) and (e) give the actual and predicted trajectories of bird flocks under different conditions. Specifically, the authors divide the bird flock into two groups and compare the multi-step prediction results under different noise levels ([math]\displaystyle{ \alpha }[/math] is 0.001 and 0.4, respectively). Prediction is good when the noise is relatively small, and the predicted curves diverge when the noise is relatively large. (b) shows that the mean absolute error (MAE) of multi-step prediction gradually increases as the radius r increases. (c) shows how the degree of causal emergence [math]\displaystyle{ \Delta J }[/math] and the prediction error (MAE) change with the training epoch for different dimensions (q). The authors find that causal emergence is most significant when the macroscopic state dimension q = 8. (d) is the attribution analysis of macroscopic variables with respect to microscopic variables; the resulting saliency map intuitively describes the learned coarse-graining function. Each macroscopic dimension can be associated with the spatial coordinates (microscopic dimensions) of each bird; the darker the color, the higher the correlation. The microscopic coordinates with the maximum correlation for each macroscopic state dimension are highlighted with orange dots. These attribution saliency values are obtained using the Integrated Gradients (IG) method. The horizontal axis represents the x and y coordinates of the 16 birds in the microscopic state, and the vertical axis represents the 8 macroscopic dimensions. The light blue dotted lines separate the coordinates of individual Boids, and the blue solid line separates the two flocks. (f) and (g) show the trends of the degree of causal emergence ΔJ and the normalized error (MAE) under different noise levels.
(f) shows the influence of external noise (that is, observation noise added to the microscopic data) on causal emergence. (g) shows the influence of internal noise (represented by α and introduced by modifying the dynamics of the Boids model). In (f) and (g), the horizontal line marks the threshold of the error constraint in Equation 1: when the normalized MAE exceeds 0.3, the constraint is violated and the result is unreliable.

Causal emergence in the bird flock (Boids model)

This set of experiments shows that NIS+ can learn macroscopic states and coarse-graining strategies by maximizing EI. This maximization improves the model's ability to generalize to situations outside the training data range. The learned macroscopic state effectively captures the average group behavior and can be attributed to individual positions using the Integrated Gradients method. The degree of causal emergence increases with external noise but decreases as internal noise increases. This result indicates that the model can eliminate external noise through coarse-graining but cannot mitigate internal noise.

The brain experiment uses fMRI data collected from two experiments involving 830 human subjects (the results are shown in the following figure). In the first group, subjects performed a visual task and were recorded while watching a short movie clip. In the second group, subjects were recorded in the resting state. Since the original dimension is relatively high, the authors reduced the original 14000-dimensional data to 100 dimensions using the Schaefer atlas, with each dimension corresponding to a brain region. The authors then trained NIS+ on these data and extracted the dynamics at six different macroscopic scales. Figure a shows the multi-step prediction errors at different scales. Figure b compares the EI of the NIS and NIS+ methods across macroscopic dimensions in the resting state and in the movie-watching visual task. The authors found that in the visual task, causal emergence is most significant when the macroscopic state dimension is q = 1. Attribution analysis shows that the visual area plays the largest role (Figure c), which is consistent with the real scene. Figure d shows different perspective views of the brain region attribution. In the resting state, one macroscopic dimension is not enough to predict the microscopic time series data, and the dimension with the most significant causal emergence lies between 3 and 7.

Causal emergence in the brain nervous system

These experiments demonstrate that NIS+ can identify causal emergence, discover macroscopic dynamics, and learn coarse-graining strategies. Additional experiments show that EI maximization also enhances the model's out-of-distribution generalization ability.

Applications

This subsection explores the potential applications of causal emergence in various complex systems. These include biological systems, neural networks, brain nervous systems, and artificial intelligence (such as causal representation learning, reinforcement learning based on world models, and causal model abstraction). Additional applications include consciousness research and Chinese classical philosophy.

Causal emergence in complex networks

In 2020, Klein and Hoel improved the method of quantifying causal emergence on Markov chains to apply to complex networks [45]. The authors defined the Markov chain in the network with the help of random walkers. Placing random walkers on nodes is equivalent to intervening on nodes. They then defined the transition probability matrix between nodes based on the random walk probability. At the same time, the authors establish a connection between EI and the connectivity of the network. Connectivity is characterized by the uncertainty in the weights of outgoing and incoming edges of the nodes. Based on this, the EI in complex networks is defined. For detailed methods, refer to causal emergence in complex networks.

The authors conducted experimental comparisons on artificial networks, including random networks (ER) and preferential attachment networks (PA), as well as four types of real networks. They found that for ER networks, EI depends only on the connection probability [math]\displaystyle{ p }[/math], and as the network size increases it converges to the value [math]\displaystyle{ -\log_2p }[/math]. A key finding is that the EI value exhibits a phase transition point, which appears approximately where the average degree ([math]\displaystyle{ \lt k\gt }[/math]) of the network equals [math]\displaystyle{ \log_2N }[/math]; beyond this point, the random network's structure contains no additional information as its size grows with increasing connection probability. For preferential attachment networks, EI depends on the power-law exponent α of the network's degree distribution: when [math]\displaystyle{ \alpha\lt 1.0 }[/math], EI increases as the network grows; when [math]\displaystyle{ \alpha\gt 1.0 }[/math], EI decreases with network size; and [math]\displaystyle{ \alpha = 1.0 }[/math] corresponds to the scale-free network, the critical boundary between these two regimes. For real networks, the authors found that biological networks have the lowest EI because they contain significant noise. This noise can, however, be removed through effective coarse-graining, which makes biological networks show more significant causal emergence than other types of networks. Technological networks, on the other hand, are sparse and non-degenerate, resulting in higher average efficiency, more specific node relationships, and the highest EI; it is correspondingly difficult to further increase the degree of causal emergence in these networks through coarse-graining.
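The ER result can be checked numerically. The rough sketch below computes the EI of a random-walk transition matrix as the entropy of the average out-distribution minus the average out-entropy, following the random-walker construction described above; the network size and connection probability are illustrative:

```python
import numpy as np

def effective_information(W):
    """EI of a random-walk transition matrix W (rows sum to 1):
    entropy of the average out-distribution minus the average out-entropy."""
    def H(dist):
        dist = dist[dist > 0]
        return -(dist * np.log2(dist)).sum()
    return H(W.mean(axis=0)) - np.mean([H(row) for row in W])

rng = np.random.default_rng(0)
N, p = 1000, 0.1

# ER network: random-walk transition matrix from the adjacency matrix
A = (rng.random((N, N)) < p).astype(float)
np.fill_diagonal(A, 0)
W = A / A.sum(axis=1, keepdims=True)

ei = effective_information(W)
print(ei, -np.log2(p))  # for large ER networks, EI is close to -log2(p)
```

Intuitively, each node's out-distribution is uniform over roughly [math]Np[/math] neighbors (out-entropy about [math]\log_2 Np[/math]) while the average out-distribution is nearly uniform over all [math]N[/math] nodes (entropy about [math]\log_2 N[/math]), so their difference approaches [math]-\log_2 p[/math].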

In this article, the authors use the greedy algorithm to coarse-grain the network. However, for large-scale networks, this algorithm is very inefficient. Subsequently, Griebenow et al. [46] proposed a method based on spectral clustering to identify causal emergence in preferential attachment networks. Compared with the greedy algorithm and the gradient descent algorithm, the spectral clustering algorithm has less computation time and the causal emergence of the found macroscopic network is also more significant.

Application on biological networks

Furthermore, Klein et al. extended the method of causal emergence in complex networks to more biological networks. As mentioned earlier, biological networks have more noise, which makes it difficult for us to understand their internal operating principles. This noise arises both from inherent system noise and measurement or observation errors. Klein et al. [47] further explored the relationship and specific meanings among noise, degeneracy and determinism in biological networks, and drew some interesting conclusions.

For example, high determinism in gene regulatory networks (GRN) can be understood as one gene almost certainly leading to the expression of another gene. High degeneracy is also prevalent in biological systems throughout evolution. These two factors make it unclear at what scale biological systems should be analyzed to better understand their functions. Klein et al. [48] analyzed protein interaction networks of more than 1800 species and found that networks at macroscopic scales have less noise and degeneracy. At the same time, compared with nodes that do not participate in macroscopic scales, nodes in macroscopic scale networks are more resilient. Therefore, in order to meet the requirements of evolution, biological networks need to evolve macroscopic scales to increase certainty to enhance network resilience and improve the effectiveness of information transmission.

Hoel et al. [49] further studied causal emergence in biological systems with the help of EI theory. The authors applied EI to gene regulatory networks to identify the most informative model of mammalian heart development. By quantifying causal emergence in the largest connected component of the Saccharomyces cerevisiae gene network, the article reveals that informative macroscopic scales are ubiquitous in biology, and that life mechanisms themselves often operate on macroscopic scales. The article also provides biologists with a computable tool to identify the most informative macroscopic scale, on the basis of which complex biological systems can be modeled, predicted, controlled and understood.

Swain et al. [50] explored the influence of the history of interactions in ant colonies on task allocation and task switching, using EI to study how noise spreads among ants. The results indicate that the history of interactions among ants affects task allocation, while specific interaction types determine noise levels. In addition, even when ants switch functional groups, the emergent cohesion of the colony ensures its stability. At the same time, ants in different functional groups play different roles in maintaining the cohesion of the colony.

Application on artificial neural networks

Marrow et al. [51] introduced EI into neural networks to quantify and track the changes in the causal structure during training. Here, EI is used to evaluate the degree of causal influence of nodes and edges on downstream targets of each layer. EI for each neural network layer is defined as:

[math]\displaystyle{ I(L_1;L_2|do(L_1=H^{max})) }[/math]

Here, [math]\displaystyle{ L_1 }[/math] and [math]\displaystyle{ L_2 }[/math] denote the input and output layers of a connection in the neural network. The input layer is intervened to follow a uniform (maximum-entropy) distribution, and the mutual information between cause and effect is then calculated. EI can be decomposed into sensitivity and degeneracy. Sensitivity is defined as:

[math]\displaystyle{ \sum_{(i \in L_1,j \in L_2)}I(t_i;t_j|do(i=H^{max})) }[/math]

Here, [math]\displaystyle{ i }[/math] and [math]\displaystyle{ j }[/math] denote individual neurons in the input and output layers, respectively, and [math]\displaystyle{ t_i }[/math] and [math]\displaystyle{ t_j }[/math] denote their states. These states are observed after intervening on [math]\displaystyle{ i }[/math] with the maximum-entropy distribution, while the mechanism of the neural network remains unchanged. In other words, input neuron [math]\displaystyle{ i }[/math] is intervened to follow a uniform distribution, this intervention induces changes in output neuron [math]\displaystyle{ j }[/math], and each term measures the mutual information between the two.

Note that sensitivity should be distinguished from the definition of EI: here each neuron in the input layer is do-intervened separately, and the pairwise mutual information terms are summed to give the sensitivity. Degeneracy is the difference between EI and sensitivity, defined as:

[math]\displaystyle{ I(L_1;L_2|do(L_1=H^{max}))-\sum_{(i \in L_1,j \in L_2)}I(t_i;t_j|do(i=H^{max})) }[/math].

Observing EI, including sensitivity and degeneracy changes during model training, reveals the model's generalization ability and aids scholars in understanding neural networks.
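A toy calculation can make the decomposition concrete. The sketch below treats the two layers as discrete variables connected by an AND gate and assumes that, when a single input neuron is intervened, the other input neuron is also sampled uniformly; the original work deals with continuous activations, so this is only schematic:

```python
import itertools
import numpy as np

def mutual_information(joint):
    """I(A;B) in bits from a joint probability table P[a, b]."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return (joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask])).sum()

# Toy "layer": two binary input neurons feeding one output neuron via AND
f = lambda x1, x2: x1 & x2

# EI: intervene the whole input layer with the maximum-entropy (uniform) distribution
joint = np.zeros((4, 2))
for x1, x2 in itertools.product([0, 1], repeat=2):
    joint[2 * x1 + x2, f(x1, x2)] += 0.25
ei = mutual_information(joint)

# Sensitivity: sum of pairwise I(t_i; t_j) with each input neuron intervened uniformly
def pairwise_mi(i):
    j = np.zeros((2, 2))
    for x1, x2 in itertools.product([0, 1], repeat=2):
        j[(x1, x2)[i], f(x1, x2)] += 0.25
    return mutual_information(j)

sensitivity = pairwise_mi(0) + pairwise_mi(1)
degeneracy = ei - sensitivity
print(round(ei, 3), round(sensitivity, 3), round(degeneracy, 3))  # → 0.811 0.623 0.189
```

The positive degeneracy reflects the many-to-one structure of AND: several input states map to the same output, so the layer-level EI exceeds what the per-neuron terms capture.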

Application on brain nervous system

The brain nervous system is an emergent multi-scale complex system. Luppi et al. [52] used integrated information decomposition to reveal the synergistic workspace of human consciousness. The authors constructed a three-layer architecture of brain cognition, comprising the external environment, specific modules, and a synergistic global workspace. The working principle of the brain includes three stages: the first stage collects information from multiple different modules into the workspace; the second stage integrates the collected information within the workspace; the third stage broadcasts global information to other parts of the brain. The authors conducted experiments on three types of fMRI data in different states: 100 normal subjects, 15 subjects participating in anesthesia experiments (covering three states: before anesthesia, during anesthesia, and recovery), and 22 subjects with chronic disorders of consciousness (DOC). The article uses integrated information decomposition to obtain synergistic and redundant information. It then employs the revised integrated information value [math]\displaystyle{ \Phi_R }[/math] to calculate the synergy and redundancy values between each pair of brain regions, determining whether each brain region plays a greater role in synergy or in redundancy. By comparison with the data of conscious individuals, they found that regions with significantly reduced integrated information in unconscious individuals corresponded to brain regions dominated by synergistic information. Moreover, these regions all belonged to functional networks such as the DMN (Default Mode Network), allowing the authors to locate brain regions that play a significant role in the emergence of consciousness.

Application in artificial intelligence systems

Causal emergence theory has significant connections to artificial intelligence. These connections are evident in the following ways. First, the machine learning solution to the causal emergence identification problem is actually an application of causal representation learning. Second, technologies like maximizing EI are anticipated to have applications in causal machine learning.

Causal representation learning

Causal representation learning is an emerging field in artificial intelligence. It attempts to combine two important fields in machine learning: representation learning and causal inference. By integrating their respective advantages, it aims to automatically extract important features and uncover causal relationships hidden within the data [53]. Causal emergence identification based on EI can be equivalent to a causal representation learning task. Identifying the emergence of causal relationships from data is equivalent to learning the underlying causal relationships and causal mechanisms behind the data. Specifically, we can regard the macroscopic state as a causal variable and the macroscopic dynamics as a causal mechanism by analogy. The coarse-graining strategy can be seen as an encoding process or representation from the original data to the causal variable, while the EI serves as a measure of the strength of causal effect of the mechanism.

Since there are many similarities between the two, the techniques and concepts of the two fields can learn from each other. For example, causal representation learning technology can be applied to causal emergence identification. In turn, the learned abstract causal representation can be interpreted as a macroscopic state, thereby enhancing the interpretability of causal representation learning. However, there are also significant differences between the two, mainly including two points: 1) Causal representation learning assumes that there is a real causal mechanism behind it, and the data is generated by this causal mechanism. However, there may not be a "true causal relationship" between the emergent macroscopic states and dynamics; 2) The macroscopic state after coarse-graining in causal emergence is a low-dimensional description, but there is no such requirement in causal representation learning. From an epistemological perspective, there is no difference between the two, because what both do is to extract EI from observational data, so as to obtain a representation with stronger causal effect.

To better compare causal representation learning and causal emergence identification tasks, we list the following table:

Comparison of causal representation learning and causal emergence identification
Comparison | Causal representation learning | Causal emergence identification
Data | The macroscopic states generated by certain causal mechanisms in real life | Observations of microscopic states (time series)
Latent variable | Causal representation | Macroscopic state
Causal mechanism | Causal mechanism | Macroscopic dynamics
Mapping between data and latent variables | Representation | Coarse-graining function
Causal relationship optimization | Prediction loss, disentanglement | EI maximization
Goal | Finding the optimal representation of the original data to ensure that an independent causal mechanism can be achieved through the representation | Finding an effective coarse-graining strategy and macroscopic dynamics with strong causal effects

Application of effective information in causal machine learning

Causal emergence can enhance the performance of machine learning in out-of-distribution scenarios. The do-intervention introduced in [math]\displaystyle{ EI }[/math] captures the causal dependence in the data-generation process and suppresses spurious correlations, supplementing association-based machine learning algorithms and establishing a connection between [math]\displaystyle{ EI }[/math] and out-of-distribution (OOD) generalization [54]. Due to the universality of EI, causal emergence can be applied to supervised machine learning to evaluate the strength of the causal relationship between the feature space [math]\displaystyle{ X }[/math] and the target space [math]\displaystyle{ Y }[/math], thereby improving prediction from cause (feature) to effect (target). It is worth noting that directly fitting observations from [math]\displaystyle{ X }[/math] to [math]\displaystyle{ Y }[/math] works well for standard prediction tasks under the i.i.d. assumption, that is, when the training data and test data are independent and identically distributed. However, if samples are drawn from outside the training distribution, a representation space that generalizes from the training to the test environment must be learned. It is generally believed that causality generalizes better than statistical correlation [55]; thus, causal emergence theory can serve as a criterion for embedding causal relationships in the representation space. The occurrence of causal emergence reveals the potential causal factors of the target, thereby producing a representation space that is robust for out-of-distribution generalization. Causal emergence may therefore provide a unified, causality-based representation measure for out-of-distribution generalization. [math]\displaystyle{ EI }[/math] can also be regarded as an information-theoretic abstraction of the reweighting-based debiasing techniques used in out-of-distribution generalization.
In addition, we conjecture that out-of-distribution generalization can be achieved while maximizing [math]\displaystyle{ EI }[/math]: [math]\displaystyle{ EI }[/math] may reach its peak at an intermediate stage of feature abstraction, which aligns with the "less is more" idea of OOD generalization. Ideally, when causal emergence occurs at the peak of [math]\displaystyle{ EI }[/math], all non-causal features are excluded and the causal features are revealed, resulting in the most informative representation.

Causal model abstraction

In complex systems, noisy microscopic states are coarse-grained to produce macroscopic states with less noise, strengthening the causality of macroscopic dynamics. The same is true for causal models that explain various types of data. Due to the excessive complexity of the original model or limited computing resources, people often need to obtain a more abstract causal model. It is crucial to ensure that the abstract model maintains the causal mechanism of the original model as much as possible. This process is referred to as causal model abstraction.

Causal model abstraction belongs to a subfield of artificial intelligence and plays a particularly important role in causal inference and model interpretability. This abstraction can help us better understand the hidden causal mechanisms in the data and the interactions between variables. Causal model abstraction involves optimizing a high-level model to simulate the causal effects of a low-level model [56]. A high-level model that generalizes the causal effects of a low-level model is referred to as a causal abstraction of the low-level model.

Causal model abstraction discusses the interaction between causal relationships and model abstraction, which can be viewed as a coarse-graining process [57]. Therefore, causal emergence identification and causal model abstraction have many similarities. The original causal mechanism can be understood as microscopic dynamics, and the abstracted mechanism can be understood as macroscopic dynamics. In the neural information squeezer (NIS), researchers place restrictions on coarse-graining strategies and macroscopic dynamics. They require that the microscopic prediction error of macroscopic dynamics be small enough to exclude trivial solutions. This requirement aligns with causal model abstraction's goal of ensuring the abstracted causal model closely resembles the original model. However, there are also some differences between the two: 1) Causal emergence identification is to coarse-grain states or data, while causal model abstraction is to perform coarse-graining operations on models; 2) Causal model abstraction considers confounding factors, but this point is ignored in the discussion of causal emergence identification.

Reinforcement learning based on world models

Reinforcement learning based on world models assumes that the agent contains a world model to simulate the dynamics of its environment [58]. The dynamics of the world model can be learned through the agent's interaction with the environment, helping the agent plan and make decisions under uncertainty. At the same time, in order to represent a complex environment, the world model must be a coarse-grained description of that environment. A typical world model architecture contains an encoder and a decoder.

Reinforcement learning based on world models also has many similarities with causal emergence identification. The world model can be regarded as a macroscopic dynamics, and its internal states as macroscopic states: compressed states that ignore irrelevant information while capturing the most important causal features of the environment, so that the agent can make better decisions. During planning, the agent can also use the world model to simulate the dynamics of the real world.

The similarities between the two fields mean that ideas and techniques can be borrowed in both directions. For example, an agent equipped with a world model can interact with a complex system as a whole and extract emergent causal laws from that interaction, thereby aiding the task of causal emergence identification. Conversely, techniques for maximizing EI can enhance the causal characteristics of world models in reinforcement learning.

Other potential applications

In addition to the application fields above, the theory of causal emergence may bear on other important problems, such as research on consciousness and a modern scientific reading of Chinese classical philosophy.

Consciousness research

First of all, the proposal of the causal emergence theory is closely related to consciousness science. This is because its core indicator, EI, was first proposed by Tononi within integrated information theory, a quantitative theory of consciousness. Erik Hoel later modified EI, applied it to Markov chains, and proposed the concept of causal emergence. EI can therefore be seen as a by-product of quantitative consciousness science.

Secondly, causal emergence, as an important concept in complex systems, also plays an important role in consciousness science. A core question in consciousness research is whether consciousness is a macroscopic or a microscopic phenomenon. So far there is no direct evidence indicating the scale at which consciousness occurs. In-depth research on causal emergence, especially combined with experimental neural data from the brain, may answer this question.

Thirdly, causal emergence may bear on the question of free will: is free will real, or an illusion? If we accept the concept of causal emergence and admit that macroscopic variables have causal power over microscopic variables, then our decisions are in fact made spontaneously by the brain system as a whole, and consciousness is merely an explanation of this complex decision-making process at a certain level. On this view, free will is an emergent form of downward causation. Answers to these questions await further development of the causal emergence theory.

Chinese classical philosophy

Different from Western science and philosophy, Chinese classical philosophy retains a complete and distinct theoretical framework for explaining the universe. This framework includes yin and yang, the five elements, and the eight trigrams, as well as practices like divination, feng shui, and traditional Chinese medicine, offering independent explanations for various phenomena in the universe. For a long time, the Eastern and Western philosophical traditions have been difficult to integrate. The idea of causal emergence may offer a new way to bridge the conflict between them.

According to the causal emergence theory, the quality of a theory depends on the strength of its causality, that is, the size of [math]\displaystyle{ EI }[/math], and different coarse-graining schemes result in distinct macroscopic theories (macroscopic dynamics). It is very likely that, facing the same complex system, the Western philosophical and scientific tradition gives a relatively specific, microscopic set of causal mechanisms (dynamics), while Eastern philosophy gives a more coarsely grained, macroscopic set. According to the causal emergence theory, or the Causal Equivalence Principle proposed by Yurchenko, the two may well be compatible with each other: for the same set of phenomena, East and West can make correct predictions, and even devise interventions, according to two different sets of causal mechanisms. For certain problems or phenomena a more macroscopic causal mechanism may be more explanatory or lead to a better solution; for others, a more microscopic causal mechanism may be more favorable.

For example, in Eastern philosophy the five elements can be interpreted as macroscopic states, with their mutual generation and restraint representing a macroscopic causal mechanism. The process of identifying these five states from all phenomena is a coarse-graining process that depends on the observer's capacity for analogy. The theory of the five elements can thus be regarded as an abstract causal emergence theory for everything. Similarly, the concept of causal emergence can be extended to fields like traditional Chinese medicine, divination, and feng shui. What these applications have in common is that their causal mechanisms are simpler, and possibly exhibit stronger causality, than those of Western science. However, obtaining such an abstract coarse-graining is more complex and depends more heavily on experienced abstractors. This explains why Eastern philosophies emphasize the self-cultivation of practitioners: their theories place significant complexity and computational demands on analogical thinking, i.e., on the coarse-graining process.

Critique

The theory of causal emergence, first proposed by Erik Hoel and others on the basis of maximizing EI or of SVD, characterizes the phenomenon of emergence well and has been widely applied to many real systems. Some scholars, however, have pointed out that the theory still has flaws and drawbacks, mainly at the philosophical level and at the technical level of coarse-graining Markov dynamical systems.

From the philosophical aspect

Throughout history, there has been a long-standing debate on the ontological and epistemological aspects of causality and emergence.

For example, Yurchenko pointed out in [59] that the concept of "causation" is often vague and should be split into two notions, cause and reason, corresponding to ontological and epistemological causation respectively. A cause is the real cause that fully leads to an effect, while a reason is only the observer's explanation of that effect; a reason may not be as strict as a real cause, but it does provide a certain degree of predictability. There is a similar debate about the nature of causal emergence.

Is causal emergence a real phenomenon that exists independently of a specific observer? It should be emphasized that in Hoel's theory, different coarse-graining strategies can lead to different macroscopic dynamical mechanisms and different measures of the strength of causal effects ([math]\displaystyle{ EI }[/math]); essentially, different coarse-graining strategies represent different observers. Hoel's theory links emergence with causality through intervention and introduces causal emergence in a quantitative way, and it offers a scheme to eliminate the influence of the choice of coarse-graining strategy: maximizing [math]\displaystyle{ EI }[/math]. For a given Markov dynamics, only the coarse-graining strategy and corresponding macroscopic dynamics that maximize [math]\displaystyle{ EI }[/math] can then be considered objective results. However, if the strategy that maximizes [math]\displaystyle{ EI }[/math] is not unique, that is, if multiple coarse-graining strategies attain the maximum, this introduces theoretical difficulties and unavoidable subjectivity.

Dewhurst [60] provides a philosophical clarification of Hoel's theory, arguing that it is epistemological rather than ontological. This indicates that Hoel's macroscopic causality is only a causal explanation based on information theory and does not involve "true causality". This also raises questions about the assumption of uniform distribution (see the entry for EI), as there is no evidence that it should be superior to other distributions.

From the technical aspect

Non-uniqueness

The result of causal emergence is defined as the difference between the EI of the macroscopic dynamics after coarse-graining and the EI of the original microscopic dynamics. Therefore, this result obviously depends on the choice of the coarse-graining strategy. To address this uncertainty, Hoel and others proposed maximizing EI as the basis for judging and measuring causal emergence. However, there is currently no theoretical guarantee that this way of maximizing the EI of the macroscopic dynamics can ensure the uniqueness of the coarse-graining strategy. In other words, it is entirely possible that multiple coarse-graining strategies will correspond to the same EI of the macroscopic dynamics. In fact, studies on continuous mapping dynamical systems have already shown that there are infinitely many possibilities for the solutions of maximizing EI [61].
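As a minimal sketch of this comparison, the snippet below computes Hoel's EI for a discrete TPM (the mean KL divergence, in bits, of the rows from the average row under a uniform intervention; see the entry for EI) and shows a toy case where coarse-graining increases EI. The TPM and the partition are illustrative, not taken from the literature.

```python
import numpy as np

def effective_information(tpm):
    """EI of a TPM under a uniform (maximum-entropy) intervention:
    the average KL divergence (in bits) of each row from the mean row."""
    avg = tpm.mean(axis=0)
    kl = lambda p, q: np.sum(p[p > 0] * np.log2(p[p > 0] / q[p > 0]))
    return np.mean([kl(row, avg) for row in tpm])

# Micro dynamics: states 0-2 hop uniformly among themselves; state 3 is fixed.
micro = np.array([
    [1/3, 1/3, 1/3, 0],
    [1/3, 1/3, 1/3, 0],
    [1/3, 1/3, 1/3, 0],
    [0,   0,   0,   1],
])
# Coarse-graining {0,1,2} -> A, {3} -> B yields deterministic macro dynamics.
macro = np.array([
    [1, 0],
    [0, 1],
])

ce = effective_information(macro) - effective_information(micro)
print(ce > 0)   # True: the macro dynamics has stronger causal effects
```

Here EI of the macro TPM is exactly 1 bit (deterministic two-state dynamics), while the noisy micro TPM scores lower, so the causal emergence measure is positive.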

Ambiguity

Critics also point out that Hoel's theory ignores constraints on the coarse-graining strategy, and that some coarse-graining strategies may lead to ambiguity and irrationality [62].

First, [62] highlights that coarse-graining the TPM is ambiguous if no constraints are imposed on the strategy. For example, when the two row vectors of the TPM corresponding to two states to be merged are very dissimilar, forcibly merging them (say, by averaging) causes ambiguity: it becomes unclear what an intervention on the merged macroscopic state means. Because the row vectors are dissimilar, an intervention on the macroscopic state cannot directly correspond to interventions on the microscopic states. If it is forcibly converted into an average over microscopic interventions, the differences between the microscopic states are ignored, and the further, contradictory problem of non-commutativity is triggered.

Non-commutativity

If two dissimilar row vectors are forcibly averaged, the resulting coarse-grained TPM may break the commutativity between the abstraction operation (i.e., coarse-graining) and marginalization (i.e., the time-evolution operator). For example, let [math]\displaystyle{ A_{m\times n} }[/math] be a state coarse-graining operation (combining n states into m states), chosen as the strategy that maximizes the EI of the macroscopic transition matrix, and let [math]\displaystyle{ (\cdot) \times (\cdot) }[/math] be a time coarse-graining operation (combining two time steps into one). Then [math]A_{m\times n}(TPM_{n\times n})[/math] denotes coarse-graining an [math]n\times n[/math] TPM, where for simplicity the coarse-graining is written as the product of the matrix [math]A[/math] and the matrix [math]TPM[/math].

Then, the commutativity condition of spatial coarse-graining and temporal coarse-graining is the following equation:

[math]\displaystyle{ A_{m\times n}(TPM_{n\times n}) \times A_{m\times n}(TPM_{n\times n}) = A_{m\times n}(TPM_{n\times n} \times TPM_{n\times n}) }[/math] (3)

The left side represents first coarse-graining the states at two consecutive time steps and then multiplying the two macroscopic TPMs to obtain a two-step transition matrix; the right side represents first multiplying the TPMs of two time steps to obtain the two-step evolution of the microscopic state, and then applying A to obtain the macroscopic TPM. If this equation is not satisfied, the coarse-graining operation causes discrepancies between the macroscopic and microscopic state evolutions. This implies that consistency constraints must be imposed on the coarse-graining strategy, such as the lumpability conditions of Markov chains. See the entry "Coarse-graining of Markov Chains".
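Condition (3) can be checked numerically. The sketch below uses an illustrative 3-state TPM and uniform within-group weights for row averaging (an assumption; other weightings are possible): commutativity holds when the merged rows are similar (here, in fact, lumpable) and fails when dissimilar rows are forcibly averaged.

```python
import numpy as np

# Illustrative 3-state micro TPM: rows 0 and 1 are similar, row 2 is not.
tpm = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.2, 0.8],
])

def coarse_grain(tpm, groups):
    """Hard-partition coarse-graining: average the rows of merged states
    (uniform weights, an assumption) and sum their columns, so each
    macro row still sums to 1."""
    m = len(groups)
    macro = np.zeros((m, m))
    for i, gi in enumerate(groups):
        for j, gj in enumerate(groups):
            macro[i, j] = tpm[np.ix_(gi, gj)].sum() / len(gi)
    return macro

def commutes(tpm, groups):
    """Check Eq. (3): coarse-grain then evolve vs. evolve then coarse-grain."""
    lhs = coarse_grain(tpm, groups) @ coarse_grain(tpm, groups)
    rhs = coarse_grain(tpm @ tpm, groups)
    return np.allclose(lhs, rhs)

print(commutes(tpm, [[0, 1], [2]]))   # True: similar rows, merging is safe
print(commutes(tpm, [[0], [1, 2]]))   # False: dissimilar rows break Eq. (3)
```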

However, as pointed out in the literature [39], the above problem can be alleviated by considering the error factor of the model while maximizing EI in the continuous variable space.

Other problems and remedies

Hoel's [math]\displaystyle{ EI }[/math] calculation and causal emergence quantification rely on two prerequisites: (1) known microscopic dynamics and (2) a known coarse-graining strategy. In practice, the two are rarely available simultaneously, particularly in observational studies, where both may be unknown. This limitation hinders the practical applicability of Hoel's theory.

Although machine learning techniques facilitate the learning of causal relationships and causal mechanisms and the identification of emergent properties, a key question remains: do the results reflect ontological causality and emergence, or are they merely epistemological? Introducing machine learning does not settle this debate, but it can help reduce subjectivity, because a machine learning agent can be regarded as an "objective" observer that makes judgments about causality and emergence independently of human observers. The problem of non-unique solutions, however, persists in this approach. Strictly speaking, the result of machine learning is epistemological, with the machine learning algorithm as the epistemic subject. This does not render the results meaningless: if the learner is well trained and the stated mathematical objective is effectively optimized, the result can also be considered objective, since the algorithm itself is objective and transparent. Combining machine learning methods can thus help us establish a theoretical framework for observers and study the interaction between observers and the complex systems they observe.

Related research fields

Several research fields are closely tied to causal emergence theory. This section explores the differences and connections with three related fields: reduction of dynamical models, dynamic mode decomposition, and simplification of Markov chains.

Dynamical model reduction

An important step in identifying causal emergence is the selection of a coarse-graining strategy. When the microscopic model is known, coarse-graining the microscopic state is equivalent to performing a model reduction on the microscopic model. Model reduction is an important subfield of control theory; Antoulas wrote a review of model reduction for large-scale dynamical systems [63].

Model reduction simplifies high-dimensional system dynamics, describing the evolution of the original system with low-dimensional dynamics; this is precisely the coarse-graining process studied in causal emergence. There are two main families of approximation methods for large-scale dynamical systems: methods based on singular value decomposition (SVD) [63][64] and Krylov-based methods [63][65][66], the latter resting on moment matching. Although SVD-based methods have many desirable properties, including error bounds, they do not scale to highly complex systems; Krylov-based methods, by contrast, can be implemented iteratively and are therefore suited to high-dimensional complex systems. Combining the advantages of the two gives rise to a third family, the SVD/Krylov methods [67][68]. All of these methods evaluate the reduction by an error loss function on the output before and after coarse-graining, so the goal of model reduction is to find the reduced parameter matrix that minimizes this error.

In general, the error loss function of the output before and after model reduction is used to judge the coarse-graining parameters. This implicitly assumes that reduction can only lose information, so minimizing the error becomes the sole criterion for the effectiveness of a reduction method. From the perspective of causal emergence, however, EI can increase under dimensionality reduction; this is the primary difference between coarse-graining strategies in causal emergence research and model reduction in control theory. For stochastic systems [65], directly calculating the loss function can be unstable due to randomness, making the effectiveness of the reduction harder to measure. EI and the causal emergence index, defined for stochastic dynamical systems, can improve the effectiveness of the evaluation criteria to a certain extent and make the control-theoretic study of stochastic dynamical systems more rigorous.
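As a minimal sketch of the SVD-based family, the snippet below reduces a linear system [math]x_{t+1}=Ax_t[/math] by projecting onto the dominant left singular vectors of a snapshot matrix (a POD-style reduction; the system, dimensions, and error threshold are all illustrative assumptions) and measures the reconstruction error that such methods aim to minimize.

```python
import numpy as np

rng = np.random.default_rng(0)

# An (illustrative) stable linear system x_{t+1} = A x_t whose dynamics
# live almost entirely in a 2-dimensional subspace.
n, r = 10, 2
U = np.linalg.qr(rng.standard_normal((n, r)))[0]          # dominant subspace
A = U @ np.diag([0.95, 0.9]) @ U.T + 0.001 * rng.standard_normal((n, n))

# Collect trajectory snapshots starting inside the dominant subspace,
# then take the top left singular vectors as the reduction basis.
x = U @ np.array([1.0, 1.0])
X = np.stack([x := A @ x for _ in range(20)], axis=1)
Ur = np.linalg.svd(X, full_matrices=False)[0][:, :r]

# Reduced (macroscopic) dynamics, and the relative reconstruction error
# that SVD-based reduction methods aim to minimize.
Ar = Ur.T @ A @ Ur
err = np.linalg.norm(A - Ur @ Ar @ Ur.T) / np.linalg.norm(A)
print(Ar.shape, err < 0.1)
```

The 10-dimensional dynamics is compressed into a 2-by-2 reduced operator with a small relative error, exactly the loss that causal emergence research would supplement with an EI comparison.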

Dynamical mode decomposition

In addition to the reduction of dynamical models, dynamic mode decomposition (DMD) is also closely related to coarse-graining. The basic idea of DMD [69][70] is to extract the dynamic information of a flow directly from flow-field data, finding a data-driven mapping based on the flow-field changes at different frequencies. The method transforms nonlinear, infinite-dimensional dynamics into finite-dimensional linear dynamics, using the Arnoldi method and singular value decomposition for dimensionality reduction. It draws on key features of time-series models such as ARIMA, SARIMA, and seasonal models, and is widely used in mathematics, physics, and finance [71]. DMD sorts the system by frequency, extracting the eigenfrequencies to analyze how flow structures at different frequencies contribute to the flow field; the eigenvectors of the different modes can then be used to predict the flow field. In the course of its applications the algorithm has also been improved, for example by combining it with SPA tests to verify the effectiveness of stock-price prediction against benchmarks, and by linking it with spectral analysis to simulate vibration patterns of the stock market in a circular economy.
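A minimal sketch of exact DMD on data from a known linear system (the system is illustrative): snapshot pairs are projected onto the SVD basis of the inputs, and the eigenvalues of the reduced operator recover the decay rates and frequencies of the dynamic modes.

```python
import numpy as np

rng = np.random.default_rng(1)

# Snapshot pairs (X, Y) from a linear system y = A x, the setting in
# which exact DMD recovers the dynamics (illustrative example).
theta = 0.3
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]]) * 0.95   # decaying rotation
X = rng.standard_normal((2, 100))
Y = A @ X

# Exact DMD: build the reduced operator in the SVD basis of X and diagonalize.
U, s, Vh = np.linalg.svd(X, full_matrices=False)
Atilde = U.T @ Y @ Vh.T @ np.diag(1.0 / s)   # reduced linear operator
eigvals, W = np.linalg.eig(Atilde)
modes = Y @ Vh.T @ np.diag(1.0 / s) @ W      # DMD modes

print(np.abs(eigvals))   # both magnitudes ≈ 0.95, the shared decay rate
```

The eigenvalue phases likewise recover the rotation frequency theta, which is how DMD separates flow structures by frequency.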

Dynamic mode decomposition reduces the dimension of variables, dynamics, and observation functions simultaneously via a linear transformation [72]. It thus resembles the coarse-graining strategy in causal emergence, but, like model reduction, it is optimized for error minimization rather than EI: both methods incur some information loss and may not enhance causal effects. The authors of [73] demonstrated that the error-minimization solution set contains the optimal solutions that maximize EI. Therefore, to optimize causal emergence, one can first minimize the error and then search for the best coarse-graining strategy within the error-minimization solution set.

Simplification of Markov chains

The simplification of Markov chains, also referred to as coarse-graining of Markov chains, is closely related to causal emergence: the coarse-graining process in causal emergence is essentially a model reduction of a Markov chain. Coarse-graining Markov processes [74] is an important problem in modeling state-transition systems; it reduces the complexity of a Markov chain by merging multiple states into one.

Simplification serves three main purposes. First, when studying a very large-scale system, it is unnecessary to track the changes of every microscopic state; coarse-graining filters out noise and heterogeneity to derive mesoscale or macroscopic laws from microscopic data. Second, some states have very similar transition probabilities and can be regarded as the same kind of state; clustering (partitioning, also called lumping) such states yields a smaller Markov chain and reduces the redundancy of the system's representation. Third, in reinforcement learning with Markov decision processes, coarse-graining the Markov chain reduces the size of the state space and improves training efficiency. In much of the literature, coarse-graining and dimension reduction are treated as equivalent [75].

There are two types of state-space partitioning: hard partitioning and soft partitioning. Soft partitioning breaks up microscopic states and reconstructs macroscopic states, allowing superposition. Hard partitioning strictly groups microscopic states, assigning several microscopic states to one group without overlap or superposition (see coarse-graining of Markov chains).

The coarse-graining of a Markov chain must be performed not only on the state space but also on the transition matrix: the original transition probability matrix (TPM) is simplified according to the state grouping to obtain a new, smaller TPM, and the state vector must be reduced as well. A complete coarse-graining process therefore considers states, the TPM, and the state vector together. This raises the questions of how to compute the TPM of the new Markov chain after the states are grouped, and whether the normalization condition can still be guaranteed.

In addition, the coarse-graining operation on the TPM is typically required to commute with the operation of time evolution, i.e., with the TPM itself. This condition ensures that evolving the state vector one step under the coarse-grained TPM (the macroscopic dynamics) is equivalent to first evolving the state vector one step under the microscopic TPM and then coarse-graining. It imposes a consistency requirement between the state grouping (the coarse-graining of states) and the coarse-graining of the TPM, and this requirement of commutativity leads to the notion of lumpability of Markov chains.

For any hard partition of the states we can define the concept of lumpability, a requirement on the clustering. The concept first appeared in Kemeny and Snell's Finite Markov Chains (1960) [76]. Lumpability is a mathematical condition that determines whether a hard-partitioned grouping of microscopic states is reducible to a macroscopic TPM; every hard partition of the state space induces a corresponding coarse-graining scheme for the TPM and for distributions [77].

Suppose a partition [math]\displaystyle{ \{A_1, A_2, \ldots, A_r\} }[/math] of the state space [math]\displaystyle{ S }[/math] is given, where each [math]A_i[/math] is a subset of [math]S[/math] and [math]A_i\cap A_j=\Phi[/math] for [math]i\neq j[/math], with [math]\Phi[/math] the empty set. Let [math]\displaystyle{ s^{(t)} }[/math] denote the microscopic state of the system at time [math]\displaystyle{ t }[/math]; the microscopic state space is [math]\displaystyle{ S=\{s_1, s_2,\ldots,s_n\} }[/math], whose elements [math]\displaystyle{ s_i\in S }[/math] are the elementary states of the Markov chain. Let the transition probability from microscopic state [math]\displaystyle{ s_k }[/math] to [math]\displaystyle{ s_m }[/math] be [math]\displaystyle{ p_{s_k \rightarrow s_m} = p(s^{(t)} = s_m | s^{(t-1)} = s_k) }[/math], and the transition probability from microscopic state [math]\displaystyle{ s_k }[/math] to macroscopic state [math]\displaystyle{ A_i }[/math] be [math]\displaystyle{ p_{s_k \rightarrow A_i} = p(s^{(t)} \in A_i | s^{(t-1)} = s_k) }[/math]. Then the necessary and sufficient condition for lumpability is that for every pair [math]\displaystyle{ A_i, A_j }[/math], the probability [math]\displaystyle{ p_{s_k \rightarrow A_j} }[/math] is the same for every state [math]\displaystyle{ s_k }[/math] in [math]\displaystyle{ A_i }[/math], that is:

[math]\displaystyle{ p_{s_k \rightarrow A_j} = \sum_{s_m \in A_j} p_{s_k \rightarrow s_m} = p_{A_i \rightarrow A_j}, \quad \forall s_k \in A_i }[/math] (4)
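Condition (4) is straightforward to check numerically. The sketch below (the TPM and partition are illustrative) tests lumpability for a hard partition and, when the condition holds, builds the macroscopic TPM from any representative state of each block.

```python
import numpy as np

# Micro TPM and a candidate hard partition of the state space.
P = np.array([
    [0.2, 0.3, 0.5],
    [0.4, 0.1, 0.5],   # states 0 and 1: same total mass into each block
    [0.3, 0.3, 0.4],
])
partition = [[0, 1], [2]]

def is_lumpable(P, partition, tol=1e-10):
    """Check condition (4): within each block A_i, every state must have
    the same total transition probability into each block A_j."""
    for Ai in partition:
        for Aj in partition:
            probs = P[Ai][:, Aj].sum(axis=1)   # p_{s_k -> A_j} for s_k in A_i
            if not np.allclose(probs, probs[0], atol=tol):
                return False
    return True

def lump(P, partition):
    """Macro TPM of a lumpable chain: p_{A_i -> A_j} from any representative."""
    return np.array([[P[Ai[0], Aj].sum() for Aj in partition] for Ai in partition])

print(is_lumpable(P, partition))   # True
print(lump(P, partition))          # [[0.5, 0.5], [0.6, 0.4]]
```

Merging dissimilar states instead (e.g., states 1 and 2 here) fails the check, which is exactly the situation where forced averaging produces the ambiguity and non-commutativity discussed in the Critique section.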

For specific methods of coarse-graining Markov chains, please refer to coarse-graining of Markov chains.

References

  1. Hoel E P, Albantakis L, Tononi G. Quantifying causal emergence shows that macro can beat micro[J]. Proceedings of the National Academy of Sciences, 2013, 110(49): 19790-19795.
  2. Hoel E P. When the map is better than the territory[J]. Entropy, 2017, 19(5): 188.
  3. Meehl P E, Sellars W. The concept of emergence[J]. Minnesota studies in the philosophy of science, 1956, 1: 239-252.
  4. Holland J H. Emergence: From chaos to order[M]. OUP Oxford, 2000.
  5. Anderson P W. More is different: broken symmetry and the nature of the hierarchical structure of science[J]. Science, 1972, 177(4047): 393-396.
  6. Holland, J.H. Hidden Order: How Adaptation Builds Complexity; Addison Wesley Longman Publishing Co., Inc.: Boston, MA, USA, 1996.
  7. Reynolds, C.W. Flocks, herds and schools: A distributed behavioral model. In Proceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques, Anaheim, CA, USA, 27–31 July 1987; pp. 25–34.
  8. Wei, J.; Tay, Y.; Bommasani, R.; Raffel, C.; Zoph, B.; Borgeaud, S.; Yogatama, D.; Bosma, M.; Zhou, D.; Metzler, D.; et al. Emergent abilities of large language models. arXiv 2022, arXiv:2206.07682.
  9. Bedau, M.A. Weak emergence. Philos. Perspect. 1997, 11, 375–399.
  10. Bedau, M. Downward causation and the autonomy of weak emergence. Principia Int. J. Epistemol. 2002, 6, 5–50.
  11. Harré, R. The Philosophies of Science; Oxford University Press: New York, NY, USA, 1985.
  12. Baas, N.A. Emergence, hierarchies, and hyperstructures. In Artificial Life III, SFI Studies in the Science of Complexity, XVII; Routledge: Abingdon, UK, 1994; pp. 515–537.
  13. Newman, D.V. Emergence and strange attractors. Philos. Sci. 1996, 63, 245–261.
  14. Kim, J. ‘Downward causation’ in emergentism and nonreductive physicalism. In Emergence or Reduction; Walter de Gruyter: Berlin, Germany, 1992; pp. 119–138.
  15. O’Connor, T. Emergent properties. Am. Philos. Q. 1994, 31, 91–104.
  16. Fromm, J. Types and forms of emergence. arXiv 2005, arXiv:nlin/0506028
  17. Bedau, M.A.; Humphreys, P. Emergence: Contemporary Readings in Philosophy and Science; MIT Press: Cambridge, MA, USA, 2008.
  18. Yurchenko, S.B. Can there be a synergistic core emerging in the brain hierarchy to control neural activity by downward causation? TechRxiv 2023.
  19. Pearl J. Causality[M]. Cambridge University Press, 2009.
  20. Granger C W. Investigating causal relations by econometric models and cross-spectral methods[J]. Econometrica: journal of the Econometric Society, 1969, 424-438.
  21. Pearl J. Models, reasoning and inference[J]. Cambridge, UK: Cambridge University Press, 2000, 19(2).
  22. Spirtes, P.; Glymour, C.; Scheines, R. Causation Prediction and Search, 2nd ed.; MIT Press: Cambridge, MA, USA, 2000.
  23. Chickering, D.M. Learning equivalence classes of Bayesian-network structures. J. Mach. Learn. Res. 2002, 2, 445–498.
  24. Eells, E. Probabilistic Causality; Cambridge University Press: Cambridge, UK, 1991; Volume 1
  25. Suppes, P. A probabilistic theory of causality. Br. J. Philos. Sci. 1973, 24, 409–410.
  26. J. P. Crutchfield, K. Young, Inferring statistical complexity, Physical Review Letters 63 (2) (1989) 105.
  27. A. K. Seth, Measuring emergence via nonlinear Granger causality, in: ALIFE, Vol. 2008, 2008, pp. 545–552.
  28. Crutchfield, J.P (1994). "The calculi of emergence: computation, dynamics and induction". Physica D: Nonlinear Phenomena. 75 (1–3): 11-54.
  29. Mnif, M.; Müller-Schloer, C. Quantitative emergence. In Organic Computing—A Paradigm Shift for Complex Systems; Springer: Basel, Switzerland, 2011; pp. 39–52.
  30. Fisch, D.; Jänicke, M.; Sick, B.; Müller-Schloer, C. Quantitative emergence–A refined approach based on divergence measures. In Proceedings of the 2010 Fourth IEEE International Conference on Self-Adaptive and Self-Organizing Systems, Budapest, Hungary, 27 September–1 October 2010; IEEE Computer Society: Washington, DC, USA, 2010; pp. 94–103.
  31. Fisch, D.; Jänicke, M.; Kalkowski, E.; Sick, B. Techniques for knowledge acquisition in dynamically changing environments. ACM Trans. Auton. Adapt. Syst. (TAAS) 2012, 7, 1–25.
  32. Holzer, R.; De Meer, H.; Bettstetter, C. On autonomy and emergence in self-organizing systems. In International Workshop on Self-Organizing Systems, Proceedings of the Third International Workshop, IWSOS 2008, Vienna, Austria, 10–12 December 2008; Springer: Berlin/Heidelberg, Germany, 2008; pp. 157–169.
  33. Holzer, R.; de Meer, H. Methods for approximations of quantitative measures in self-organizing systems. In Proceedings of the Self-Organizing Systems: 5th International Workshop, IWSOS 2011, Karlsruhe, Germany, 23–24 February 2011; Proceedings 5; Springer: Berlin/Heidelberg, Germany, 2011; pp. 1–15.
  34. Teo, Y.M.; Luong, B.L.; Szabo, C. Formalization of emergence in multi-agent systems. In Proceedings of the 1st ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, Montreal, QC, Canada, 19–22 May 2013; pp. 231–240.
  35. Szabo, C.; Teo, Y.M. Formalization of weak emergence in multiagent systems. ACM Trans. Model. Comput. Simul. (TOMACS) 2015, 26, 1–25.
  36. Rosas F E, Mediano P A, Jensen H J, et al. Reconciling emergences: An information-theoretic approach to identify causal emergence in multivariate data[J]. PLoS Computational Biology, 2020, 16(12): e1008289.
  37. Williams P L, Beer R D. Nonnegative decomposition of multivariate information[J]. arXiv preprint arXiv:1004.2515, 2010.
  38. P. A. Mediano, F. Rosas, R. L. Carhart-Harris, A. K. Seth, A. B. Barrett, Beyond integrated information: A taxonomy of information dynamics phenomena, arXiv preprint arXiv:1909.02297 (2019).
  39. Barnett L, Seth A K. Dynamical independence: discovering emergent macroscopic processes in complex dynamical systems. Physical Review E. 2023 Jul;108(1):014304.
  40. Zhang J, Tao R, Yuan B. Dynamical Reversibility and A New Theory of Causal Emergence. arXiv preprint arXiv:2402.15054. 2024 Feb 23.
  41. Tononi G, Sporns O. Measuring information integration[J]. BMC Neuroscience, 2003, 4: 31.
  42. Chvykov P; Hoel E. (2021). "Causal Geometry". Entropy. 23 (1): 24.
  43. Liu K; Yuan B; Zhang J (2024). "An Exact Theory of Causal Emergence for Linear Stochastic Iteration Systems". Entropy. 26 (8): 618.
  44. Zhang J, Liu K. Neural information squeezer for causal emergence[J]. Entropy, 2022, 25(1): 26.
  45. Klein B, Hoel E. The emergence of informative higher scales in complex networks[J]. Complexity, 2020, 2020: 1-12.
  46. Griebenow R, Klein B, Hoel E. Finding the right scale of a network: efficient identification of causal emergence through spectral clustering[J]. arXiv preprint arXiv:1908.07565, 2019.
  47. Klein B, Swain A, Byrum T, et al. Exploring noise, degeneracy and determinism in biological networks with the einet package[J]. Methods in Ecology and Evolution, 2022, 13(4): 799-804.
  48. Klein B, Hoel E, Swain A, et al. Evolution and emergence: higher order information structure in protein interactomes across the tree of life[J]. Integrative Biology, 2021, 13(12): 283-294.
  49. Hoel E, Levin M. Emergence of informative higher scales in biological systems: a computational toolkit for optimal prediction and control[J]. Communicative & Integrative Biology, 2020, 13(1): 108-118.
  50. Swain A, Williams S D, Di Felice L J, et al. Interactions and information: exploring task allocation in ant colonies using network analysis[J]. Animal Behaviour, 2022, 189: 69-81.
  51. Marrow S, Michaud E J, Hoel E. Examining the Causal Structures of Deep Neural Networks Using Information Theory[J]. Entropy, 2020, 22(12): 1429.
  52. Luppi AI, Mediano PA, Rosas FE, Allanson J, Pickard JD, Carhart-Harris RL, Williams GB, Craig MM, Finoia P, Owen AM, Naci L. A synergistic workspace for human consciousness revealed by integrated information decomposition. BioRxiv. 2020 Nov 26:2020-11.
  53. Schölkopf B, Locatello F, Bauer S, Ke N R, Kalchbrenner N, Goyal A, Bengio Y. Toward causal representation learning[J]. Proceedings of the IEEE, 2021, 109(5): 612–634.
  54. Yuan, B; Zhang, J; Lyu, A; Wu, J; Wang, Z; Yang, M; Liu, K; Mou, M; Cui, P (2024). "Emergence and causality in complex systems: A survey of causal emergence and related quantitative studies". Entropy. 26 (2): 108.
  55. Arjovsky, M.; Bottou, L.; Gulrajani, I.; Lopez-Paz, D. Invariant risk minimization. arXiv 2019, arXiv:1907.02893.
  56. Beckers, Sander, and Joseph Y. Halpern. "Abstracting causal models." Proceedings of the aaai conference on artificial intelligence. Vol. 33. No. 01. 2019.
  57. S. Beckers, F. Eberhardt, J. Y. Halpern, Approximate causal abstractions, in: Uncertainty in artificial intelligence, PMLR, 2020, pp. 606–615.
  58. D. Ha, J. Schmidhuber, World models, arXiv preprint arXiv:1803.10122 (2018).
  59. Yurchenko, S. B. (2023). Can there be a synergistic core emerging in the brain hierarchy to control neural activity by downward causation?. Authorea Preprints.
  60. Dewhurst, J. (2021). Causal emergence from EI: Neither causal nor emergent?. Thought: A Journal of Philosophy, 10(3), 158-168.
  61. Liu, K.W.; Yuan, B.; Zhang, J. (2024). "An Exact Theory of Causal Emergence for Linear Stochastic Iteration Systems". Entropy. 26 (8): 618.
  62. Eberhardt F, Lee L L. Causal emergence: When distortions in a map obscure the territory[J]. Philosophies, 2022, 7(2): 30.
  63. Antoulas A C. An overview of approximation methods for large-scale dynamical systems[J]. Annual Reviews in Control, 2005, 29(2): 181-190.
  64. Gallivan K, Grimme E, Van Dooren P. Asymptotic waveform evaluation via a Lanczos method[J]. Applied Mathematics Letters, 1994, 7(5): 75-80.
  65. De Villemagne C, Skelton R E. Model reductions using a projection formulation[J]. International Journal of Control, 1987, 46(6): 2141-2169. DOI: 10.1080/00207178708934040.
  66. Boley D L. Krylov space methods on state-space control models[J]. Circuits, Systems and Signal Processing, 1994, 13: 733-758.
  67. Gugercin S. An iterative SVD-Krylov based method for model reduction of large-scale dynamical systems[J]. Linear Algebra and its Applications, 2008, 428(8-9): 1964-1986.
  68. Khatibi M, Zargarzadeh H, Barzegaran M. Power system dynamic model reduction by means of an iterative SVD-Krylov model reduction method[C]//2016 IEEE Power & Energy Society Innovative Smart Grid Technologies Conference (ISGT). IEEE, 2016: 1-6.
  69. Schmid P J. Dynamic mode decomposition and its variants[J]. Annual Review of Fluid Mechanics, 2022, 54(1): 225-254.
  70. J. Proctor, S. Brunton and J. N. Kutz, Dynamic mode decomposition with control, arXiv:1409.6358
  71. J. Grosek and J. N. Kutz, Dynamic mode decomposition for real-time background/foreground separation in video, arXiv:1404.7592.
  72. B. Brunton, L. Johnson, J. Ojemann and J. N. Kutz, Extracting spatial-temporal coherent patterns in large-scale neural recordings using dynamic mode decomposition arXiv:1409.5496
  73. Liu K, Yuan B, Zhang J. An Exact Theory of Causal Emergence for Linear Stochastic Iteration Systems[J]. arXiv preprint arXiv:2405.09207, 2024.
  74. Zhang A, Wang M. Spectral state compression of markov processes[J]. IEEE transactions on information theory, 2019, 66(5): 3202-3231.
  75. Coarse graining. Encyclopedia of Mathematics. URL: http://encyclopediaofmath.org/index.php?title=Coarse_graining&oldid=16170
  76. Kemeny, John G., and J. Laurie Snell. Finite markov chains. Vol. 26. Princeton, NJ: van Nostrand, 1969. https://www.math.pku.edu.cn/teachers/yaoy/Fall2011/Kemeny-Snell_Chapter6.3-4.pdf
  77. Buchholz, Peter. "Exact and ordinary lumpability in finite Markov chains." Journal of applied probability 31.1 (1994): 59-75.


Editor's recommendation

The following resources can help readers better understand causal emergence:

Causal emergence reading club

The reading club deepens participants' understanding of concepts such as causality and emergence by reading cutting-edge literature. It focuses on seeking research directions that combine causality, emergence, and multi-scale analysis, and explores the automatic multi-scale modeling of complex systems.

The club also shares recently developed theories and tools, including causal emergence theory, machine-learning-driven renormalization techniques, and a cross-scale analysis framework based on self-referential dynamics that is still under development.

Emergence is undoubtedly one of the most mysterious phenomena in complex systems. The theory of causal emergence proposed by Erik Hoel offers a possible new explanation for this remarkable cross-level phenomenon: by applying cross-level coarse-graining (or renormalization) operations, we can obtain entirely different dynamics from the same dynamical system at different scales. Through this season's reading club, we hope to survey the frontier progress in this emerging field and derive new research topics from it.

The combination of emergence and causality gives rise to the concept of causal emergence, a theoretical framework that uses causality to quantitatively characterize emergence. In this season's reading club, we deepen our understanding of causality and emergence by reading frontier literature, seek research directions that combine causality, emergence, and multi-scale analysis, and explore the automatic multi-scale modeling of complex systems. The second season of the reading club focuses more closely on the relationship between causal inference and causal emergence and on quantitatively characterizing emergence.

The third season of the causal emergence reading club conducts further in-depth learning and discussion around the field's core research questions: the definition of causal emergence and the identification of causal emergence. It examines the core theories proposed by Erik Hoel, such as causal emergence and causal geometry, and carefully sorts out the methodologies involved, learning from and drawing on related research ideas in other fields such as dynamical reduction and latent-space dynamics learning. Finally, it discusses applications of causal emergence, including biological networks, brain networks, and emergence detection, in order to explore further practical scenarios.

Path recommendation

This entry is written by Wang Zhipeng, Zhang Jiang, and Liu Kaiwei, and proofread and revised by Zhang Jiang and Wang Zhipeng.

The content of this entry is derived from Wikipedia and public materials and complies with the CC 3.0 license.