Effective Information
Effective Information (EI) is a core concept in the theory of Causal Emergence, used to measure the strength of causal effects in Markov dynamics. Here, causal effect refers to the extent to which different input distributions lead to different output distributions when the dynamics are viewed as a black box. EI can typically be decomposed into two components: Determinism and Degeneracy. Determinism indicates how well the next state of the system can be predicted from its previous state, while Degeneracy indicates how hard it is to infer the previous state from the next state. A system with higher Determinism or lower Degeneracy has higher Effective Information. On this page, every [math]\log[/math] denotes a logarithm with base 2.

Historical Background

The concept of Effective Information (EI) was first introduced by Giulio Tononi in 2003 as a key measure in Integrated Information Theory[1]. A system is said to have a high degree of integration when there is a strong causal connection among its components, and EI is the metric used to quantify this degree of causal connection. In 2013, Giulio Tononi's student Erik Hoel further refined the concept of EI to quantitatively characterize emergence, leading to the development of the theory of Causal Emergence[2]. In this theory, Hoel used Judea Pearl’s "do" operator to modify the general mutual information metric, which made EI fundamentally different from mutual information: while mutual information measures correlation, EI, because of the "do" operator, measures causality. The article also introduced a normalized version of EI, referred to as Eff.

Traditionally, EI was primarily applied to discrete-state Markov chains. To extend it to continuous domains, P. Chvykov and E. Hoel proposed the theory of Causal Geometry in 2020, expanding EI's definition to function mappings with continuous state variables.
By incorporating Information Geometry, they explored a perturbative form of EI and compared it with Fisher Information. However, this method of calculating EI for continuous variables required the normally distributed variables to have infinitesimal variance, which is an overly stringent condition.

In 2022, to calculate EI in general feedforward neural networks, Zhang Jiang and Liu Kaiwei removed the variance constraint from the Causal Geometry approach and explored a more general form of EI. Nonetheless, a limitation remained: because a uniform distribution over the real numbers would strictly have to be defined over an infinite space, the calculation of EI involved a parameter [math]L[/math], representing the range of the uniform distribution. To avoid this issue and enable comparisons of EI at different levels of granularity, the authors proposed the concept of dimension-averaged EI. They found that the measure of causal emergence defined by dimension-averaged EI depends solely on the determinant of the neural network's Jacobian matrix and the variance of the random variables in the two compared dimensions, independent of other parameters such as [math]L[/math]. Additionally, dimension-averaged EI can be viewed as a normalized EI, or Eff.

Essentially, EI is a quantity that depends only on the dynamics of a Markov system, specifically on the Markov state transition matrix, and is independent of the distribution of state variables. However, this point was not previously highlighted. In a 2024 review, Yuan Bing and others emphasized this fact and provided an explicit form of EI that depends only on the Markov state transition matrix.
In their latest work on dynamical reversibility and causal emergence, Zhang Jiang and colleagues pointed out that EI actually characterizes the reversibility of the underlying Markov transition matrix, and they attempted to characterize the reversibility of Markov chain dynamics directly as a replacement for EI.

Overview

The EI metric is primarily used to measure the strength of causal effects in Markov dynamics. Unlike general causal inference theories, EI applies to cases where the dynamics (the Markov transition probability matrix) are known and no unknown variables (i.e., confounders) are present. Its core objective is to measure the strength of causal connections rather than to establish whether a causal effect exists. EI is therefore best suited to scenarios where a causal relationship between variables [math]X[/math] and [math]Y[/math] is already established. Formally, EI is a function of the causal mechanism (in a discrete-state Markov chain, the probability transition matrix) and is independent of other factors. The formal definition of EI is:

[math]EI(P) \equiv I(Y; X \mid do(X \sim U))[/math]

where [math]P[/math] represents the causal mechanism from [math]X[/math] to [math]Y[/math], which is a probability transition matrix with entries [math]p_{ij} \equiv Pr(Y=j|X=i)[/math]; [math]X[/math] is the cause variable, [math]Y[/math] is the effect variable, and [math]do(X \sim U)[/math] denotes the intervention on [math]X[/math] that changes its distribution to a uniform one. Under this intervention, and assuming the causal mechanism [math]P[/math] remains unchanged, [math]Y[/math] is indirectly affected by the intervention on [math]X[/math]. EI measures the mutual information between [math]X[/math] and [math]Y[/math] after this intervention.

The "do" operator is introduced to eliminate the influence of [math]X[/math]'s distribution on EI, ensuring that the final EI metric is a function of the causal mechanism [math]P[/math] alone and is independent of [math]X[/math]'s distribution.

Below are three examples of Markov chains, with their respective EI values included:

Markov Chain Examples
[math]\displaystyle{ P_1=\begin{pmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \end{pmatrix} }[/math], with [math]EI(P_1)=2\ bits,\ Det(P_1)=2\ bits,\ Deg(P_1)=0\ bits[/math]
[math]\displaystyle{ P_2=\begin{pmatrix} 1/3 & 1/3 & 1/3 & 0 \\ 1/3 & 1/3 & 1/3 & 0 \\ 1/3 & 1/3 & 1/3 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} }[/math], with [math]EI(P_2)=0.81\ bits,\ Det(P_2)=0.81\ bits,\ Deg(P_2)=0\ bits[/math]
[math]\displaystyle{ P_3=\begin{pmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix} }[/math], with [math]EI(P_3)=0.81\ bits,\ Det(P_3)=2\ bits,\ Deg(P_3)=1.19\ bits[/math]
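The values listed for the three example matrices can be checked numerically. The following is a minimal sketch, not a reference implementation. It assumes the commonly used expansions [math]Det = \log N - \langle H(P_i) \rangle[/math] (the average row entropy subtracted from [math]\log N[/math]) and [math]Deg = \log N - H(\bar{P})[/math] (where [math]\bar{P}[/math] is the average of the rows), neither of which is written out in the text above. The function names `entropy_bits` and `ei_decomposition` are hypothetical and chosen only for illustration.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits, with the convention 0 * log(0) = 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

def ei_decomposition(P):
    """Return (EI, Det, Deg) in bits for a square row-stochastic matrix P.

    Assumed formulas (not taken from the text above):
      Det = log2(N) - mean row entropy
      Deg = log2(N) - entropy of the average row
      EI  = Det - Deg
    """
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    mean_row_entropy = float(np.mean([entropy_bits(row) for row in P]))
    avg_output = P.mean(axis=0)  # distribution of Y when X is uniform
    det = float(np.log2(n)) - mean_row_entropy
    deg = float(np.log2(n)) - entropy_bits(avg_output)
    return det - deg, det, deg

P1 = np.array([[0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 1, 0, 0]])
P2 = np.array([[1/3, 1/3, 1/3, 0], [1/3, 1/3, 1/3, 0],
               [1/3, 1/3, 1/3, 0], [0, 0, 0, 1]])
P3 = np.array([[0, 0, 1, 0], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]])

for name, P in [("P1", P1), ("P2", P2), ("P3", P3)]:
    ei, det, deg = ei_decomposition(P)
    print(name, "EI =", round(ei, 2), "Det =", round(det, 2), "Deg =", round(deg, 2))
```

Up to floating-point rounding, the results should agree with the EI, Det, and Deg values given for [math]P_1[/math], [math]P_2[/math], and [math]P_3[/math].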
As we can see, the EI of the first matrix [math]P_1[/math] is higher than that of the second matrix [math]P_2[/math] because [math]P_1[/math] is fully deterministic: starting from a particular state, it transitions to another state with 100% probability. However, not every deterministic matrix has high EI, as matrix [math]P_3[/math] shows. Although its transition probabilities are also either 100% or 0, its last three states all transition to the first state, so we cannot tell which of them the system was in at the previous moment. This property is called degeneracy, and it lowers the EI. Hence, if a transition matrix has high determinism and low degeneracy, its EI will be high. Accordingly, EI can be decomposed as follows:

[math]EI = Det - Deg[/math]

where Det stands for Determinism and Deg stands for Degeneracy; EI is the difference between the two. The values of Det and Deg for each matrix are listed alongside the EI values above. The first transition probability matrix is a permutation matrix and is reversible; thus, it has the highest determinism, no degeneracy, and therefore the highest EI. In the second matrix, the first three states transition to one another with probability 1/3, resulting in the lowest determinism but also no degeneracy, yielding an EI of 0.81. The third matrix, despite having binary transitions, has high degeneracy because its last three states all transition to state 1, meaning we cannot infer their previous state. Thus, its EI equals that of the second matrix, 0.81.

Although EI was originally applied to discrete-state Markov chains, Zhang Jiang, Liu Kaiwei, and Yang Mingzhe extended the definition to more general continuous-variable cases. This extension builds on EI's original definition by intervening on the cause variable [math]X[/math] so that it follows a uniform distribution over a sufficiently large bounded region, [math][-\frac{L}{2}, \frac{L}{2}]^n[/math].
The causal mechanism is assumed to be a conditional probability that follows a Gaussian distribution with mean function [math]f(x)[/math] and covariance matrix [math]\Sigma[/math]. Based on this, the EI between the causal variables is then measured. The causal mechanism here is determined by the mapping [math]f(x)[/math] and the covariance matrix, which together define the conditional probability [math]Pr(y|x)[/math]. More detailed explanations follow.

The Do-Operator and Its Explanation

The original definition of effective information (EI) was based on discrete Markov chains. However, to expand its applicability, we explore a more general form of EI here.

Formal Definition

Consider two random variables, [math]X[/math] and [math]Y[/math], representing the cause variable and the effect variable, respectively. Let their value ranges be [math]\mathcal{X}[/math] and [math]\mathcal{Y}[/math]. The effective information (EI) from [math]X[/math] to [math]Y[/math] is defined as:

[math]EI \equiv I(X:Y \mid do(X \sim U(\mathcal{X}))) \equiv I(\tilde{X}:\tilde{Y})[/math]

Here, [math]do(X \sim U(\mathcal{X}))[/math] represents applying a do-intervention (or do-operator) on [math]X[/math], making it follow the uniform distribution [math]U(\mathcal{X})[/math] over [math]\mathcal{X}[/math], which is the maximum entropy distribution. [math]\tilde{X}[/math] and [math]\tilde{Y}[/math] denote [math]X[/math] and [math]Y[/math], respectively, after the do-intervention on [math]X[/math], where:

[math]Pr(\tilde{X}=x)=\frac{1}{\#(\mathcal{X})}[/math]

This means that the main difference between [math]\tilde{X}[/math] after the intervention and [math]X[/math] before the intervention lies in their distributions: [math]\tilde{X}[/math] follows a uniform distribution over [math]\mathcal{X}[/math], while [math]X[/math] may follow any arbitrary distribution. [math]\#(\mathcal{X})[/math] represents the cardinality of the set [math]\mathcal{X}[/math], that is, the number of elements in the set if it is finite.

According to Judea Pearl’s theory, the do-operator cuts off all causal arrows pointing to variable [math]X[/math] while keeping other factors unchanged, particularly the causal mechanism from [math]X[/math] to [math]Y[/math]. The causal mechanism is defined as the conditional probability of [math]Y[/math] taking any value [math]y\in\mathcal{Y}[/math] given that [math]X[/math] takes a value [math]x\in\mathcal{X}[/math]:

[math]f \equiv Pr(Y=y \mid X=x)[/math]

Under the intervention, this causal mechanism [math]f[/math] remains constant.
When no other variables are influencing the system, this leads to a change in the distribution of [math]Y[/math], which is indirectly intervened upon and becomes:

[math]Pr(\tilde{Y}=y)=\sum_{x\in\mathcal{X}}Pr(\tilde{X}=x)Pr(Y=y \mid X=x)=\sum_{x\in\mathcal{X}}\frac{Pr(Y=y \mid X=x)}{\#(\mathcal{X})}[/math]

Here, [math]\tilde{Y}[/math] is the effect variable after the do-intervention on [math]X[/math]; its distribution reflects how the distribution of [math]Y[/math] changes indirectly due to the intervention on [math]X[/math].

Therefore, the effective information (EI) of a causal mechanism [math]f[/math] is the mutual information between the intervened cause variable [math]\tilde{X}[/math] and the intervened effect variable [math]\tilde{Y}[/math].

Why Use the Do-Operator?

While EI is essentially a measure of mutual information, it differs from traditional mutual information by including the do-operator, which applies an intervention to the input variable. Why is this intervention necessary?

According to Judea Pearl’s ladder of causality, causal inference can be divided into three levels: association, intervention, and counterfactuals. The higher the level, the stronger the causal features. Directly estimating mutual information from observational data measures association. If we can intervene on the variables, i.e., set a variable to a specific value or make it follow a particular distribution, we move up to the intervention level. By introducing the do-operator in the definition of EI, we allow EI to capture causal features more effectively than mutual information alone.

From a practical perspective, incorporating the do-operator in EI’s calculation separates the data from the dynamics, eliminating the effect of the data distribution (i.e., the distribution of [math]X[/math]) on the EI measurement. In causal graphs, the do-operator cuts off all causal arrows pointing to the intervened variable, preventing confounding factors from creating spurious associations. Similarly, in EI’s definition, the do-operator removes all causal arrows pointing to the cause variable [math]X[/math], including influences from other variables (both observable and unobservable).
This ensures that EI captures the intrinsic characteristics of the dynamics itself. The introduction of the do-operator makes EI distinct from other information metrics. The key difference is that EI is solely a function of the causal mechanism, which allows it to more precisely capture the essence of causality compared to other metrics like transfer entropy. However, this also means that EI requires knowledge of or access to the causal mechanism, which may be challenging if only observational data is available.
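To illustrate how the do-operator separates the data from the dynamics, the sketch below compares the mutual information computed under a hypothetical observational input distribution with the EI obtained after replacing that distribution with a uniform one. It uses the degenerate matrix [math]P_3[/math] from the examples. The helper names `mutual_information_bits` and `effective_information_bits`, and the choice of `observed_px`, are assumptions made for illustration only.

```python
import numpy as np

def mutual_information_bits(px, P):
    """I(X;Y) in bits for input distribution px and mechanism P[i, j] = Pr(Y=j | X=i)."""
    px = np.asarray(px, dtype=float)
    P = np.asarray(P, dtype=float)
    joint = px[:, None] * P             # Pr(X=i, Y=j)
    py = joint.sum(axis=0)              # Pr(Y=j)
    indep = px[:, None] * py[None, :]   # Pr(X=i) * Pr(Y=j)
    mask = joint > 0                    # skip zero-probability terms
    return float(np.sum(joint[mask] * np.log2(joint[mask] / indep[mask])))

def effective_information_bits(P):
    """EI: mutual information after do(X ~ U), i.e. with a uniform input."""
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    return mutual_information_bits(np.full(n, 1.0 / n), P)

P3 = np.array([[0, 0, 1, 0], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]], dtype=float)

# Hypothetical observational data in which the first state is never visited.
observed_px = np.array([0.0, 0.5, 0.25, 0.25])

mi_observed = mutual_information_bits(observed_px, P3)
ei = effective_information_bits(P3)
print("MI under observed inputs:", round(mi_observed, 2))
print("EI under do(X ~ U):", round(ei, 2))
```

Under these assumed observations every visited state maps to the same next state, so the observational mutual information vanishes, whereas the EI depends only on the matrix and stays at about 0.81 bits. Changing `observed_px` changes the first quantity but not the second.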