# AIXI model

AIXI is a theoretical mathematical formalism for artificial general intelligence. It combines Solomonoff induction with sequential decision theory. AIXI was first proposed by Marcus Hutter in 2000 and several results regarding AIXI are proved in Hutter's 2005 book Universal Artificial Intelligence.

AIXI is a reinforcement learning agent. It maximizes the expected total rewards received from the environment. Intuitively, it simultaneously considers every computable hypothesis (or environment). In each time step, it looks at every possible program and evaluates how many rewards that program generates depending on the next action taken. The promised rewards are then weighted by the subjective belief that this program constitutes the true environment. This belief is computed from the length of the program: longer programs are considered less likely, in line with Occam's razor. AIXI then selects the action that has the highest expected total reward in the weighted sum of all these programs.
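As a toy illustration of this weighting scheme, the sketch below mixes the reward predictions of a few candidate programs using a 2^(-length) prior. The bit-strings and predicted rewards are made up for illustration; the real AIXI ranges over all programs of a universal Turing machine.

```python
# Sketch: weight candidate environment programs by 2^(-length),
# in the spirit of the Occam prior described above.

def prior_weight(program: str) -> float:
    """Shorter programs get exponentially more weight (Occam's razor)."""
    return 2.0 ** -len(program)

def expected_reward(candidates: dict[str, float]) -> float:
    """Mix each program's predicted reward by its normalized prior weight."""
    total = sum(prior_weight(p) for p in candidates)
    return sum(prior_weight(p) * r for p, r in candidates.items()) / total

# Toy hypotheses: program bit-string -> reward it predicts for some fixed action.
hypotheses = {"01": 1.0, "0110": 0.0, "01101110": 0.5}
print(round(expected_reward(hypotheses), 3))  # → 0.796
```

The short program predicting reward 1.0 dominates the mixture, which is exactly the Occam bias: mostly trust the simplest explanations consistent with the data.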

# = Definition =

AIXI is a reinforcement learning agent that interacts with some stochastic and unknown but computable environment $\displaystyle{ \mu }$. The interaction proceeds in time steps, from $\displaystyle{ t=1 }$ to $\displaystyle{ t=m }$, where $\displaystyle{ m \in \mathbb{N} }$ is the lifespan of the AIXI agent. At time step t, the agent chooses an action $\displaystyle{ a_t \in \mathcal{A} }$ (e.g. a limb movement) and executes it in the environment, and the environment responds with a "percept" $\displaystyle{ e_t \in \mathcal{E} = \mathcal{O} \times \mathbb{R} }$, which consists of an "observation" $\displaystyle{ o_t \in \mathcal{O} }$ (e.g., a camera image) and a reward $\displaystyle{ r_t \in \mathbb{R} }$, distributed according to the conditional probability $\displaystyle{ \mu(o_t r_t | a_1 o_1 r_1 ... a_{t-1} o_{t-1} r_{t-1} a_t) }$, where $\displaystyle{ a_1 o_1 r_1 ... a_{t-1} o_{t-1} r_{t-1} a_t }$ is the "history" of actions, observations and rewards. The environment $\displaystyle{ \mu }$ is thus mathematically represented as a probability distribution over "percepts" (observations and rewards) which depend on the full history, so there is no Markov assumption (as opposed to other RL algorithms). Note again that this probability distribution is unknown to the AIXI agent. Furthermore, note again that $\displaystyle{ \mu }$ is computable, that is, the observations and rewards received by the agent from the environment $\displaystyle{ \mu }$ can be computed by some program (which runs on a Turing machine), given the past actions of the AIXI agent.
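The interaction protocol just described can be sketched as a simple loop. Both functions below are hypothetical toy stand-ins: the true environment μ is unknown to the agent, and the actual AIXI policy is incomputable.

```python
import random

def environment(history, action):
    """Toy stochastic but computable stand-in for mu: history + action -> percept."""
    observation = 1 if random.random() < 0.5 else 0   # o_t
    reward = 1.0 if action == 1 else 0.0              # r_t
    return (observation, reward)                      # percept e_t = (o_t, r_t)

def policy(history):
    """Toy stand-in for the agent's policy pi."""
    return 1  # always pick action 1

m = 5                  # lifespan of the agent
history = []           # sequence of (a_t, e_t) pairs
total_reward = 0.0
for t in range(1, m + 1):
    a_t = policy(history)              # agent acts on the full history
    e_t = environment(history, a_t)    # environment answers with a percept
    history.append((a_t, e_t))
    total_reward += e_t[1]
print(total_reward)  # → 5.0 (this toy policy earns reward 1.0 every step)
```

Note that both functions receive the full history, reflecting the absence of a Markov assumption.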

The only goal of the AIXI agent is to maximise $\displaystyle{ \sum_{t=1}^m r_t }$, that is, the sum of rewards from time step 1 to m.

The AIXI agent is associated with a stochastic policy $\displaystyle{ \pi : (\mathcal{A} \times \mathcal{E})^* \rightarrow \mathcal{A} }$, which is the function it uses to choose actions at every time step, where $\displaystyle{ \mathcal{A} }$ is the space of all possible actions that AIXI can take and $\displaystyle{ \mathcal{E} }$ is the space of all possible "percepts" that can be produced by the environment. The environment (or probability distribution) $\displaystyle{ \mu }$ can also be thought of as a stochastic policy (which is a function): $\displaystyle{ \mu : (\mathcal{A} \times \mathcal{E})^* \times \mathcal{A} \rightarrow \mathcal{E} }$, where the $\displaystyle{ * }$ is the Kleene star operation.

In general, at time step $\displaystyle{ t }$ (which ranges from 1 to m), AIXI, having previously executed actions $\displaystyle{ a_1\dots a_{t-1} }$ (which is often abbreviated in the literature as $\displaystyle{ a_{\lt t} }$) and having observed the history of percepts $\displaystyle{ o_1 r_1 ... o_{t-1} r_{t-1} }$ (which can be abbreviated as $\displaystyle{ e_{\lt t} }$), chooses and executes in the environment the action, $\displaystyle{ a_t }$, defined as follows 

$\displaystyle{ a_t := \arg \max_{a_t} \sum_{o_t r_t} \ldots \max_{a_m} \sum_{o_m r_m} [r_t + \ldots + r_m] \sum_{q:\; U(q, a_1 \ldots a_m) = o_1 r_1 \ldots o_m r_m} 2^{-\textrm{length}(q)} }$

or, with parentheses added to disambiguate precedence:

$\displaystyle{ a_t := \arg \max_{a_t} \left( \sum_{o_t r_t} \ldots \left( \max_{a_m} \sum_{o_m r_m} [r_t + \ldots + r_m] \left( \sum_{q:\; U(q, a_1 \ldots a_m) = o_1 r_1 \ldots o_m r_m} 2^{-\textrm{length}(q)} \right) \right) \right) }$

Intuitively, in the definition above, AIXI considers the sum of the total reward over all possible "futures" up to $\displaystyle{ m - t }$ time steps ahead (that is, from $\displaystyle{ t }$ to $\displaystyle{ m }$), weighs each of them by the complexity of programs $\displaystyle{ q }$ (that is, by $\displaystyle{ 2^{-\textrm{length}(q)} }$) consistent with the agent's past (that is, the previously executed actions, $\displaystyle{ a_{\lt t} }$, and received percepts, $\displaystyle{ e_{\lt t} }$) that can generate that future, and then picks the action that maximises expected future rewards.
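The alternating max-over-actions / sum-over-percepts structure (an expectimax recursion) can be sketched on a tiny toy model. Here `env_prob` is a hypothetical stand-in for the incomputable Solomonoff mixture; everything else mirrors the nesting in the definition above.

```python
# Expectimax sketch: max over actions, expectation over percepts, out to horizon m.
ACTIONS = [0, 1]
PERCEPTS = [(0, 0.0), (1, 1.0)]  # (observation, reward) pairs

def env_prob(history, action, percept):
    """Toy stand-in for the environment mixture: action 1 is more rewarding."""
    if action == 1:
        return 0.8 if percept == (1, 1.0) else 0.2
    return 0.5

def value(history, t, m):
    """Best achievable expected sum of rewards from time step t to m."""
    if t > m:
        return 0.0
    return max(
        sum(env_prob(history, a, e) * (e[1] + value(history + [(a, e)], t + 1, m))
            for e in PERCEPTS)
        for a in ACTIONS
    )

print(round(value([], 1, 3), 3))  # → 2.4: action 1 each step, 0.8 expected reward per step
```

The real definition replaces `env_prob` with the program-length-weighted mixture over all programs consistent with the history, which is what makes AIXI incomputable.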

Let us break this definition down in order to attempt to fully understand it.

$\displaystyle{ o_t r_t }$ is the "percept" (which consists of the observation $\displaystyle{ o_t }$ and reward $\displaystyle{ r_t }$) received by the AIXI agent at time step $\displaystyle{ t }$ from the environment (which is unknown and stochastic). Similarly, $\displaystyle{ o_m r_m }$ is the percept received by AIXI at time step $\displaystyle{ m }$ (the last time step where AIXI is active).

$\displaystyle{ r_t + \ldots + r_m }$ is the sum of rewards from time step $\displaystyle{ t }$ to time step $\displaystyle{ m }$, so AIXI needs to look into the future to choose its action at time step $\displaystyle{ t }$.

$\displaystyle{ U }$ denotes a monotone universal Turing machine, and $\displaystyle{ q }$ ranges over all (deterministic) programs on the universal machine $\displaystyle{ U }$, which receives as input the program $\displaystyle{ q }$ and the sequence of actions $\displaystyle{ a_1\dots a_m }$ (that is, all actions), and produces the sequence of percepts $\displaystyle{ o_1 r_1 \ldots o_m r_m }$. The universal Turing machine $\displaystyle{ U }$ is thus used to "simulate" or compute the environment responses or percepts, given the program $\displaystyle{ q }$ (which "models" the environment) and all actions of the AIXI agent: in this sense, the environment is "computable" (as stated above). Note that, in general, the program which "models" the current and actual environment (where AIXI needs to act) is unknown because the current environment is also unknown.

$\displaystyle{ \textrm{length}(q) }$ is the length of the program $\displaystyle{ q }$ (which is encoded as a string of bits). Note that $\displaystyle{ 2^{-\textrm{length}(q)} = \frac{1}{2^{\textrm{length}(q)}} }$. Hence, in the definition above, $\displaystyle{ \sum_{q:\; U(q, a_1 \ldots a_m) = o_1 r_1 \ldots o_m r_m} 2^{-\textrm{length}(q)} }$ should be interpreted as a mixture (in this case, a sum) over all computable environments (which are consistent with the agent's past), each weighted by its complexity $\displaystyle{ 2^{-\textrm{length}(q)} }$. Note that $\displaystyle{ a_1 \ldots a_m }$ can also be written as $\displaystyle{ a_1 \ldots a_{t-1}a_t \ldots a_m }$, and $\displaystyle{ a_1 \ldots a_{t-1} = a_{\lt t} }$ is the sequence of actions already executed in the environment by the AIXI agent. Similarly, $\displaystyle{ o_1 r_1 \ldots o_m r_m = o_1 r_1 \ldots o_{t-1} r_{t-1}o_{t} r_{t} \ldots o_m r_m }$, and $\displaystyle{ o_1 r_1 \ldots o_{t-1} r_{t-1} }$ is the sequence of percepts produced by the environment so far.
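This inner sum is essentially Solomonoff's universal prior applied to environments. As a sketch consistent with the definition above (Hutter's precise treatment uses semimeasures and handles normalization more carefully), the induced mixture environment, denoted by the Greek letter xi, assigns to each percept sequence, given the actions, the weight

```latex
\xi(o_1 r_1 \ldots o_m r_m \mid a_1 \ldots a_m)
  \;=\; \sum_{q:\; U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\textrm{length}(q)}
```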

Let us now put all these components together in order to understand this equation or definition.

At time step t, AIXI chooses the action $\displaystyle{ a_t }$ where the function $\displaystyle{ \sum_{o_t r_t} \ldots \max_{a_m} \sum_{o_m r_m} [r_t + \ldots + r_m] \sum_{q:\; U(q, a_1 \ldots a_m) = o_1 r_1 \ldots o_m r_m} 2^{-\textrm{length}(q)} }$ attains its maximum.

# = = Parameters = =

The parameters to AIXI are the universal Turing machine U and the agent's lifetime m, which need to be chosen. The latter parameter can be removed by the use of discounting.
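As a sketch of the discounting idea (following the general discounted setting Hutter describes, rather than any single fixed formula): replace the finite-horizon objective, the sum of rewards from step 1 to m, with a discounted infinite-horizon sum,

```latex
\sum_{t=1}^{\infty} \gamma_t \, r_t ,
\qquad \text{with a summable discount sequence} \quad \sum_{t=1}^{\infty} \gamma_t < \infty ,
```

so that the objective is finite for bounded rewards and no lifespan m has to be fixed in advance; geometric discounting, with gamma_t proportional to gamma^t for some 0 < gamma < 1, is the common special case.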

# = Meaning of the word AIXI =

According to Hutter, the word "AIXI" can have several interpretations. AIXI can stand for AI based on Solomonoff's distribution, denoted by $\displaystyle{ \xi }$ (which is the Greek letter xi), or e.g. it can stand for AI "crossed" (X) with induction (I). There are other interpretations.

# = Optimality =

AIXI's performance is measured by the expected total reward it receives. AIXI has been proven to be optimal in the following ways.

• Pareto optimality: there is no other agent that performs at least as well as AIXI in all environments while performing strictly better in at least one environment.
• Balanced Pareto optimality: Like Pareto optimality, but considering a weighted sum of environments.
• Self-optimizing: a policy p is called self-optimizing for an environment $\displaystyle{ \mu }$ if the performance of p approaches the theoretical maximum for $\displaystyle{ \mu }$ when the length of the agent's lifetime (not time) goes to infinity. For environment classes where self-optimizing policies exist, AIXI is self-optimizing.

It was later shown by Hutter and Jan Leike that balanced Pareto optimality is subjective and that any policy can be considered Pareto optimal, which they describe as undermining all previous optimality claims for AIXI.

However, AIXI does have limitations. It is restricted to maximizing rewards based on percepts as opposed to external states. It also assumes it interacts with the environment solely through action and percept channels, preventing it from considering the possibility of being damaged or modified. Colloquially, this means that it doesn't consider itself to be contained by the environment it interacts with. It also assumes the environment is computable.

# = Computational aspects =

Like Solomonoff induction, AIXI is incomputable. However, there are computable approximations of it. One such approximation is AIXItl, which performs at least as well as the provably best time t and space l limited agent. Another approximation to AIXI with a restricted environment class is MC-AIXI (FAC-CTW) (which stands for Monte Carlo AIXI FAC-Context-Tree Weighting), which has had some success playing simple games such as partially observable Pac-Man.
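The Monte Carlo flavour of these approximations can be sketched as follows. The sampling model here is a hypothetical toy stub, not the context-tree-weighting mixture, and the uniform rollout policy stands in for the UCT-style tree search that MC-AIXI actually uses.

```python
import random

def sample_percept(history, action):
    """Hypothetical toy stub standing in for a learned environment model."""
    p_good = 0.7 if action == 1 else 0.3
    return (1, 1.0) if random.random() < p_good else (0, 0.0)

def mc_value(history, action, horizon, rollouts=1000):
    """Estimate an action's value by averaging total reward over sampled futures."""
    total = 0.0
    for _ in range(rollouts):
        h, a = list(history), action
        for _ in range(horizon):
            o, r = sample_percept(h, a)
            total += r
            h.append((a, (o, r)))
            a = random.choice([0, 1])  # uniform rollout policy (no UCT here)
    return total / rollouts

random.seed(0)
best_action = max([0, 1], key=lambda a: mc_value([], a, horizon=5))
print(best_action)  # → 1 (sampling reveals action 1's higher expected reward)
```

Sampling rollouts sidesteps the exhaustive enumeration over programs in the exact definition, trading optimality for tractability.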
