强化学习

此词条暂由Henry翻译

本词条已有Ricky审校。

--Ricky：这一词条的行文非常糟糕，但其对应的维基百科原文就是这么糟糕吧。。。强烈希望有第二个审校再审一遍，最好是有点强化学习的背景知识的伙伴。可以考虑用scholarpedia的相应词条作为原文本。

模板:Machine learning bar

Reinforcement learning (RL) is an area of machine learning concerned with how software agents ought to take actions in an environment in order to maximize the notion of cumulative reward. Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning.

Reinforcement learning (RL) is an area of machine learning concerned with how software agents ought to take actions in an environment in order to maximize the notion of cumulative reward. Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning.

强化学习Reinforcement learning是机器学习的一个领域，它关注的是软件主体（software agent）应该在一个环境中采取何种行动，才能最大化总体收益。强化学习，有监督学习和无监督学习是机器学习的三种基本范式。

Reinforcement learning differs from supervised learning in not needing labelled input/output pairs be presented, and in not needing sub-optimal actions to be explicitly corrected. Instead the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).^[1]

Reinforcement learning differs from supervised learning in not needing labelled input/output pairs be presented, and in not needing sub-optimal actions to be explicitly corrected. Instead the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).

强化学习与有监督学习的不同之处在于，前者不需要标记输入、输出对，也不需要对次优操作进行显式修正。相反，强化学习的重点是在探索(未知领域)和利用(现有知识)之间找到平衡。^[1]

The environment is typically stated in the form of a Markov decision process (MDP), because many reinforcement learning algorithms for this context utilize dynamic programming techniques.^[2] The main difference between the classical dynamic programming methods and reinforcement learning algorithms is that the latter do not assume knowledge of an exact mathematical model of the MDP and they target large MDPs where exact methods become infeasible.

The environment is typically stated in the form of a Markov decision process (MDP), because many reinforcement learning algorithms for this context utilize dynamic programming techniques. The main difference between the classical dynamic programming methods and reinforcement learning algorithms is that the latter do not assume knowledge of an exact mathematical model of the MDP and they target large MDPs where exact methods become infeasible.

主体所在的环境通常是以马可夫决策过程 Markov decision process(MDP)的形式表示的，因为许多在此情景下的强化学习算法都使用了动态规划方法。经典的动态规划方法和强化学习算法间主要区别在于后者并不假设精确的MDP数学模型的知识，而是关注于那些精确方法不可行的大型MDP场景。

介绍

文件:Reinforcement learning diagram.svg

The typical framing of a Reinforcement Learning (RL) scenario: an agent takes actions in an environment, which is interpreted into a reward and a representation of the state, which are fed back into the agent.

The typical framing of a Reinforcement Learning (RL) scenario: an agent takes actions in an environment, which is interpreted into a reward and a representation of the state, which are fed back into the agent.

强化学习的典型场景: 主体在环境中采取某项行动，这像行动会被程序翻译成收益和状态的表示，而收益和状态又会被反馈回主体。

Reinforcement learning, due to its generality, is studied in many other disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, statistics and genetic algorithms. In the operations research and control literature, reinforcement learning is called approximate dynamic programming, or neuro-dynamic programming. The problems of interest in reinforcement learning have also been studied in the theory of optimal control, which is concerned mostly with the existence and characterization of optimal solutions, and algorithms for their exact computation, and less with learning or approximation, particularly in the absence of a mathematical model of the environment. In economics and game theory, reinforcement learning may be used to explain how equilibrium may arise under bounded rationality.

Reinforcement learning, due to its generality, is studied in many other disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, statistics and genetic algorithms. In the operations research and control literature, reinforcement learning is called approximate dynamic programming, or neuro-dynamic programming. The problems of interest in reinforcement learning have also been studied in the theory of optimal control, which is concerned mostly with the existence and characterization of optimal solutions, and algorithms for their exact computation, and less with learning or approximation, particularly in the absence of a mathematical model of the environment. In economics and game theory, reinforcement learning may be used to explain how equilibrium may arise under bounded rationality.

由于它的普遍性，强化学习在许多其他学科中都有所应用，如博弈论 game theory、控制论 control theory、运筹学 operations research、信息论 information theory、基于仿真的优化、多主体系统、群体智能、统计学 statistics 和遗传算法等。在运筹学和控制论中，强化学习被称为近似的动态规划，或者神经动态规划。其中最优控制理论也研究了强化学习中一些热点问题。最优控制理论主要关注最优解的存在性、形式及其精确计算的算法，在缺乏环境数学模型的情况下的学习或近似的方法则关注较少。在经济学和博弈论中，强化学习可以用来解释均衡性是如何在有限理性下产生的。

Basic reinforcement is modeled as a Markov decision process:

基本的强化被模拟成一个马尔可夫决策过程 Markov decision process

a set of environment and agent states, $S$ ;
a set of actions, $A$ , of the agent;
[math]\displaystyle{ P_a(s,s')=\Pr(s_{t+1}=s'\mid s_t=s, a_t=a) }[/math] is the probability of transition (at time [math]\displaystyle{ t }[/math]) from state [math]\displaystyle{ s }[/math] to state [math]\displaystyle{ s' }[/math] under action [math]\displaystyle{ a }[/math].
[math]\displaystyle{ R_a(s,s') }[/math] is the immediate reward after transition from [math]\displaystyle{ s }[/math] to [math]\displaystyle{ s' }[/math] with action [math]\displaystyle{ a }[/math].
rules that describe what the agent observes

一组环境和主体集合 $S$ ;
一组主体的行为 $A$ ;
Pa(s,s′)=Pr(st+1=s′∣st=s,at=a)是在时间t和行为a下从s到 s′ 状态转变的可能性

[math]\displaystyle{ P_a(s,s')=\Pr(s_{t+1}=s'\mid s_t=s, a_t=a) }[/math]是在时间[math]\displaystyle{ t }[/math]时，主体采取行动[math]\displaystyle{ a }[/math]后，状态从[math]\displaystyle{ s }[/math]到[math]\displaystyle{ s' }[/math]转移的概率

[math]\displaystyle{ R_a(s,s') }[/math]是从s到s′过渡所得到的即时收益
描述主体要遵守的规则

Rules are often stochastic. The observation typically involves the scalar, immediate reward associated with the last transition. In many works, the agent is assumed to observe the current environmental state (full observability). If not, the agent has partial observability. Sometimes the set of actions available to the agent is restricted (a zero balance cannot be reduced. For example, if the current value of the agent is 3 and the state transition reduces the value by 4, the transition will not be allowed).

Rules are often stochastic. The observation typically involves the scalar, immediate reward associated with the last transition. In many works, the agent is assumed to observe the current environmental state (full observability). If not, the agent has partial observability. Sometimes the set of actions available to the agent is restricted (a zero balance cannot be reduced. For example, if the current value of the agent is 3 and the state transition reduces the value by 4, the transition will not be allowed).

强化学习模型中规则往往是带有随机性，观察通常会包括上一次状态转移带来的收益。在许多工作中，主体被假定能够观察当前的环境状态(完全可观察性)。如果不能，则称主体具有部分可观测性。有时，主体可用的操作集会受到限制(比如说，账户余额不能减到0以下。如果主体的当前值为3，而状态转换将该值减少4，这样的转换是不被允许的)。

A reinforcement learning agent interacts with its environment in discrete time steps. At each time $t$ , the agent receives an observation [math]\displaystyle{ o_t }[/math], which typically includes the reward [math]\displaystyle{ r_t }[/math]. It then chooses an action [math]\displaystyle{ a_t }[/math] from the set of available actions, which is subsequently sent to the environment. The environment moves to a new state [math]\displaystyle{ s_{t+1} }[/math] and the reward [math]\displaystyle{ r_{t+1} }[/math] associated with the transition [math]\displaystyle{ (s_t,a_t,s_{t+1}) }[/math] is determined. The goal of a reinforcement learning agent is to collect as much reward as possible. The agent can (possibly randomly) choose any action as a function of the history.

A reinforcement learning agent interacts with its environment in discrete time steps. At each time , the agent receives an observation [math]\displaystyle{ o_t }[/math], which typically includes the reward [math]\displaystyle{ r_t }[/math]. It then chooses an action [math]\displaystyle{ a_t }[/math] from the set of available actions, which is subsequently sent to the environment. The environment moves to a new state [math]\displaystyle{ s_{t+1} }[/math] and the reward [math]\displaystyle{ r_{t+1} }[/math] associated with the transition [math]\displaystyle{ (s_t,a_t,s_{t+1}) }[/math] is determined. The goal of a reinforcement learning agent is to collect as much reward as possible. The agent can (possibly randomly) choose any action as a function of the history.

强化学习中的主体以离散的时间步骤与环境进行交互。在每一个时间步 $t$ ，主体都会收到一个观察[math]\displaystyle{ o_t }[/math]，这通常包括收益[math]\displaystyle{ r_t }[/math]。然后，它从一组可用的操作中选择一个操作 [math]\displaystyle{ a_t }[/math]，该操作随后被发送到环境中。环境转移到一个新的状态数学 [math]\displaystyle{ s_{t+1} }[/math]，并确定与转换 [math]\displaystyle{ (s_t,a_t,s_{t+1}) }[/math] 相关的奖励 [math]\displaystyle{ r_{t+1} }[/math]。强化学习主体的目标是尽可能多地获取收益。主体可以(可能是随机的)根据历史选择任何操作。

When the agent's performance is compared to that of an agent that acts optimally, the difference in performance gives rise to the notion of regret. In order to act near optimally, the agent must reason about the long term consequences of its actions (i.e., maximize future income), although the immediate reward associated with this might be negative.

当某主体的表现与理论最优主体的表现相比较时，表现上的差值就是“后悔”。为了近乎最优地行动，主体必须推理其行动的长期后果(即，最大化未来收入) ，尽管与此相关的短期收益可能是负的。

Thus, reinforcement learning is particularly well-suited to problems that include a long-term versus short-term reward trade-off. It has been applied successfully to various problems, including robot control, elevator scheduling, telecommunications, backgammon, checkers and Go (AlphaGo).

因此，强化学习特别适合解决包括长期与短期收益均衡在内的问题。它已成功地应用于各种问题，包括机器人控制、电梯调度、通信、西洋双陆棋、跳棋和围棋（AlphaGo）。

Two elements make reinforcement learning powerful: the use of samples to optimize performance and the use of function approximation to deal with large environments. Thanks to these two key components, reinforcement learning can be used in large environments in the following situations:

使得强化学习功能强大的因素有两个: 使用样本来优化性能和使用函数逼近来处理大型环境。正因为有了这两个关键组件，强化学习可以在以下情况下在大型环境中使用:

A model of the environment is known, but an analytic solution is not available;
Only a simulation model of the environment is given (the subject of simulation-based optimization);^[3]
The only way to collect information about the environment is to interact with it.

一种环境的模型是已知的，但是无法得到其数学解析式；
只给出了环境的模拟模型；^[4]
收集环境信息的唯一方法就是去与环境交互；

The first two of these problems could be considered planning problems (since some form of model is available), while the last one could be considered to be a genuine learning problem. However, reinforcement learning converts both planning problems to machine learning problems.

前两个问题可以被认为是规划问题(因为某种形式的模型是可用的) ，而后一个问题可以被认为是真正的学习问题。然而，强化学习可以将前两个规划问题转化为机器学习问题。

探索

The exploration vs. exploitation trade-off has been most thoroughly studied through the multi-armed bandit problem and for finite state space MDPs in Burnetas and Katehakis (1997).^[5]

The exploration vs. exploitation trade-off has been most thoroughly studied through the multi-armed bandit problem and for finite state space MDPs in Burnetas and Katehakis (1997).

Burnetas 和 Katehakis (1997)^[6] 已经通过多臂老虎机问题非常彻底地研究了有限的状态空间MDPs中探索与利用的权衡。

Reinforcement learning requires clever exploration mechanisms; randomly selecting actions, without reference to an estimated probability distribution, shows poor performance. The case of (small) finite Markov decision processes is relatively well understood. However, due to the lack of algorithms that scale well with the number of states (or scale to problems with infinite state spaces), simple exploration methods are the most practical.

强化学习需要聪明的探索机制; 不参考概率分布来随机地选择行动，性能表现往往会很糟糕。有限马尔可夫决策过程的情况是相对较好理解的。然而，由于缺乏可以在状态的数量很大时依然高效运行的算法(或者说适用于无限状态空间问题的算法) ，多数情况下简单的探索方法是最实用的。

One such method is [math]\displaystyle{ \epsilon }[/math]-greedy, where [math]\displaystyle{ 0 \lt \epsilon \lt 1 }[/math] is a parameter controlling the amount of exploration vs. exploitation. With probability [math]\displaystyle{ 1-\epsilon }[/math], exploitation is chosen, and the agent chooses the action that it believes has the best long-term effect (ties between actions are broken uniformly at random). Alternatively, with probability [math]\displaystyle{ \epsilon }[/math], exploration is chosen, and the action is chosen uniformly at random. [math]\displaystyle{ \epsilon }[/math] is usually a fixed parameter but can be adjusted either according to a schedule (making the agent explore progressively less), or adaptively based on heuristics.^[7]

One such method is [math]\displaystyle{ \epsilon }[/math]-greedy, where [math]\displaystyle{ 0 \lt \epsilon \lt 1 }[/math] is a parameter controlling the amount of exploration vs. exploitation. With probability [math]\displaystyle{ 1-\epsilon }[/math], exploitation is chosen, and the agent chooses the action that it believes has the best long-term effect (ties between actions are broken uniformly at random). Alternatively, with probability [math]\displaystyle{ \epsilon }[/math], exploration is chosen, and the action is chosen uniformly at random. [math]\displaystyle{ \epsilon }[/math] is usually a fixed parameter but can be adjusted either according to a schedule (making the agent explore progressively less), or adaptively based on heuristics.

其中一个方法是ϵ-greedy，其中 0<ϵ<1 是一个控制探索与开发的数量的参数。主体有1−ϵ的概率选择“利用”，选择它认为具有最佳长期效果的行动(若几个行动的效果一样好，则均匀地随机选择)。主体还有ϵ的概率选择探索，并随机均匀地选择动作。ϵ通常是一个固定的参数，但可以根据时间进行调整(使主体逐步减少探索) ，或者根据一些启发式信息调整。

控制学习的算法

Even if the issue of exploration is disregarded and even if the state was observable (assumed hereafter), the problem remains to use past experience to find out which actions lead to higher cumulative rewards.

即使忽略探索与利用的问题，且状态是可观察的，最大的问题仍然是如何利用过去的经验来找出哪些行动可以带来更高的总体收益。

最优性的定义和标准

策略

The agent's action selection is modeled as a map called policy:

The agent's action selection is modeled as a map called policy:

主体的行动选择可以被建模为一个称为策略（policy）的映射:

[math]\displaystyle{ \pi: A \times S \rightarrow [0,1] }[/math]

[math]\displaystyle{ \pi(a,s) = \Pr(a_t = a\mid s_t =s) }[/math]

The policy map gives the probability of taking action [math]\displaystyle{ a }[/math] when in state [math]\displaystyle{ s }[/math].^[8]^:61 There are also non-probabilistic policies.

The policy map gives the probability of taking action [math]\displaystyle{ a }[/math] when in state [math]\displaystyle{ s }[/math]. There are also non-probabilistic policies.

策略映射给出了在状态[math]\displaystyle{ s }[/math]下采取行动[math]\displaystyle{ a }[/math]的概率。^[8]^:61此外，也存在一些非概率化的策略。

“状态-价值” 函数

Value function [math]\displaystyle{ V_\pi(s) }[/math] is defined as the expected return starting with state [math]\displaystyle{ s }[/math], i.e. [math]\displaystyle{ s_0 = s }[/math], and successively following policy [math]\displaystyle{ \pi }[/math]. Hence, roughly speaking, the value function estimates "how good" it is to be in a given state.^[8]^:60

Value function [math]\displaystyle{ V_\pi(s) }[/math] is defined as the expected return starting with state [math]\displaystyle{ s }[/math], i.e. [math]\displaystyle{ s_0 = s }[/math], and successively following policy [math]\displaystyle{ \pi }[/math]. Hence, roughly speaking, the value function estimates "how good" it is to be in a given state.

价值函数[math]\displaystyle{ V_\pi(s) }[/math]定义为从状态[math]\displaystyle{ s }[/math]开始的期望收益，即[math]\displaystyle{ s_0 = s }[/math]，后续的策略为[math]\displaystyle{ \pi }[/math]。因此，粗略地说，价值函数估计一个给定状态有多“好”。

[math]\displaystyle{ V_\pi(s) = \operatorname E[R] = \operatorname E\left[\sum_{t=0}^\infty \gamma^t r_t\mid s_0 = s\right], }[/math]

where the random variable [math]\displaystyle{ R }[/math] denotes the return, and is defined as the sum of future discounted rewards模板:Clarify

where the random variable [math]\displaystyle{ R }[/math] denotes the return, and is defined as the sum of future discounted rewards

其中随机变量[math]\displaystyle{ R }[/math]表示收益，而且是未来总体收益的折现。

[math]\displaystyle{ R=\sum_{t=0}^\infty \gamma^t r_t, }[/math]

where [math]\displaystyle{ r_t }[/math] is the reward at step [math]\displaystyle{ t }[/math], [math]\displaystyle{ \gamma \in [0,1) }[/math] is the discount-rate.

[math]\displaystyle{ r_t }[/math]是在时间步[math]\displaystyle{ t }[/math]的收益，[math]\displaystyle{ \gamma \in [0,1) }[/math] 是折现率。

The algorithm must find a policy with maximum expected return. From the theory of MDPs it is known that, without loss of generality, the search can be restricted to the set of so-called stationary policies. A policy is stationary if the action-distribution returned by it depends only on the last state visited (from the observation agent's history). The search can be further restricted to deterministic stationary policies. A deterministic stationary policy deterministically selects actions based on the current state. Since any such policy can be identified with a mapping from the set of states to the set of actions, these policies can be identified with such mappings with no loss of generality.

The algorithm must find a policy with maximum expected return. From the theory of MDPs it is known that, without loss of generality, the search can be restricted to the set of so-called stationary policies. A policy is stationary if the action-distribution returned by it depends only on the last state visited (from the observation agent's history). The search can be further restricted to deterministic stationary policies. A deterministic stationary policy deterministically selects actions based on the current state. Since any such policy can be identified with a mapping from the set of states to the set of actions, these policies can be identified with such mappings with no loss of generality.

算法必须找到一个期望收益最大的策略。根据 MDPs 理论，我们知道，搜索可以不失一般性地被限制在所谓的平稳政策集合中。如果策略返回的行动分布仅取决于最后访问的状态(主体的历史记录) ，则策略是平稳的。这种搜索可以进一步限制为确定性平稳策略。确定性平稳策略根据当前状态确定地选择操作。不损失一般性地说，任何这样的策略都可以通过从状态集到行动集的映射来标识。

蛮力方法

The brute force approach entails two steps:

蛮力方法包括两个步骤:

For each possible policy, sample returns while following it
Choose the policy with the largest expected return

对于每一个可能的策略，记录下每次使用该策略的收益
选择具有最大期望收益的政策

One problem with this is that the number of policies can be large, or even infinite. Another is that variance of the returns may be large, which requires many samples to accurately estimate the return of each policy.

这样做的一个问题是，策略的数量可能很大，甚至可能是无限的。另一个原因是收益的方差可能很大，这需要很多次采样来准确地估计每个策略的收益。

These problems can be ameliorated if we assume some structure and allow samples generated from one policy to influence the estimates made for others. The two main approaches for achieving this are value function estimation and direct policy search.

如果我们假设某些结构，并允许从一个策略中生成的样本影响为其他策略所做的估计，那么这些问题可以得到改善。实现这一目标的两种主要方法是价值函数估计和直接策略搜索。

价值函数

Value function approaches attempt to find a policy that maximizes the return by maintaining a set of estimates of expected returns for some policy (usually either the "current" [on-policy] or the optimal [off-policy] one).

价值函数 Value function方法通过维护一组对某种策略的预期收益的估计值，来找到一个使收益最大化的策略。

These methods rely on the theory of MDPs, where optimality is defined in a sense that is stronger than the above one: A policy is called optimal if it achieves the best expected return from any initial state (i.e., initial distributions play no role in this definition). Again, an optimal policy can always be found amongst stationary policies.

这些方法依赖于 MDPs 理论，其中最优性的定义强于上述各种定义: 如果一个策略从任何初始状态都能获得最佳预期收益(即，初始分配在这个定义中没有任何作用) ，则称其为最优策略。同样，在平稳策略中总能找到最优策略。

To define optimality in a formal manner, define the value of a policy [math]\displaystyle{ \pi }[/math] by

要以形式化地定义最优性，可通过以下方式定义策略[math]\displaystyle{ \pi }[/math]的值

[math]\displaystyle{ V^{\pi} (s) = E[R\mid s,\pi], }[/math]

where [math]\displaystyle{ R }[/math] stands for the return associated with following [math]\displaystyle{ \pi }[/math] from the initial state [math]\displaystyle{ s }[/math]. Defining [math]\displaystyle{ V^*(s) }[/math] as the maximum possible value of [math]\displaystyle{ V^\pi(s) }[/math], where [math]\displaystyle{ \pi }[/math] is allowed to change,

其中[math]\displaystyle{ R }[/math]表示初始状态[math]\displaystyle{ s }[/math]与策略[math]\displaystyle{ \pi }[/math]相关的收益。[math]\displaystyle{ V^*(s) }[/math]则被定义为 [math]\displaystyle{ V^\pi(s) }[/math]的最大可能值，其中[math]\displaystyle{ \pi }[/math]允许更改,

[math]\displaystyle{ V^*(s) = \max_\pi V^\pi(s). }[/math]

A policy that achieves these optimal values in each state is called optimal. Clearly, a policy that is optimal in this strong sense is also optimal in the sense that it maximizes the expected return [math]\displaystyle{ \rho^\pi }[/math], since [math]\displaystyle{ \rho^\pi = E[ V^\pi(S) ] }[/math], where [math]\displaystyle{ S }[/math] is a state randomly sampled from the distribution [math]\displaystyle{ \mu }[/math]模板:Clarify.

A policy that achieves these optimal values in each state is called optimal. Clearly, a policy that is optimal in this strong sense is also optimal in the sense that it maximizes the expected return [math]\displaystyle{ \rho^\pi }[/math], since [math]\displaystyle{ \rho^\pi = E[ V^\pi(S) ] }[/math], where [math]\displaystyle{ S }[/math] is a state randomly sampled from the distribution [math]\displaystyle{ \mu }[/math].

在每个状态下都能达到这些最优值的策略称为最优策略。显然，一个策略如果在这种强定义下是最优，那么它一定也最大化了预期收益率[math]\displaystyle{ \rho^\pi }[/math]，因为 [math]\displaystyle{ \rho^\pi = E[ V^\pi(S) ] }[/math]，其中[math]\displaystyle{ S }[/math]是从分布 [math]\displaystyle{ \mu }[/math]随机抽样得到的状态。

Although state-values suffice to define optimality, it is useful to define action-values. Given a state [math]\displaystyle{ s }[/math], an action [math]\displaystyle{ a }[/math] and a policy [math]\displaystyle{ \pi }[/math], the action-value of the pair [math]\displaystyle{ (s,a) }[/math] under [math]\displaystyle{ \pi }[/math] is defined by

尽管“状态-值”足以定义最优性，但定义“行动-值”还是是有用的。给定一个状态[math]\displaystyle{ s }[/math]、一个行动[math]\displaystyle{ a }[/math]和一个策略[math]\displaystyle{ \pi }[/math]，元组(s，a)的“行动-值”的定义是

[math]\displaystyle{ Q^\pi(s,a) = \operatorname E[R\mid s,a,\pi],\, }[/math]

where [math]\displaystyle{ R }[/math] now stands for the random return associated with first taking action [math]\displaystyle{ a }[/math] in state [math]\displaystyle{ s }[/math] and following [math]\displaystyle{ \pi }[/math], thereafter.

其中[math]\displaystyle{ R }[/math]现在代表的是随机收益，这个收益与第一次在状态[math]\displaystyle{ s }[/math]中采取行动的[math]\displaystyle{ a }[/math]和之后的策略[math]\displaystyle{ \pi }[/math]相关。

The theory of MDPs states that if [math]\displaystyle{ \pi^* }[/math] is an optimal policy, we act optimally (take the optimal action) by choosing the action from [math]\displaystyle{ Q^{\pi^*}(s,\cdot) }[/math] with the highest value at each state, [math]\displaystyle{ s }[/math]. The action-value function of such an optimal policy ([math]\displaystyle{ Q^{\pi^*} }[/math]) is called the optimal action-value function and is commonly denoted by [math]\displaystyle{ Q^* }[/math]. In summary, the knowledge of the optimal action-value function alone suffices to know how to act optimally.

The theory of MDPs states that if [math]\displaystyle{ \pi^* }[/math] is an optimal policy, we act optimally (take the optimal action) by choosing the action from [math]\displaystyle{ Q^{\pi^*}(s,\cdot) }[/math] with the highest value at each state, [math]\displaystyle{ s }[/math]. The action-value function of such an optimal policy ([math]\displaystyle{ Q^{\pi^*} }[/math]) is called the optimal action-value function and is commonly denoted by [math]\displaystyle{ Q^* }[/math]. In summary, the knowledge of the optimal action-value function alone suffices to know how to act optimally.

MDPs 理论指出，如果[math]\displaystyle{ \pi^* }[/math]是一个最优策略，那么我们通过从 [math]\displaystyle{ Q^{\pi^*}(s,\cdot) }[/math] 中选择每个状态[math]\displaystyle{ s }[/math]下的最大值来优化操作(采取最优操作)。这种最优策略的操作值函数（[math]\displaystyle{ Q^{\pi^*} }[/math]）称为最优操作值函数，通常由[math]\displaystyle{ Q^* }[/math]表示。总之，仅仅知道最优行动值函数就足以知道如何进行最优行动。

Assuming full knowledge of the MDP, the two basic approaches to compute the optimal action-value function are value iteration and policy iteration. Both algorithms compute a sequence of functions [math]\displaystyle{ Q_k }[/math] ([math]\displaystyle{ k=0,1,2,\ldots }[/math]) that converge to [math]\displaystyle{ Q^* }[/math]. Computing these functions involves computing expectations over the whole state-space, which is impractical for all but the smallest (finite) MDPs. In reinforcement learning methods, expectations are approximated by averaging over samples and using function approximation techniques to cope with the need to represent value functions over large state-action spaces.

在充分了解 MDP 的前提下，计算最优动作值函数的两种基本方法是值迭代法和策略迭代法。两种算法都计算一个函数序列[math]\displaystyle{ Q_k }[/math] ([math]\displaystyle{ k=0,1,2,\ldots }[/math])，这个序列收敛于[math]\displaystyle{ Q^* }[/math] 。计算这些函数需要计算整个状态空间上的期望值，这对于除了最小(有限) MDP 之外的所有 MDP 都是不切实际的。在强化学习方法中，期望值是近似的，通过对样本进行平均，并使用函数逼近技术来满足在大型状态动作空间上表示值函数的需要。

蒙特卡罗方法

Monte Carlo methods can be used in an algorithm that mimics policy iteration. Policy iteration consists of two steps: policy evaluation and policy improvement.

Monte Carlo methods can be used in an algorithm that mimics policy iteration. Policy iteration consists of two steps: policy evaluation and policy improvement.

蒙特卡罗方法 Monte Carlo methods可用于模拟策略迭代的算法。策略迭代包括两个步骤: 策略评估和策略改进。

Monte Carlo is used in the policy evaluation step. In this step, given a stationary, deterministic policy [math]\displaystyle{ \pi }[/math], the goal is to compute the function values [math]\displaystyle{ Q^\pi(s,a) }[/math] (or a good approximation to them) for all state-action pairs [math]\displaystyle{ (s,a) }[/math]. Assuming (for simplicity) that the MDP is finite, that sufficient memory is available to accommodate the action-values and that the problem is episodic and after each episode a new one starts from some random initial state. Then, the estimate of the value of a given state-action pair [math]\displaystyle{ (s,a) }[/math] can be computed by averaging the sampled returns that originated from [math]\displaystyle{ (s,a) }[/math] over time. Given sufficient time, this procedure can thus construct a precise estimate [math]\displaystyle{ Q }[/math] of the action-value function [math]\displaystyle{ Q^\pi }[/math]. This finishes the description of the policy evaluation step.

蒙特卡罗方法被用在了策略评估步骤。在这个步骤中，给定一个平稳的、确定性的策略[math]\displaystyle{ \pi }[/math]，目标是计算所有状态动作对[math]\displaystyle{ (s,a) }[/math]的函数值[math]\displaystyle{ Q^\pi(s,a) }[/math](或者一个很好的近似值)。(为简单起见)假设 MDP 是有限的，且计算机有足够的内存来容纳动作值，问题是分阶段的的，每阶段之后一个新阶段从某个随机的初始状态开始。然后，通过对来自[math]\displaystyle{ (s,a) }[/math]随时间变化的抽样收益进行平均，可以计算出给定状态动作对[math]\displaystyle{ (s,a) }[/math]的值。给定足够的时间，这个过程可以构造出动作值函数 [math]\displaystyle{ Q^\pi }[/math]的一个精确的估计[math]\displaystyle{ Q }[/math]。

In the policy improvement step, the next policy is obtained by computing a greedy policy with respect to [math]\displaystyle{ Q }[/math]: Given a state [math]\displaystyle{ s }[/math], this new policy returns an action that maximizes [math]\displaystyle{ Q(s,\cdot) }[/math]. In practice lazy evaluation can defer the computation of the maximizing actions to when they are needed.

在策略改进步骤中，新的策略是通过计算关于[math]\displaystyle{ Q }[/math]的贪婪策略获得的: 给定一个状态[math]\displaystyle{ s }[/math]，这个新策略返回一个最大化[math]\displaystyle{ Q(s,\cdot) }[/math]的行动。实际上，惰懒计算可以将最大化行动的计算推迟到需要使用到它们的时候。

Problems with this procedure include:

这一方法的问题包括:

The procedure may spend too much time evaluating a suboptimal policy.
It uses samples inefficiently in that a long trajectory improves the estimate only of the single state-action pair that started the trajectory.
When the returns along the trajectories have high variance, convergence is slow.
It works in episodic problems only;
It works in small, finite MDPs only.

程序可能会花过多的时间在评估次优策略上
它使用样本的效率很低，因为长过程只会提高对启动轨迹的那一个“状态-行动“元组的估计
当过程的收益具有“高方差”时，收敛速度较慢
它只适用于分阶段式问题
它只适用于小的，有限的马尔可夫决策。

时序差分方法

The first problem is corrected by allowing the procedure to change the policy (at some or all states) before the values settle. This too may be problematic as it might prevent convergence. Most current algorithms do this, giving rise to the class of generalized policy iteration algorithms. Many actor critic methods belong to this category.

The first problem is corrected by allowing the procedure to change the policy (at some or all states) before the values settle. This too may be problematic as it might prevent convergence. Most current algorithms do this, giving rise to the class of generalized policy iteration algorithms. Many actor critic methods belong to this category.

第一个问题通过允许程序在值确定之前更改策略(在部分或全部状态)来得到纠正。这也可能也是有问题的，因为它可能会阻止收敛。目前大多数算法都是这样做的，所以一类广义策略迭代算法就产生了。许多“行动者-批评者”方法都属于这一范畴。

TThe second issue can be corrected by allowing trajectories to contribute to any state-action pair in them. This may also help to some extent with the third problem, although a better solution when returns have high variance is Sutton's temporal difference (TD) methods that are based on the recursive Bellman equation.^[9]模板:Sfn The computation in TD methods can be incremental (when after each transition the memory is changed and the transition is thrown away), or batch (when the transitions are batched and the estimates are computed once based on the batch). Batch methods, such as the least-squares temporal difference method,^[10] may use the information in the samples better, while incremental methods are the only choice when batch methods are infeasible due to their high computational or memory complexity. Some methods try to combine the two approaches. Methods based on temporal differences also overcome the fourth issue.

The second issue can be corrected by allowing trajectories to contribute to any state-action pair in them. This may also help to some extent with the third problem, although a better solution when returns have high variance is Sutton's temporal difference (TD) methods that are based on the recursive Bellman equation. The computation in TD methods can be incremental (when after each transition the memory is changed and the transition is thrown away), or batch (when the transitions are batched and the estimates are computed once based on the batch). Batch methods, such as the least-squares temporal difference method, may use the information in the samples better, while incremental methods are the only choice when batch methods are infeasible due to their high computational or memory complexity. Some methods try to combine the two approaches. Methods based on temporal differences also overcome the fourth issue.

第二个问题可以通过允许过程对其中的任何“状态-行动”对做出贡献而得到纠正。尽管收益具有高方差时，更好的解决方案是基于递归Bellman方程的Sutton的时序差分（TD）方法，但这在某种程度上也可以解决第三个问题。 TD方法中的计算可以是增量的（在每次转换后更改内存并丢弃转换），也可以是批处理（对转换进行批处理并且一次完成估算）。批处理方法（例如最小二乘时差方法）^[11]可能会更好地使用样本中的信息，而增量方法是由于批处理方法的高计算或内存复杂性而无法使用的唯一选择。一些方法尝试将两种方法结合起来。基于时序差分的方法也克服了第四个问题。

In order to address the fifth issue, function approximation methods are used. Linear function approximation starts with a mapping [math]\displaystyle{ \phi }[/math] that assigns a finite-dimensional vector to each state-action pair. Then, the action values of a state-action pair [math]\displaystyle{ (s,a) }[/math] are obtained by linearly combining the components of [math]\displaystyle{ \phi(s,a) }[/math] with some weights [math]\displaystyle{ \theta }[/math]:

In order to address the fifth issue, function approximation methods are used. Linear function approximation starts with a mapping [math]\displaystyle{ \phi }[/math] that assigns a finite-dimensional vector to each state-action pair. Then, the action values of a state-action pair [math]\displaystyle{ (s,a) }[/math] are obtained by linearly combining the components of [math]\displaystyle{ \phi(s,a) }[/math] with some weights [math]\displaystyle{ \theta }[/math]:

为了解决第五个问题，我们使用了函数逼近方法。线性函数逼近从一个映射 [math]\displaystyle{ \phi }[/math]开始，它为每个“状态-行动”元组分配一个有限维向量。然后，通过线性组合[math]\displaystyle{ \phi(s,a) }[/math]的分量和一些权重 [math]\displaystyle{ \theta }[/math] 的分量，得到状态动作对[math]\displaystyle{ (s,a) }[/math]的动作值:

[math]\displaystyle{ Q(s,a) = \sum_{i=1}^d \theta_i \phi_i(s,a). }[/math]

The algorithms then adjust the weights, instead of adjusting the values associated with the individual state-action pairs. Methods based on ideas from nonparametric statistics (which can be seen to construct their own features) have been explored.

然后，算法会调整权重，而不是调整与各个“状态-行动”对相关联的值。基于无参数统计的方法也被研究过了(可以看出它们构造了自己的特征)。

Value iteration can also be used as a starting point, giving rise to the Q-learning algorithm and its many variants.^[12]

值迭代也可以被用作开始点，催生出了Q-学习算法以及它的各种变体。^[13]

The problem with using action-values is that they may need highly precise estimates of the competing action values that can be hard to obtain when the returns are noisy, though this problem is mitigated to some extent by temporal difference methods. Using the so-called compatible function approximation method compromises generality and efficiency. Another problem specific to TD comes from their reliance on the recursive Bellman equation. Most TD methods have a so-called [math]\displaystyle{ \lambda }[/math] parameter [math]\displaystyle{ (0\le \lambda\le 1) }[/math] that can continuously interpolate between Monte Carlo methods that do not rely on the Bellman equations and the basic TD methods that rely entirely on the Bellman equations. This can be effective in palliating this issue.

使用“行动-值”的问题是，它们可能需要对相互竞争的动作值进行高度精确的估计，而当收益是有噪声时，这个问题可能很难得到，尽管时序差分方法在一定程度上缓解了这个问题。使用所谓的兼容函数逼近方法会损害通用性和效率。另一个特定于 TD 的问题来自于他们对递归贝尔曼方程的依赖。大多数 TD 方法都有一个所谓的λ参数(0≤λ≤1) ，可以在不依赖于 Bellman 方程的蒙特卡洛方法和完全依赖于 Bellman 方程的基本 TD 方法之间连续插值。这可以有效地缓解这个问题。

直接策略搜索

An alternative method is to search directly in (some subset of) the policy space, in which case the problem becomes a case of stochastic optimization. The two approaches available are gradient-based and gradient-free methods.

另一种方法是直接在策略空间(的子集)中搜索。使用这种方法，问题就变成了随机优化问题。现有的两种方法是基于梯度的方法和无梯度的方法。

Gradient-based methods (policy gradient methods) start with a mapping from a finite-dimensional (parameter) space to the space of policies: given the parameter vector [math]\displaystyle{ \theta }[/math], let [math]\displaystyle{ \pi_\theta }[/math] denote the policy associated to [math]\displaystyle{ \theta }[/math]. Defining the performance function by

基于梯度的方法(策略梯度方法)从有限维(参数)空间到策略空间的映射开始: 给定参数向量[math]\displaystyle{ \theta }[/math]，让[math]\displaystyle{ \pi_\theta }[/math]表示与[math]\displaystyle{ \theta }[/math]相关的策略。通过以下方式定义性能函数

[math]\displaystyle{ \rho(\theta) = \rho^{\pi_\theta}, }[/math]

under mild conditions this function will be differentiable as a function of the parameter vector [math]\displaystyle{ \theta }[/math]. If the gradient of [math]\displaystyle{ \rho }[/math] was known, one could use gradient ascent. Since an analytic expression for the gradient is not available, only a noisy estimate is available. Such an estimate can be constructed in many ways, giving rise to algorithms such as Williams' REINFORCE method^[14] (which is known as the likelihood ratio method in the simulation-based optimization literature).^[15] Policy search methods have been used in the robotics context.^[16] Many policy search methods may get stuck in local optima (as they are based on local search).

通常情况下，这个函数是关于参数[math]\displaystyle{ \theta }[/math]可微的。如果已知参数[math]\displaystyle{ \rho }[/math]的梯度，则可以直接使用梯度下降法。但是，我们不能得到关于梯度的解析式，只能得到一个有噪的估计量。这个估计量可以通过许多方法来构建，由此诞生了像REINFORCE这样的算法。^[17] (which is known as the likelihood ratio method in the simulation-based optimization literature).^[18] Policy search methods have been used in the robotics context.^[19] 许多策略搜索方法会陷入局部最优点（有哪位他们本就是基于局部搜索的优化方法）

A large class of methods avoids relying on gradient information. These include simulated annealing, cross-entropy search or methods of evolutionary computation. Many gradient-free methods can achieve (in theory and in the limit) a global optimum.

此外，还有大量避免了使用梯度信息的方法。这些方法包括模拟退火、交叉熵搜索或者演化计算搜索方法。许多无梯度方法可以(在理论上和极限的意义上)得到一个全局最优解。

Policy search methods may converge slowly given noisy data. For example, this happens in episodic problems when the trajectories are long and the variance of the returns is large. Value-function based methods that rely on temporal differences might help in this case. In recent years, actor–critic methods have been proposed and performed well on various problems.^[20]

Policy search methods may converge slowly given noisy data. For example, this happens in episodic problems when the trajectories are long and the variance of the returns is large. Value-function based methods that rely on temporal differences might help in this case. In recent years, actor–critic methods have been proposed and performed well on various problems.

政策搜索方法在处理有噪数据时可能会收敛得很缓慢。例如，这种情况在过程较漫长，且收益的方差较大的场景问题中就会发生。依赖于时间差异的基于价值函数的方法在这种情况下可能会有所帮助。近年来，“行动者–批评者”方法被提出，并在各种问题上取得了良好的效果。

理论

Both the asymptotic and finite-sample behavior of most algorithms is well understood. Algorithms with provably good online performance (addressing the exploration issue) are known.

人们已经能很好地理解大多数算法的渐近行为和有限样本行为，还知道了被证明具有良好在线性能(解决探索问题)的算法。

Efficient exploration of MDPs is given in Burnetas and Katehakis (1997).^[21]. Finite-time performance bounds have also appeared for many algorithms, but these bounds are expected to be rather loose and thus more work is needed to better understand the relative advantages and limitations.

Efficient exploration of MDPs is given in Burnetas and Katehakis (1997).. Finite-time performance bounds have also appeared for many algorithms, but these bounds are expected to be rather loose and thus more work is needed to better understand the relative advantages and limitations.

Burnetas 和 Katehakis (1997)^[22]给出了高效探索 MDPs 的方法。许多算法的有限时间性能界限也已被找到，但这些界限相当宽松，人们还需要做更多的工作，以更好地了解相对优势和局限性。

For incremental algorithms, asymptotic convergence issues have been settled模板:Clarify. Temporal-difference-based algorithms converge under a wider set of conditions than was previously possible (for example, when used with arbitrary, smooth function approximation).

For incremental algorithms, asymptotic convergence issues have been settled. Temporal-difference-based algorithms converge under a wider set of conditions than was previously possible (for example, when used with arbitrary, smooth function approximation).

增量算法的提出解决了渐近收敛问题。基于时间差异的算法能在比以前更广泛的条件下达到收敛(例如，与任意的、平滑的函数逼近一起使用时)。

研究

Research topics include

现在强化学习的研究课题包括：

adaptive methods that work with fewer (or no) parameters under a large number of conditions
addressing the exploration problem in large MDPs
large-scale empirical evaluations
learning and acting under partial information (e.g., using predictive state representation)
modular and hierarchical reinforcement learning^[23]
improving existing value-function and policy search methods
algorithms that work well with large (or continuous) action spaces
transfer learning^[24]
lifelong learning
efficient sample-based planning (e.g., based on Monte Carlo tree search).
bug detection in software projects^[25]
Intrinsic motivation which differentiates information-seeking, curiosity-type behaviours from task-dependent goal-directed behaviours (typically) by introducing a reward function based on maximising novel information^[26]^[27]^[28]
Multiagent or distributed reinforcement learning is a topic of interest. Applications are expanding.^[29]
Actor-critic reinforcement learning
Reinforcement learning algorithms such as TD learning are under investigation as a model for dopamine-based learning in the brain. In this model, the dopaminergic projections from the substantia nigra to the basal ganglia function as the prediction error. Reinforcement learning has been used as a part of the model for human skill learning, especially in relation to the interaction between implicit and explicit learning in skill acquisition (the first publication on this application was in 1995–1996).^[30]

在诸多条件下使用较少（或没有）参数的自适应方法
解决在大规模MDPs里的探索问题
大规模实证评价
部分可观测马尔可夫决策过程下的学习与行动
模块化分层的强化学习
改进现有的价值函数和策略搜索方法
适用于大型（或连续）动作空间的算法
迁移学习^[31]
终生学习
基于样本的高效规划算法（比如，基于蒙特卡洛搜索树的方法）
软件项目中的缺陷检测 s^[32]
通过引入最大化新信息的奖励函数来区分主体（agent）信息收集等好奇行为，以及任务导向等有目标行为的不同内在动机 ^[26]^[27]^[28]
多主体或分布式强化学习也是令人感兴趣的话题，这方面的应用正在不断扩大。
“行动者-批评家”强化学习
像 TD 学习这样的强化学习学习算法被研究用来建模大脑中基于多巴胺的学习模型。在这个模型中，从黑质到基底神经节的多巴胺能投射可以被视为预测错误。强化学习被用作是人类技能学习模型的一部分，特别是关于技能习得中内隐和外显学习之间的相互作用(这一应用的第一次出版是在1995-1996年)。

强化学习算法比较

Algorithm	Description	Model	Policy	Action Space	State Space	Operator
Monte Carlo	Every visit to Monte Carlo	Model-Free	Either	Discrete	Discrete	Sample-means
Q-learning	State–action–reward–state	Model-Free	Off-policy	Discrete	Discrete	Q-value
SARSA	State–action–reward–state–action	Model-Free	On-policy	Discrete	Discrete	Q-value
Q-learning - Lambda	State–action–reward–state with eligibility traces	Model-Free	Off-policy	Discrete	Discrete	Q-value
SARSA - Lambda	State–action–reward–state–action with eligibility traces	Model-Free	On-policy	Discrete	Discrete	Q-value
DQN	Deep Q Network	Model-Free	Off-policy	Discrete	Continuous	Q-value
DDPG	Deep Deterministic Policy Gradient	Model-Free	Off-policy	Continuous	Continuous	Q-value
A3C	Asynchronous Advantage Actor-Critic Algorithm	Model-Free	On-policy	Continuous	Continuous	Advantage
NAF	Q-Learning with Normalized Advantage Functions	Model-Free	Off-policy	Continuous	Continuous	Advantage
TRPO	Trust Region Policy Optimization	Model-Free	On-policy	Continuous	Continuous	Advantage
PPO	Proximal Policy Optimization	Model-Free	On-policy	Continuous	Continuous	Advantage
TD3	Twin Delayed Deep Deterministic Policy Gradient	Model-Free	Off-policy	Continuous	Continuous	Q-value
SAC	Soft Actor-Critic	Model-Free	Off-policy	Continuous	Continuous	Advantage

算法	描述	模型	策略	行动空间	状态空间	操作符
蒙特卡洛	采样	无模型	Either	离散	离散	样本均值
Q-learning	状态–行动–奖励–状态	无模型	Off	离散	离散	Q值
SARSA	状态–行动–奖励–状态-行动	无模型	On	离散	离散	Q值
Q-learning - Lambda	带资格痕迹的状态–行动–奖励–状态	无模型	Off	离散	离散	Q值
SARSA - Lambda	带资格痕迹的状态–行动–奖励–状态-行动	无模型	On	离散	离散	Q值
DQN	深度Q网络	无模型	Off	离散	连续	Q值
DDPG	深度确定性策略梯度	无模型	Off	连续	连续	Q值
A3C	异步优势行“动者-批评者”算法	无模型	On	连续	连续	优势（Advantage）
NAF	带归一化优势函数的Q-学习	无模型	Off	连续	连续	优势
TRPO	信任区域策略的最优化	无模型	On	连续	连续	优势
PPO	近端策略最优化	无模型	On	连续	连续	优势
TD3	双延迟深度确定性策略梯度	无模型	Off	连续	连续	Q-值
SAC	软“行动者-批评者”	无模型	Off	连续	连续	优势

深度强度学习

This approach extends reinforcement learning by using a deep neural network and without explicitly designing the state space.^[33] The work on learning ATARI games by Google DeepMind increased attention to deep reinforcement learning or end-to-end reinforcement learning.^[34]

This approach extends reinforcement learning by using a deep neural network and without explicitly designing the state space. The work on learning ATARI games by Google DeepMind increased attention to deep reinforcement learning or end-to-end reinforcement learning.

这种方法通过使用深层神经网络来扩展强化学习，因此不需要明确地设计状态空间。^[33]谷歌 DeepMind 在雅达利游戏上的研究提高了人们对强化学习或端到端强化学习游戏的关注。^[34]

逆强化学习

In inverse reinforcement learning (IRL), no reward function is given. Instead, the reward function is inferred given an observed behavior from an expert. The idea is to mimic observed behavior, which is often optimal or close to optimal.^[35]

In inverse reinforcement learning (IRL), no reward function is given. Instead, the reward function is inferred given an observed behavior from an expert. The idea is to mimic observed behavior, which is often optimal or close to optimal.

在逆强化学习中，奖励函数是没有给出的。相反，奖励功能是根据观察到的专家的行为推断出来的。这个想法是为了模仿观察到的行为，且通常是最优的或接近最优的。^[36]

学徒学习

In apprenticeship learning, an expert demonstrates the target behavior. The system tries to recover the policy via observation.

在学徒学习过程中，师傅会展示目标行为，系统（学徒）试图通过观察来还原策略。

References

参考

↑ ^1.0 ^1.1 Kaelbling, Leslie P.; Littman, Michael L.; Moore, Andrew W. (1996). "Reinforcement Learning: A Survey". Journal of Artificial Intelligence Research. 4: 237–285. arXiv:cs/9605103. doi:10.1613/jair.301. Archived from the original on 2001-11-20. {{cite journal}}: Invalid |ref=harv (help)
↑ van Otterlo, M.; Wiering, M. (2012). Reinforcement learning and markov decision processes. Adaptation, Learning, and Optimization. 12. pp. 3–42. doi:10.1007/978-3-642-27645-3_1. ISBN 978-3-642-27644-6.
↑ Gosavi, Abhijit (2003). Simulation-based Optimization: Parametric Optimization Techniques and Reinforcement. Operations Research/Computer Science Interfaces Series. Springer. ISBN 978-1-4020-7454-7. https://www.springer.com/mathematics/applications/book/978-1-4020-7454-7.
↑ Gosavi, Abhijit (2003). Simulation-based Optimization: Parametric Optimization Techniques and Reinforcement. Operations Research/Computer Science Interfaces Series. Springer. ISBN 978-1-4020-7454-7. https://www.springer.com/mathematics/applications/book/978-1-4020-7454-7.
↑ Burnetas, Apostolos N.; Katehakis, Michael N. (1997), "Optimal adaptive policies for Markov Decision Processes", Mathematics of Operations Research, 22: 222–255, doi:10.1287/moor.22.1.222 {{citation}}: Invalid |ref=harv (help)
↑ Burnetas, Apostolos N.; Katehakis, Michael N. (1997), "Optimal adaptive policies for Markov Decision Processes", Mathematics of Operations Research, 22: 222–255, doi:10.1287/moor.22.1.222 {{citation}}: Invalid |ref=harv (help)
↑ Tokic, Michel; Palm, Günther (2011), "Value-Difference Based Exploration: Adaptive Control Between Epsilon-Greedy and Softmax" (PDF), KI 2011: Advances in Artificial Intelligence, Lecture Notes in Computer Science, vol. 7006, Springer, pp. 335–346, ISBN 978-3-642-24455-1
↑ ^8.0 ^8.1 ^8.2 Reinforcement learning: An introduction. http://people.inf.elte.hu/lorincz/Files/RL_2006/SuttonBook.pdf.
↑ Sutton, Richard S. (1984). Temporal Credit Assignment in Reinforcement Learning (PhD thesis). University of Massachusetts, Amherst, MA.
↑ Bradtke, Steven J.; Barto, Andrew G. (1996). "Learning to predict by the method of temporal differences". Machine Learning. 22: 33–57. CiteSeerX 10.1.1.143.857. doi:10.1023/A:1018056104778. S2CID 20327856.
↑ Bradtke, Steven J.; Barto, Andrew G. (1996). "Learning to predict by the method of temporal differences". Machine Learning. 22: 33–57. CiteSeerX 10.1.1.143.857. doi:10.1023/A:1018056104778. S2CID 20327856.
↑ Watkins, Christopher J.C.H. (1989). Learning from Delayed Rewards (PDF) (PhD thesis). King’s College, Cambridge, UK.
↑ Watkins, Christopher J.C.H. (1989). Learning from Delayed Rewards (PDF) (PhD thesis). King’s College, Cambridge, UK.
↑ Williams, Ronald J. (1987). A class of gradient-estimating algorithms for reinforcement learning in neural networks. CiteSeerX 10.1.1.129.8871. {{cite conference}}: Unknown parameter |booktitle= ignored (help)
↑ Peters, Jan; Vijayakumar, Sethu; Schaal, Stefan (2003). Reinforcement Learning for Humanoid Robotics (PDF). {{cite conference}}: Unknown parameter |booktitle= ignored (help)
↑ Deisenroth, Marc Peter; Neumann, Gerhard; Peters, Jan (2013). A Survey on Policy Search for Robotics. Foundations and Trends in Robotics. 2. NOW Publishers. pp. 1–142. doi:10.1561/2300000021. http://eprints.lincoln.ac.uk/28029/1/PolicySearchReview.pdf.
↑ Williams, Ronald J. (1987). A class of gradient-estimating algorithms for reinforcement learning in neural networks. CiteSeerX 10.1.1.129.8871. {{cite conference}}: Unknown parameter |booktitle= ignored (help)
↑ Peters, Jan; Vijayakumar, Sethu; Schaal, Stefan (2003). Reinforcement Learning for Humanoid Robotics (PDF). {{cite conference}}: Unknown parameter |booktitle= ignored (help)
↑ Deisenroth, Marc Peter; Neumann, Gerhard; Peters, Jan (2013). A Survey on Policy Search for Robotics. Foundations and Trends in Robotics. 2. NOW Publishers. pp. 1–142. doi:10.1561/2300000021. http://eprints.lincoln.ac.uk/28029/1/PolicySearchReview.pdf.
↑ Juliani, Arthur (2016-12-17). "Simple Reinforcement Learning with Tensorflow Part 8: Asynchronous Actor-Critic Agents (A3C)". Medium. Retrieved 2018-02-22.
↑ Burnetas, Apostolos N.; Katehakis, Michael N. (1997), "Optimal adaptive policies for Markov Decision Processes", Mathematics of Operations Research, 22: 222–255, doi:10.1287/moor.22.1.222 {{citation}}: Invalid |ref=harv (help)
↑ Burnetas, Apostolos N.; Katehakis, Michael N. (1997), "Optimal adaptive policies for Markov Decision Processes", Mathematics of Operations Research, 22: 222–255, doi:10.1287/moor.22.1.222 {{citation}}: Invalid |ref=harv (help)
↑ Kulkarni, Tejas D.; Narasimhan, Karthik R.; Saeedi, Ardavan; Tenenbaum, Joshua B. (2016). "Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation". Proceedings of the 30th International Conference on Neural Information Processing Systems. NIPS'16. USA: Curran Associates Inc.: 3682–3690. arXiv:1604.06057. Bibcode:2016arXiv160406057K. ISBN 978-1-5108-3881-9.
↑ George Karimpanal, Thommen; Bouffanais, Roland (2019). "Self-organizing maps for storage and transfer of knowledge in reinforcement learning". Adaptive Behavior (in English). 27 (2): 111–126. arXiv:1811.08318. doi:10.1177/1059712318818568. ISSN 1059-7123.
↑ "On the Use of Reinforcement Learning for Testing Game Mechanics : ACM - Computers in Entertainment". cie.acm.org (in English). Retrieved 2018-11-27.
↑ ^26.0 ^26.1 Kaplan, F. and Oudeyer, P. (2004). Maximizing learning progress: an internal reward system for development. Embodied artificial intelligence, pages 629–629.
↑ ^27.0 ^27.1 Klyubin, A., Polani, D., and Nehaniv, C. (2008). Keep your options open: an information-based driving principle for sensorimotor systems. PloS ONE, 3(12):e4018. https://dx.doi.org/10.1371%2Fjournal.pone.0004018
↑ ^28.0 ^28.1 Barto, A. G. (2013). “Intrinsic motivation and reinforcement learning,” in Intrinsically Motivated Learning in Natural and Artificial Systems (Berlin; Heidelberg: Springer), 17–47
↑ "Reinforcement Learning / Successes of Reinforcement Learning". umichrl.pbworks.com. Retrieved 2017-08-06.
↑ [1] -{zh-cn:互联网档案馆; zh-tw:網際網路檔案館; zh-hk:互聯網檔案館;}-的存檔，存档日期2017-04-26.
↑ George Karimpanal, Thommen; Bouffanais, Roland (2019). "Self-organizing maps for storage and transfer of knowledge in reinforcement learning". Adaptive Behavior (in English). 27 (2): 111–126. arXiv:1811.08318. doi:10.1177/1059712318818568. ISSN 1059-7123.
↑ "On the Use of Reinforcement Learning for Testing Game Mechanics : ACM - Computers in Entertainment". cie.acm.org (in English). Retrieved 2018-11-27.
↑ ^33.0 ^33.1 Francois-Lavet, Vincent; et al. (2018). "An Introduction to Deep Reinforcement Learning". Foundations and Trends in Machine Learning. 11 (3–4): 219–354. arXiv:1811.12560. Bibcode:2018arXiv181112560F. doi:10.1561/2200000071.
↑ ^34.0 ^34.1 Mnih, Volodymyr; et al. (2015). "Human-level control through deep reinforcement learning". Nature. 518 (7540): 529–533. Bibcode:2015Natur.518..529M. doi:10.1038/nature14236. PMID 25719670.
↑ Ng, A. Y.; Russell, S. J. (2000). "Algorithms for Inverse Reinforcement Learning". Proceeding ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning. pp. 663–670. ISBN 1-55860-707-2. https://ai.stanford.edu/~ang/papers/icml00-irl.pdf.
↑ Ng, A. Y.; Russell, S. J. (2000). "Algorithms for Inverse Reinforcement Learning". Proceeding ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning. pp. 663–670. ISBN 1-55860-707-2. https://ai.stanford.edu/~ang/papers/icml00-irl.pdf.

延伸阅读

Auer, Peter; Jaksch, Thomas; Ortner, Ronald (2010). "Near-optimal regret bounds for reinforcement learning". Journal of Machine Learning Research. 11: 1563–1600.
Busoniu, Lucian; Babuska, Robert; De Schutter, Bart; Ernst, Damien (2010). Reinforcement Learning and Dynamic Programming using Function Approximators. Taylor & Francis CRC Press. ISBN 978-1-4398-2108-4. http://www.dcsc.tudelft.nl/rlbook/.
François-Lavet, Vincent; Henderson, Peter; Islam, Riashat; Bellemare, Marc G.; Pineau, Joelle (2018). "An Introduction to Deep Reinforcement Learning". Foundations and Trends in Machine Learning. 11 (3–4): 219–354. arXiv:1811.12560. Bibcode:2018arXiv181112560F. doi:10.1561/2200000071. S2CID 54434537.
Powell, Warren (2007). Approximate dynamic programming: solving the curses of dimensionality. Wiley-Interscience. ISBN 978-0-470-17155-4. http://www.castlelab.princeton.edu/adp.htm.
Sutton, Richard S.; Barto, Andrew G. (2018). Reinforcement Learning: An Introduction (2 ed.). MIT Press. ISBN 978-0-262-03924-6. http://incompleteideas.net/sutton/book/the-book.html.
Sutton, Richard S. (1988). "Learning to predict by the method of temporal differences". Machine Learning. 3: 9–44. doi:10.1007/BF00115009.
Szita, Istvan; Szepesvari, Csaba (2010). Model-based Reinforcement Learning with Nearly Tight Exploration Complexity Bounds (PDF). Omnipress. pp. 1031–1038. Archived from the original (PDF) on 2010-07-14. {{cite conference}}: Unknown parameter |booktitle= ignored (help)

External links

麻省大学阿默斯特分校强化学习实验室

阿尔伯塔大学强化学习与人工智能实验室

麻省大学阿默斯特分校自动化学习实验室

混合强化学习介绍

真实世界中的强化学习实验

吴恩达强化学习介绍

强化学习剖析强化学习系列博客文章及相应Python代码

模板:Computer science

Category:Markov models

类别: 马尔科夫模型

Category:Belief revision

类别: 信念修正

This page was moved from wikipedia:en:Reinforcement learning. Its edit history can be viewed at 增强学习/edithistory

[kaelbling-1] 1.0 ^1.1 Kaelbling, Leslie P.; Littman, Michael L.; Moore, Andrew W. (1996). "Reinforcement Learning: A Survey". Journal of Artificial Intelligence Research. 4: 237–285. arXiv:cs/9605103. doi:10.1613/jair.301. Archived from the original on 2001-11-20. {{cite journal}}: Invalid |ref=harv (help)

[2] van Otterlo, M.; Wiering, M. (2012). Reinforcement learning and markov decision processes. Adaptation, Learning, and Optimization. 12. pp. 3–42. doi:10.1007/978-3-642-27645-3_1. ISBN 978-3-642-27644-6.

[3] Gosavi, Abhijit (2003). Simulation-based Optimization: Parametric Optimization Techniques and Reinforcement. Operations Research/Computer Science Interfaces Series. Springer. ISBN 978-1-4020-7454-7. https://www.springer.com/mathematics/applications/book/978-1-4020-7454-7.

[4] Gosavi, Abhijit (2003). Simulation-based Optimization: Parametric Optimization Techniques and Reinforcement. Operations Research/Computer Science Interfaces Series. Springer. ISBN 978-1-4020-7454-7. https://www.springer.com/mathematics/applications/book/978-1-4020-7454-7.

[5] Burnetas, Apostolos N.; Katehakis, Michael N. (1997), "Optimal adaptive policies for Markov Decision Processes", Mathematics of Operations Research, 22: 222–255, doi:10.1287/moor.22.1.222 {{citation}}: Invalid |ref=harv (help)

[6] Burnetas, Apostolos N.; Katehakis, Michael N. (1997), "Optimal adaptive policies for Markov Decision Processes", Mathematics of Operations Research, 22: 222–255, doi:10.1287/moor.22.1.222 {{citation}}: Invalid |ref=harv (help)

[7] Tokic, Michel; Palm, Günther (2011), "Value-Difference Based Exploration: Adaptive Control Between Epsilon-Greedy and Softmax" (PDF), KI 2011: Advances in Artificial Intelligence, Lecture Notes in Computer Science, vol. 7006, Springer, pp. 335–346, ISBN 978-3-642-24455-1

[:0-8] 8.0 ^8.1 ^8.2 Reinforcement learning: An introduction. http://people.inf.elte.hu/lorincz/Files/RL_2006/SuttonBook.pdf.

[9] Sutton, Richard S. (1984). Temporal Credit Assignment in Reinforcement Learning (PhD thesis). University of Massachusetts, Amherst, MA.

[10] Bradtke, Steven J.; Barto, Andrew G. (1996). "Learning to predict by the method of temporal differences". Machine Learning. 22: 33–57. CiteSeerX 10.1.1.143.857. doi:10.1023/A:1018056104778. S2CID 20327856.

[11] Bradtke, Steven J.; Barto, Andrew G. (1996). "Learning to predict by the method of temporal differences". Machine Learning. 22: 33–57. CiteSeerX 10.1.1.143.857. doi:10.1023/A:1018056104778. S2CID 20327856.

[12] Watkins, Christopher J.C.H. (1989). Learning from Delayed Rewards (PDF) (PhD thesis). King’s College, Cambridge, UK.

[13] Watkins, Christopher J.C.H. (1989). Learning from Delayed Rewards (PDF) (PhD thesis). King’s College, Cambridge, UK.

[14] Williams, Ronald J. (1987). A class of gradient-estimating algorithms for reinforcement learning in neural networks. CiteSeerX 10.1.1.129.8871. {{cite conference}}: Unknown parameter |booktitle= ignored (help)

[15] Peters, Jan; Vijayakumar, Sethu; Schaal, Stefan (2003). Reinforcement Learning for Humanoid Robotics (PDF). {{cite conference}}: Unknown parameter |booktitle= ignored (help)

[16] Deisenroth, Marc Peter; Neumann, Gerhard; Peters, Jan (2013). A Survey on Policy Search for Robotics. Foundations and Trends in Robotics. 2. NOW Publishers. pp. 1–142. doi:10.1561/2300000021. http://eprints.lincoln.ac.uk/28029/1/PolicySearchReview.pdf.

[17] Williams, Ronald J. (1987). A class of gradient-estimating algorithms for reinforcement learning in neural networks. CiteSeerX 10.1.1.129.8871. {{cite conference}}: Unknown parameter |booktitle= ignored (help)

[18] Peters, Jan; Vijayakumar, Sethu; Schaal, Stefan (2003). Reinforcement Learning for Humanoid Robotics (PDF). {{cite conference}}: Unknown parameter |booktitle= ignored (help)

[19] Deisenroth, Marc Peter; Neumann, Gerhard; Peters, Jan (2013). A Survey on Policy Search for Robotics. Foundations and Trends in Robotics. 2. NOW Publishers. pp. 1–142. doi:10.1561/2300000021. http://eprints.lincoln.ac.uk/28029/1/PolicySearchReview.pdf.

[20] Juliani, Arthur (2016-12-17). "Simple Reinforcement Learning with Tensorflow Part 8: Asynchronous Actor-Critic Agents (A3C)". Medium. Retrieved 2018-02-22.

[21] Burnetas, Apostolos N.; Katehakis, Michael N. (1997), "Optimal adaptive policies for Markov Decision Processes", Mathematics of Operations Research, 22: 222–255, doi:10.1287/moor.22.1.222 {{citation}}: Invalid |ref=harv (help)

[22] Burnetas, Apostolos N.; Katehakis, Michael N. (1997), "Optimal adaptive policies for Markov Decision Processes", Mathematics of Operations Research, 22: 222–255, doi:10.1287/moor.22.1.222 {{citation}}: Invalid |ref=harv (help)

[23] Kulkarni, Tejas D.; Narasimhan, Karthik R.; Saeedi, Ardavan; Tenenbaum, Joshua B. (2016). "Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation". Proceedings of the 30th International Conference on Neural Information Processing Systems. NIPS'16. USA: Curran Associates Inc.: 3682–3690. arXiv:1604.06057. Bibcode:2016arXiv160406057K. ISBN 978-1-5108-3881-9.

[24] George Karimpanal, Thommen; Bouffanais, Roland (2019). "Self-organizing maps for storage and transfer of knowledge in reinforcement learning". Adaptive Behavior (in English). 27 (2): 111–126. arXiv:1811.08318. doi:10.1177/1059712318818568. ISSN 1059-7123.

[25] "On the Use of Reinforcement Learning for Testing Game Mechanics : ACM - Computers in Entertainment". cie.acm.org (in English). Retrieved 2018-11-27.

[kaplan2004-26] 26.0 ^26.1 Kaplan, F. and Oudeyer, P. (2004). Maximizing learning progress: an internal reward system for development. Embodied artificial intelligence, pages 629–629.

[klyubin2008-27] 27.0 ^27.1 Klyubin, A., Polani, D., and Nehaniv, C. (2008). Keep your options open: an information-based driving principle for sensorimotor systems. PloS ONE, 3(12):e4018. https://dx.doi.org/10.1371%2Fjournal.pone.0004018

[barto2013-28] 28.0 ^28.1 Barto, A. G. (2013). “Intrinsic motivation and reinforcement learning,” in Intrinsically Motivated Learning in Natural and Artificial Systems (Berlin; Heidelberg: Springer), 17–47

[29] "Reinforcement Learning / Successes of Reinforcement Learning". umichrl.pbworks.com. Retrieved 2017-08-06.

[30] [1] -{zh-cn:互联网档案馆; zh-tw:網際網路檔案館; zh-hk:互聯網檔案館;}-的存檔，存档日期2017-04-26.

[31] George Karimpanal, Thommen; Bouffanais, Roland (2019). "Self-organizing maps for storage and transfer of knowledge in reinforcement learning". Adaptive Behavior (in English). 27 (2): 111–126. arXiv:1811.08318. doi:10.1177/1059712318818568. ISSN 1059-7123.

[32] "On the Use of Reinforcement Learning for Testing Game Mechanics : ACM - Computers in Entertainment". cie.acm.org (in English). Retrieved 2018-11-27.

[intro_deep_RL-33] 33.0 ^33.1 Francois-Lavet, Vincent; et al. (2018). "An Introduction to Deep Reinforcement Learning". Foundations and Trends in Machine Learning. 11 (3–4): 219–354. arXiv:1811.12560. Bibcode:2018arXiv181112560F. doi:10.1561/2200000071.

[DQN2-34] 34.0 ^34.1 Mnih, Volodymyr; et al. (2015). "Human-level control through deep reinforcement learning". Nature. 518 (7540): 529–533. Bibcode:2015Natur.518..529M. doi:10.1038/nature14236. PMID 25719670.

[35] Ng, A. Y.; Russell, S. J. (2000). "Algorithms for Inverse Reinforcement Learning". Proceeding ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning. pp. 663–670. ISBN 1-55860-707-2. https://ai.stanford.edu/~ang/papers/icml00-irl.pdf.

[36] Ng, A. Y.; Russell, S. J. (2000). "Algorithms for Inverse Reinforcement Learning". Proceeding ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning. pp. 663–670. ISBN 1-55860-707-2. https://ai.stanford.edu/~ang/papers/icml00-irl.pdf.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]

[18]

[19]

[20]

[21]

[22]

[23]

[24]

[25]

[26]

[27]

[28]

[29]

[30]

[31]

[32]

[33]

[34]

[35]

[36]