'''Reinforcement learning''' ('''RL''') is an area of [[machine learning]] concerned with how [[software agent]]s ought to take [[Action selection|actions]] in an environment in order to maximize the notion of cumulative reward. Reinforcement learning is one of three basic machine learning paradigms, alongside [[supervised learning]] and [[unsupervised learning]].
 
The environment is typically stated in the form of a [[Markov decision process]] (MDP), because many reinforcement learning algorithms for this context utilize [[dynamic programming]] techniques.<ref>{{Cite book|title=Reinforcement learning and markov decision processes|author1=van Otterlo, M.|author2=Wiering, M.|journal=Reinforcement Learning |volume=12|pages=3–42 |year=2012 |doi=10.1007/978-3-642-27645-3_1|series=Adaptation, Learning, and Optimization|isbn=978-3-642-27644-6}}</ref> The main difference between the classical dynamic programming methods  and reinforcement learning algorithms is that the latter do not assume knowledge of an exact mathematical model of the MDP and they target large MDPs where exact methods become infeasible.{{toclimit|3}}
 
Reinforcement learning, due to its generality, is studied in many other disciplines, such as [[game theory]], [[control theory]], [[operations research]], [[information theory]], [[simulation-based optimization]], [[multi-agent system]]s, [[swarm intelligence]], [[statistics]] and [[genetic algorithm]]s. In the operations research and control literature, reinforcement learning is called ''approximate dynamic programming,'' or ''neuro-dynamic programming.'' The problems of interest in reinforcement learning have also been studied in the [[optimal control theory|theory of optimal control]], which is concerned mostly with the existence and characterization of optimal solutions, and algorithms for their exact computation, and less with learning or approximation, particularly in the absence of a mathematical model of the environment. In [[economics]] and [[game theory]], reinforcement learning may be used to explain how equilibrium may arise under [[bounded rationality]].
 
Basic reinforcement is modeled as a [[Markov decision process]]:
* rules that the agent is to follow
Rules are often [[stochastic]]. The observation typically involves the scalar, immediate reward associated with the last transition. In many works, the agent is assumed to observe the current environmental state (''full observability''). If not, the agent has ''partial observability''. Sometimes the set of actions available to the agent is restricted (for example, a balance may not be driven below zero: if the agent's current value is 3 and a transition would reduce the value by 4, that transition is not allowed).
 
A reinforcement learning agent interacts with its environment in discrete time steps. At each time {{mvar|t}}, the agent receives an observation <math>o_t</math>, which typically includes the reward <math>r_t</math>. It then chooses an action <math>a_t</math> from the set of available actions, which is subsequently sent to the environment. The environment moves to a new state <math>s_{t+1}</math> and the reward <math>r_{t+1}</math> associated with the ''transition'' <math>(s_t,a_t,s_{t+1})</math> is determined. The goal of a reinforcement learning agent is to collect as much reward as possible. The [[Software agent|agent]] can (possibly randomly) choose any action as a function of the history.
 
When the agent's performance is compared to that of an agent that acts optimally, the difference in performance gives rise to the notion of ''[[regret (game theory)|regret]]''. In order to act near optimally, the agent must reason about the long term consequences of its actions (i.e., maximize future income), although the immediate reward associated with this might be negative.
 
Thus, reinforcement learning is particularly well-suited to problems that include a long-term versus short-term reward trade-off. It has been applied successfully to various problems, including [[robot control]], elevator scheduling, [[telecommunications]], [[backgammon]], [[checkers]]{{Sfn|Sutton|Barto|p=|loc=Chapter 11}} and [[Go (game)|Go]] ([[AlphaGo]]).
 
Two elements make reinforcement learning powerful: the use of samples to optimize performance, and the use of function approximation to deal with large environments. Thanks to these two key components, reinforcement learning can be used in large environments in the following situations:
* A model of the environment is known, but an [[Closed-form expression|analytic solution]] is not available;
* Only a simulation model of the environment is given (the subject of [[simulation-based optimization]]);<ref>{{cite book|url = https://www.springer.com/mathematics/applications/book/978-1-4020-7454-7|title = Simulation-based Optimization: Parametric Optimization Techniques and Reinforcement|last = Gosavi|first = Abhijit|publisher = Springer|year = 2003|isbn = 978-1-4020-7454-7|pages =|ref = harv|authorlink = Abhijit Gosavi|series = Operations Research/Computer Science Interfaces Series}}</ref>
* The only way to collect information about the environment is to interact with it.
 
The first two of these problems could be considered planning problems (since some form of model is available), while the last one could be considered to be a genuine learning problem. However, reinforcement learning converts both planning problems to [[machine learning]] problems.
 
== Exploration ==
The exploration vs. exploitation trade-off has been most thoroughly studied through the [[multi-armed bandit]] problem and for finite state space MDPs in Burnetas and Katehakis (1997).<ref>{{citation | last1 = Burnetas|first1 = Apostolos N.|last2 = Katehakis|first2 = Michael N.|authorlink2 = Michael N. Katehakis|year = 1997|title = Optimal adaptive policies for Markov Decision Processes|journal = Mathematics of Operations Research|volume = 22|pages = 222–255|ref =  harv|doi=10.1287/moor.22.1.222}}</ref>
 
Reinforcement learning requires clever exploration mechanisms; randomly selecting actions, without reference to an estimated probability distribution, shows poor performance. The case of (small) finite [[Markov decision process]]es is relatively well understood. However, due to the lack of algorithms that scale well with the number of states (or scale to problems with infinite state spaces), simple exploration methods are the most practical.
 
:<math>R=\sum_{t=0}^\infty \gamma^t r_t,</math>
 
where <math>r_t</math> is the reward at step <math>t</math>, <math>\gamma \in [0,1) </math> is the [[Q-learning#Discount_factor|discount-rate]].
      
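As a small illustration of this formula, the sketch below sums a finite sequence of sampled rewards with the discount factor applied; the function name is purely illustrative.

<syntaxhighlight lang="python">
def discounted_return(rewards, gamma=0.9):
    """Return sum_t gamma^t * r_t for a finite reward sequence r_0, r_1, ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: rewards 1, 1, 1 with gamma = 0.9 give 1 + 0.9 + 0.81 = 2.71.
assert abs(discounted_return([1, 1, 1], gamma=0.9) - 2.71) < 1e-9
</syntaxhighlight>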
=== Brute-force method ===
The [[brute-force search|brute force]] approach entails two steps:
      
* choose the policy with the largest expected return
One problem with this is that the number of policies can be large, or even infinite. Another is that variance of the returns may be large, which requires many samples to accurately estimate the return of each policy.
      
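The two steps can be sketched as follows; the helper <code>sample_return</code>, which runs one episode under a given policy and reports its return, is an assumed ingredient rather than something defined in this article.

<syntaxhighlight lang="python">
def brute_force_search(policies, sample_return, num_episodes=100):
    """Step 1: estimate each policy's expected return from sampled episodes.
       Step 2: choose the policy with the largest estimated return."""
    best_policy, best_value = None, float("-inf")
    for policy in policies:
        returns = [sample_return(policy) for _ in range(num_episodes)]
        value = sum(returns) / len(returns)   # Monte Carlo estimate of the expected return
        if value > best_value:
            best_policy, best_value = policy, value
    return best_policy, best_value
</syntaxhighlight>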
These problems can be ameliorated if we assume some structure and allow samples generated from one policy to influence the estimates made for others. The two main approaches for achieving this are [[#Value function|value function estimation]] and [[#Direct policy search|direct policy search]].
 
=== Value function ===
{{see also|Value function}}
 
Value function approaches attempt to find a policy that maximizes the return by maintaining a set of estimates of expected returns for some policy (usually either the "current" [on-policy] or the optimal [off-policy] one).
      
These methods rely on the theory of MDPs, where optimality is defined in a sense that is stronger than the above one: A policy is called optimal if it achieves the best expected return from ''any'' initial state (i.e., initial distributions play no role in this definition). Again, an optimal policy can always be found amongst stationary policies.
 
To define optimality in a formal manner, define the value of a policy <math>\pi</math> by
      
==== Monte Carlo methods ====
[[Monte Carlo sampling|Monte Carlo methods]] can be used in an algorithm that mimics policy iteration. Policy iteration consists of two steps: ''policy evaluation'' and ''policy improvement''.
 
Monte Carlo methods are used in the policy evaluation step. In this step, given a stationary, deterministic policy <math>\pi</math>, the goal is to compute the function values <math>Q^\pi(s,a)</math> (or a good approximation to them) for all state-action pairs <math>(s,a)</math>. Assume (for simplicity) that the MDP is finite, that sufficient memory is available to store the action values, and that the problem is episodic, with a new episode starting from some random initial state after each episode ends. Then, the value of a given state-action pair <math>(s,a)</math> can be computed by averaging the sampled returns that originated from <math>(s,a)</math> over time. Given sufficient time, this procedure can construct a precise estimate <math>Q</math> of the action-value function <math>Q^\pi</math>.
 
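A minimal sketch of this evaluation step is shown below. It assumes an episodic environment object with <code>reset()</code> and <code>step()</code> methods and a deterministic policy given as a function; these names are illustrative. Sampled returns from each visited state–action pair are averaged to estimate <math>Q^\pi</math>.

<syntaxhighlight lang="python">
from collections import defaultdict

def mc_policy_evaluation(env, policy, episodes=1000, gamma=0.9):
    """Every-visit Monte Carlo estimate of Q^pi(s, a) for a fixed policy."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for _ in range(episodes):
        # Generate one episode by following the policy.
        trajectory, state, done = [], env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            trajectory.append((state, action, reward))
            state = next_state
        # Walk backwards, accumulating the discounted return G from each step.
        G = 0.0
        for state, action, reward in reversed(trajectory):
            G = reward + gamma * G
            returns_sum[(state, action)] += G
            returns_cnt[(state, action)] += 1
    return {sa: returns_sum[sa] / returns_cnt[sa] for sa in returns_sum}
</syntaxhighlight>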
In the policy improvement step, the next policy is obtained by computing a ''greedy'' policy with respect to <math>Q</math>: Given a state <math>s</math>, this new policy returns an action that maximizes <math>Q(s,\cdot)</math>. In practice [[lazy evaluation]] can defer the computation of the maximizing actions to when they are needed.
      
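The improvement step can then be sketched as picking, in every state, an action that maximizes the current estimate <math>Q</math>; the dictionary representation of the estimates follows the sketch above and is an assumption, not notation from this article.

<syntaxhighlight lang="python">
def greedy_policy(Q, states, actions):
    """Return the policy that, in every state, picks an action maximizing Q(s, .)."""
    return {s: max(actions, key=lambda a: Q.get((s, a), 0.0)) for s in states}
</syntaxhighlight>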
==== Temporal difference methods ====
{{Main|Temporal difference learning}}
      
The first problem is corrected by allowing the procedure to change the policy (at some or all states) before the values settle. This too may be problematic as it might prevent convergence. Most current algorithms do this, giving rise to the class of ''generalized policy iteration'' algorithms. Many ''actor critic'' methods belong to this category.
 
:<math>Q(s,a) = \sum_{i=1}^d \theta_i \phi_i(s,a).</math>
 
The algorithms then adjust the weights, instead of adjusting the values associated with the individual state-action pairs. Methods based on ideas from [[nonparametric statistics]] (which can be seen to construct their own features) have been explored.
      
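The sketch below illustrates the linear form above together with one generic, temporal-difference-style weight update from a single transition; the feature map <code>phi</code>, the step size, and the use of a maximizing bootstrap target are assumptions made for the illustration rather than a specific algorithm from this article.

<syntaxhighlight lang="python">
def q_value(theta, phi, state, action):
    """Linear approximation: Q(s, a) = sum_i theta_i * phi_i(s, a)."""
    features = phi(state, action)
    return sum(t * f for t, f in zip(theta, features))

def td_update(theta, phi, transition, actions, alpha=0.1, gamma=0.9):
    """One semi-gradient update of the weight vector from a single transition."""
    state, action, reward, next_state = transition
    target = reward + gamma * max(q_value(theta, phi, next_state, a) for a in actions)
    error = target - q_value(theta, phi, state, action)   # temporal-difference error
    features = phi(state, action)
    return [t + alpha * error * f for t, f in zip(theta, features)]
</syntaxhighlight>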
Value iteration can also be used as a starting point, giving rise to the [[Q-learning]] algorithm and its many variants.<ref>{{cite thesis | last = Watkins | first = Christopher J.C.H. | authorlink = Christopher J.C.H. Watkins | degree = PhD }}</ref>
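
For reference, the tabular Q-learning update can be sketched as follows; <code>Q</code> is a dictionary of action-value estimates, and the default value of 0 for unseen pairs is an arbitrary choice made for the sketch, not part of the cited thesis.

<syntaxhighlight lang="python">
def q_learning_update(Q, transition, actions, alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    state, action, reward, next_state = transition
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    Q[(state, action)] = Q.get((state, action), 0.0) + alpha * (
        reward + gamma * best_next - Q.get((state, action), 0.0))
    return Q
</syntaxhighlight>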
=== Direct policy search ===
An alternative method is to search directly in (some subset of) the policy space, in which case the problem becomes a case of [[stochastic optimization]]. The two approaches available are gradient-based and gradient-free methods.
      
[[Gradient]]-based methods (''policy gradient methods'') start with a mapping from a finite-dimensional (parameter) space to the space of policies: given the parameter vector <math>\theta</math>, let <math>\pi_\theta</math> denote the policy associated to <math>\theta</math>. Defining the performance function by
      
under mild conditions this function will be differentiable as a function of the parameter vector <math>\theta</math>. If the gradient of <math>\rho</math> were known, one could use [[gradient descent|gradient ascent]]. Since an analytic expression for the gradient is not available, only a noisy estimate is available. Such an estimate can be constructed in many ways, giving rise to algorithms such as Williams' REINFORCE method<ref>{{cite conference | last = Williams | first = Ronald J. | authorlink = Ronald J. Williams | title = A class of gradient-estimating algorithms for reinforcement learning in neural networks | booktitle = Proceedings of the IEEE First International Conference on Neural Networks | year = 1987 | citeseerx = 10.1.1.129.8871 }}</ref> (which is known as the likelihood ratio method in the [[simulation-based optimization]] literature).<ref>{{cite conference | last1 = Peters | first1 = Jan | authorlink1 = Jan Peters (researcher) | last2 = Vijayakumar | first2 = Sethu | authorlink2 = Sethu Vijayakumar | booktitle = IEEE-RAS International Conference on Humanoid Robots | year = 2003 | url = http://www-clmc.usc.edu/publications/p/peters-ICHR2003.pdf}}</ref> Policy search methods have been used in the [[robotics]] context.<ref>{{Cite book|title = A Survey on Policy Search for Robotics|last1 = Deisenroth|first1 = Marc Peter|last2 = Neumann|first2 = Gerhard|last3 = Peters|first3 = Jan|publisher = NOW Publishers|year = 2013|series = Foundations and Trends in Robotics|volume = 2|issue = 1–2|pages = 1–142 |authorlink1 = Marc Peter Deisenroth|authorlink2 = Gerhard Neumann|authorlink3 = Jan Peters (researcher)|hdl = 10044/1/12051|doi = 10.1561/2300000021|url = http://eprints.lincoln.ac.uk/28029/1/PolicySearchReview.pdf}}</ref> Many policy search methods may get stuck in local optima (as they are based on [[Local search (optimization)|local search]]).
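
A minimal sketch of such a noisy, likelihood-ratio gradient estimate is given below for a softmax policy with tabular logits; the parameterization, the episode format, and the returns-to-go weighting are simplifying assumptions made for illustration and are not taken from the cited papers.

<syntaxhighlight lang="python">
import math

def softmax_policy(theta, state):
    """pi_theta(.|s) from tabular logits theta[state][action]."""
    logits = theta[state]
    z = sum(math.exp(v) for v in logits.values())
    return {a: math.exp(v) / z for a, v in logits.items()}

def reinforce_gradient(theta, episodes, gamma=0.9):
    """Noisy estimate of the policy gradient from sampled episodes.
    Each episode is a list of (state, action, reward) triples obtained by
    following pi_theta; grad[s][a] approximates d rho / d theta[s][a]."""
    grad = {s: {a: 0.0 for a in acts} for s, acts in theta.items()}
    for episode in episodes:
        # Returns-to-go G_t for every step of the episode.
        G, returns = 0.0, []
        for _, _, reward in reversed(episode):
            G = reward + gamma * G
            returns.append(G)
        returns.reverse()
        for (state, action, _), G_t in zip(episode, returns):
            probs = softmax_policy(theta, state)
            for a in probs:  # grad of log pi: indicator(a == a_t) - pi(a|s)
                grad[state][a] += ((1.0 if a == action else 0.0) - probs[a]) * G_t
    n = len(episodes)
    return {s: {a: g / n for a, g in acts.items()} for s, acts in grad.items()}
</syntaxhighlight>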
A large class of methods avoids relying on gradient information. These include [[simulated annealing]], [[cross-entropy method|cross-entropy search]] or methods of [[evolutionary computation]]. Many gradient-free methods can achieve (in theory and in the limit) a global optimum.
 
* addressing the exploration problem in large MDPs
* large-scale empirical evaluations
* learning and acting under [[Partially observable Markov decision process|partial information]] (e.g., using [[predictive state representation]])
* modular and hierarchical reinforcement learning<ref>{{Cite journal|last=Kulkarni|first=Tejas D.|last2=Narasimhan|first2=Karthik R.|last3=Saeedi|first3=Ardavan|last4=Tenenbaum|first4=Joshua B.|date=2016|title=Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation|url=http://dl.acm.org/citation.cfm?id=3157382.3157509|journal=Proceedings of the 30th International Conference on Neural Information Processing Systems|series=NIPS'16|location=USA|publisher=Curran Associates Inc.|pages=3682–3690|isbn=978-1-5108-3881-9|bibcode=2016arXiv160406057K|arxiv=1604.06057}}</ref>
 
* improving existing value-function and policy search methods
* algorithms that work well with large (or continuous) action spaces
* [[transfer learning]]<ref>{{Cite journal|last=George Karimpanal|first=Thommen|last2=Bouffanais|first2=Roland|date=2019|title=Self-organizing maps for storage and transfer of knowledge in reinforcement learning|journal=Adaptive Behavior|language=en|volume=27|issue=2|pages=111–126|doi=10.1177/1059712318818568|issn=1059-7123|arxiv=1811.08318}}</ref>
* lifelong learning
* efficient sample-based planning (e.g., based on [[Monte Carlo tree search]]).
*bug detection in software projects<ref>{{Cite web|url=https://cie.acm.org/articles/use-reinforcements-learning-testing-game-mechanics/|title=On the Use of Reinforcement Learning for Testing Game Mechanics : ACM - Computers in Entertainment|website=cie.acm.org|language=en|access-date=2018-11-27}}</ref>
 
* [[Intrinsic motivation (artificial intelligence)|Intrinsic motivation]] which differentiates information-seeking, curiosity-type behaviours from task-dependent goal-directed behaviours (typically) by introducing a reward function based on maximising novel information<ref name=kaplan2004>Kaplan, F. and Oudeyer, P. (2004). Maximizing learning progress: an internal reward system for development. Embodied artificial intelligence, pages 629–629.</ref><ref name=klyubin2008>Klyubin, A., Polani, D., and Nehaniv, C. (2008). Keep your options open: an information-based driving principle for sensorimotor systems. PloS ONE, 3(12):e4018. https://dx.doi.org/10.1371%2Fjournal.pone.0004018</ref><ref name=barto2013>Barto, A. G. (2013). “Intrinsic motivation and reinforcement learning,” in Intrinsically Motivated Learning in Natural and Artificial Systems (Berlin; Heidelberg: Springer), 17–47</ref>
*Multiagent or distributed reinforcement learning is a topic of interest. Applications are expanding.<ref>{{Cite web|url=http://umichrl.pbworks.com/Successes-of-Reinforcement-Learning/|title=Reinforcement Learning / Successes of Reinforcement Learning|website=umichrl.pbworks.com|access-date=2017-08-06}}</ref>
 
* Actor-critic reinforcement learning
* Reinforcement learning algorithms such as TD learning are under investigation as a model for [[dopamine]]-based learning in the brain. In this model, the [[dopaminergic]] projections from the [[substantia nigra]] to the [[basal ganglia]] function as the prediction error. Reinforcement learning has been used as a part of the model for human skill learning, especially in relation to the interaction between implicit and explicit learning in skill acquisition (the first publication on this application was in 1995–1996).<ref>[http://incompleteideas.net/sutton/RL-FAQ.html#behaviorism]  {{webarchive|url=https://web.archive.org/web/20170426212431/http://incompleteideas.net/sutton/RL-FAQ.html|date=2017-04-26}}</ref>
{| class="wikitable"
! Algorithm !! Description !! Model !! Policy !! Action Space !! State Space !! Operator
|-
| [[Monte Carlo method|Monte Carlo]] || Every visit to Monte Carlo || [[Model-free (reinforcement learning)|Model-Free]] || Either || Discrete || Discrete || Sample-means
|-
| [[Q-learning]] || State–action–reward–state || Model-Free || Off-policy || Discrete || Discrete || Q-value
|-
| [[State–action–reward–state–action|SARSA]] || State–action–reward–state–action || Model-Free || On-policy || Discrete || Discrete || Q-value
|-
| [[Q-learning]] - Lambda || State–action–reward–state with eligibility traces || Model-Free || Off-policy || Discrete || Discrete || Q-value
|-
| [[State–action–reward–state–action|SARSA]] - Lambda || State–action–reward–state–action with eligibility traces || Model-Free || On-policy || Discrete || Discrete || Q-value
|-
| [[Q-learning#Deep Q-learning|DQN]] || Deep Q Network || Model-Free || Off-policy || Discrete || Continuous || Q-value
|-
| DDPG || Deep Deterministic Policy Gradient || Model-Free || Off-policy || Continuous || Continuous || Q-value
|-
| TRPO || Trust Region Policy Optimization || Model-Free || On-policy || Continuous || Continuous || Advantage
|-
| [[Proximal Policy Optimization|PPO]] || Proximal Policy Optimization || Model-Free || On-policy || Continuous || Continuous || Advantage
|-
| TD3
|}
=== Deep reinforcement learning ===
This approach extends reinforcement learning by using a deep neural network and without explicitly designing the state space.<ref name="intro_deep_RL">{{cite journal |first= Vincent|display-authors=etal|last= Francois-Lavet |year=2018 |title= An Introduction to Deep Reinforcement Learning |journal=Foundations and Trends in Machine Learning|volume=11 |issue=3–4 |pages=219–354 |doi=10.1561/2200000071|arxiv= 1811.12560 |bibcode=2018arXiv181112560F}}</ref> The work on learning ATARI games by Google [[DeepMind]] increased attention to [[deep reinforcement learning]] or [[end-to-end reinforcement learning]].<ref name="DQN2">{{cite journal |first= Volodymyr|display-authors=etal|last= Mnih |year=2015 |title= Human-level control through deep reinforcement learning |journal=Nature|volume=518 |issue=7540 |pages=529–533 |doi=10.1038/nature14236|pmid= 25719670 |bibcode=2015Natur.518..529M |url=https://www.semanticscholar.org/paper/e0e9a94c4a6ba219e768b4e59f72c18f0a22e23d}}</ref>
=== Apprenticeship learning ===
In [[apprenticeship learning]], an expert demonstrates the target behavior. The system tries to recover the policy via observation.