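| More formally, the environment is modeled as a 【Markov decision process】 (MDP) with states <math>\textstyle {s_1,...,s_n}\in S </math> and actions <math>\textstyle {a_1,...,a_m} \in A</math>, together with the following probability distributions: the instantaneous cost distribution <math>\textstyle P(c_t|s_t)</math>, the observation distribution <math>\textstyle P(x_t|s_t)</math>, and the transition distribution <math>\textstyle P(s_{t+1}|s_t, a_t)</math>. A policy is defined as the conditional distribution over actions given the observations. Taken together, the MDP and the policy define a 【Markov chain】 (MC). The aim is to find the policy (that is, the MC) that minimizes the cost. | | More formally, the environment is modeled as a 【Markov decision process】 (MDP) with states <math>\textstyle {s_1,...,s_n}\in S </math> and actions <math>\textstyle {a_1,...,a_m} \in A</math>, together with the following probability distributions: the instantaneous cost distribution <math>\textstyle P(c_t|s_t)</math>, the observation distribution <math>\textstyle P(x_t|s_t)</math>, and the transition distribution <math>\textstyle P(s_{t+1}|s_t, a_t)</math>. A policy is defined as the conditional distribution over actions given the observations. Taken together, the MDP and the policy define a 【Markov chain】 (MC). The aim is to find the policy (that is, the MC) that minimizes the cost. |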
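The claim that an MDP and a policy together define a Markov chain can be made concrete with a small sketch. Everything below is an illustrative assumption rather than part of the article: the two-state, two-action, two-observation sizes, the probability tables, the expected costs, and the helper names `induced_chain`, `average_cost`, and `best_deterministic_policy` are all hypothetical. The sketch marginalizes over observations and actions to obtain the state-to-state chain, then searches the four deterministic observation-to-action policies for the one with the lowest long-run average cost.

```python
from itertools import product

# All numbers below are made-up toy values (assumptions, not from the article).
# Transition model P(s_next | s, a), indexed as TRANSITION[s][a][s_next].
TRANSITION = [
    [[0.9, 0.1], [0.2, 0.8]],  # from state 0 under action 0 / action 1
    [[0.7, 0.3], [0.1, 0.9]],  # from state 1 under action 0 / action 1
]
# Observation model P(x | s), indexed as OBSERVATION[s][x].
OBSERVATION = [
    [0.8, 0.2],
    [0.3, 0.7],
]
# Expected instantaneous cost E[c | s] for each state.
EXPECTED_COST = [1.0, 0.0]


def induced_chain(policy):
    """Return T[s][s_next] = sum_x P(x|s) * sum_a policy[x][a] * P(s_next|s,a)."""
    n_states = len(TRANSITION)
    chain = [[0.0] * n_states for _ in range(n_states)]
    for s in range(n_states):
        for x, p_obs in enumerate(OBSERVATION[s]):
            for a, p_act in enumerate(policy[x]):
                for s_next in range(n_states):
                    chain[s][s_next] += p_obs * p_act * TRANSITION[s][a][s_next]
    return chain


def average_cost(policy, steps=2000):
    """Approximate the long-run expected cost per step under the induced chain."""
    chain = induced_chain(policy)
    n = len(chain)
    dist = [1.0 / n] * n  # start from the uniform state distribution
    for _ in range(steps):
        # One step of the chain: new_dist[t] = sum_s dist[s] * T[s][t].
        dist = [sum(dist[s] * chain[s][t] for s in range(n)) for t in range(n)]
    return sum(d * c for d, c in zip(dist, EXPECTED_COST))


def best_deterministic_policy():
    """Enumerate all observation -> action maps and keep the cheapest one."""
    n_obs = len(OBSERVATION[0])
    n_actions = len(TRANSITION[0])
    best = None
    for choice in product(range(n_actions), repeat=n_obs):
        # Encode the deterministic map as a conditional distribution policy[x][a].
        policy = [
            [1.0 if a == choice[x] else 0.0 for a in range(n_actions)]
            for x in range(n_obs)
        ]
        cost = average_cost(policy)
        if best is None or cost < best[1]:
            best = (choice, cost)
    return best


if __name__ == "__main__":
    choice, cost = best_deterministic_policy()
    print("best observation -> action map:", choice)
    print("long-run average cost:", round(cost, 4))
```

Exhaustive enumeration is only workable for toy problems like this one; it is used here because it keeps the link between policy, induced chain, and cost visible, not because it is how such problems are solved in practice.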
− | ANNs are frequently used in reinforcement learning as part of the overall algorithm.<ref>{{cite conference| author = Dominic, S. |author2=Das, R. |author3=Whitley, D. |author4=Anderson, C. |date=July 1991 | title = Genetic reinforcement learning for neural networks | conference = IJCNN-91-Seattle International Joint Conference on Neural Networks | booktitle = IJCNN-91-Seattle International Joint Conference on Neural Networks | publisher = IEEE | location = Seattle, Washington, USA | doi = 10.1109/IJCNN.1991.155315 | accessdate = | isbn = 0-7803-0164-1 }}</ref><ref>{{cite journal |last=Hoskins |first=J.C. |author2=Himmelblau, D.M. |title=Process control via artificial neural networks and reinforcement learning |journal=Computers & Chemical Engineering |year=1992 |volume=16 |pages=241–251 |doi=10.1016/0098-1354(92)80045-B |issue=4}}</ref> [[Dynamic programming]] was coupled with ANNs (giving neurodynamic programming) by [[Dimitri Bertsekas|Bertsekas]] and Tsitsiklis<ref>{{cite book|url=https://papers.nips.cc/paper/4741-deep-neural-networks-segment-neuronal-membranes-in-electron-microscopy-images|title=Neuro-dynamic programming|first=D.P.|first2=J.N.|publisher=Athena Scientific|year=1996|isbn=1-886529-10-8|location=|page=512|pages=|author=Bertsekas|author2=Tsitsiklis}}</ref> and applied to multi-dimensional nonlinear problems such as those involved in [[vehicle routing]],<ref>{{cite journal |last=Secomandi |first=Nicola |title=Comparing neuro-dynamic programming algorithms for the vehicle routing problem with stochastic demands |journal=Computers & Operations Research |year=2000 |volume=27 |pages=1201–1225 |doi=10.1016/S0305-0548(99)00146-X |issue=11–12}}</ref> [[natural resource management|natural resources management]]<ref>{{cite conference| author = de Rigo, D. |author2=Rizzoli, A. E. |author3=Soncini-Sessa, R. |author4=Weber, E. |author5=Zenesi, P. 
| year = 2001 | title = Neuro-dynamic programming for the efficient management of reservoir networks | conference = MODSIM 2001, International Congress on Modelling and Simulation | conferenceurl = http://www.mssanz.org.au/MODSIM01/MODSIM01.htm | booktitle = Proceedings of MODSIM 2001, International Congress on Modelling and Simulation | publisher = Modelling and Simulation Society of Australia and New Zealand | location = Canberra, Australia | doi = 10.5281/zenodo.7481 | url = https://zenodo.org/record/7482/files/de_Rigo_etal_MODSIM2001_activelink_authorcopy.pdf | accessdate = 29 July 2012 | isbn = 0-867405252 }}</ref><ref>{{cite conference| author = Damas, M. |author2=Salmeron, M. |author3=Diaz, A. |author4=Ortega, J. |author5=Prieto, A. |author6=Olivares, G.| year = 2000 | title = Genetic algorithms and neuro-dynamic programming: application to water supply networks | conference = 2000 Congress on Evolutionary Computation | booktitle = Proceedings of 2000 Congress on Evolutionary Computation | publisher = IEEE | location = La Jolla, California, USA | doi = 10.1109/CEC.2000.870269 | accessdate = | isbn = 0-7803-6375-2 }}</ref> or [[medicine]]<ref>{{cite journal |last=Deng |first=Geng |author2=Ferris, M.C. |title=Neuro-dynamic programming for fractionated radiotherapy planning |journal=Springer Optimization and Its Applications |year=2008 |volume=12 |pages=47–70 |doi=10.1007/978-0-387-73299-2_3|citeseerx=10.1.1.137.8288 |series=Springer Optimization and Its Applications |isbn=978-0-387-73298-5 }}</ref> because of the ability of ANNs to mitigate losses of accuracy even when reducing the discretization grid density for numerically approximating the solution of the original control problems.
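+ | In reinforcement learning, ANNs are frequently used as part of the overall algorithm. 【Bertsekas】 and 【Tsitsiklis】 coupled 【dynamic programming】 with ANNs (giving neurodynamic programming) and applied it to multi-dimensional nonlinear problems such as those arising in 【vehicle routing】, 【natural resource management】, and 【medicine】, because ANNs can mitigate losses of accuracy even when the discretization grid density is reduced for numerically approximating the solution of the original control problem. |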