===Backpropagation===
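A [https://en.wikipedia.org/wiki/Deep_neural_network deep neural network] (DNN) can be discriminatively trained with the standard backpropagation algorithm. Backpropagation is a method for calculating the [https://en.wikipedia.org/wiki/Gradient gradient] of the [https://en.wikipedia.org/wiki/Loss_function loss function] (which produces the cost associated with a given state) with respect to the weights in an ANN.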
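As an illustration of what "calculating the gradient of the loss with respect to the weights" means, the following is a minimal sketch on a deliberately tiny network (one scalar input, one tanh hidden unit, one linear output, squared-error loss). The network shape, the loss and all function names are assumptions chosen for this example and are not taken from the sources cited here. The chain rule is applied from the output back towards the input, and a finite-difference estimate is included only as a sanity check.

```python
import math


def forward(w1, w2, x):
    """Tiny network: scalar input -> one tanh hidden unit -> linear output."""
    h = math.tanh(w1 * x)
    y = w2 * h
    return h, y


def loss(w1, w2, x, target):
    """Squared-error loss C = 0.5 * (y - target)^2 (an assumed choice)."""
    _, y = forward(w1, w2, x)
    return 0.5 * (y - target) ** 2


def backprop(w1, w2, x, target):
    """Return (dC/dw1, dC/dw2) via the chain rule, output to input."""
    h, y = forward(w1, w2, x)
    dC_dy = y - target                   # derivative of 0.5 * (y - target)^2
    dC_dw2 = dC_dy * h                   # because y = w2 * h
    dC_dh = dC_dy * w2                   # error propagated back to the hidden unit
    dC_dw1 = dC_dh * (1.0 - h * h) * x   # tanh'(a) = 1 - tanh(a)^2 with a = w1 * x
    return dC_dw1, dC_dw2


def numeric_grad(w1, w2, x, target, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    g1 = (loss(w1 + eps, w2, x, target) - loss(w1 - eps, w2, x, target)) / (2 * eps)
    g2 = (loss(w1, w2 + eps, x, target) - loss(w1, w2 - eps, x, target)) / (2 * eps)
    return g1, g2
```

For any choice of weights, <code>backprop</code> and <code>numeric_grad</code> should agree to within numerical error. The analytic version needs a single backward pass regardless of the number of weights, which is the practical appeal of backpropagation.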
The basics of continuous backpropagation<ref name="SCHIDHUB2"/><ref name="scholarpedia2">{{cite journal|year=2015|title=Deep Learning|url=http://www.scholarpedia.org/article/Deep_Learning|journal=Scholarpedia|volume=10|issue=11|page=32832|doi=10.4249/scholarpedia.32832|last1=Schmidhuber|first1=Jürgen|authorlink=Jürgen Schmidhuber|bibcode=2015SchpJ..1032832S}}</ref><ref name=":5">{{Cite journal|last=Dreyfus|first=Stuart E.|date=1990-09-01|title=Artificial neural networks, back propagation, and the Kelley-Bryson gradient procedure|url=http://arc.aiaa.org/doi/10.2514/3.25422|journal=Journal of Guidance, Control, and Dynamics|volume=13|issue=5|pages=926–928|doi=10.2514/3.25422|issn=0731-5090|bibcode=1990JGCD...13..926D}}</ref><ref name="mizutani2000">Eiji Mizutani, [[Stuart Dreyfus]], Kenichi Nishio (2000). On derivation of MLP backpropagation from the Kelley-Bryson optimal-control gradient formula and its application. Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2000), Como Italy, July 2000. [http://queue.ieor.berkeley.edu/People/Faculty/dreyfus-pubs/ijcnn2k.pdf Online]</ref> were derived in the context of [https://en.wikipedia.org/wiki/Control_theory control theory] by [https://en.wikipedia.org/wiki/Henry_J._Kelley Kelley]<ref name="kelley1960">{{cite journal|year=1960|title=Gradient theory of optimal flight paths|url=http://arc.aiaa.org/doi/abs/10.2514/8.5282?journalCode=arsj|journal=Ars Journal|volume=30|issue=10|pages=947–954|doi=10.2514/8.5282|last1=Kelley|first1=Henry J.|authorlink=Henry J. Kelley}}</ref> in 1960 and by [https://en.wikipedia.org/wiki/Arthur_E._Bryson Bryson] in 1961,<ref name="bryson1961">[[Arthur E. Bryson]] (1961, April). A gradient method for optimizing multi-stage allocation processes. In Proceedings of the Harvard Univ. Symposium on digital computers and their applications.</ref> using principles of [https://en.wikipedia.org/wiki/Dynamic_programming dynamic programming]. In 1962, [https://en.wikipedia.org/wiki/Stuart_Dreyfus Dreyfus] published a simpler derivation based only on the [https://en.wikipedia.org/wiki/Chain_rule chain rule].<ref name="dreyfus1962">{{cite journal|year=1962|title=The numerical solution of variational problems|url=https://www.researchgate.net/publication/256244271_The_numerical_solution_of_variational_problems|journal=Journal of Mathematical Analysis and Applications|volume=5|issue=1|pages=30–45|doi=10.1016/0022-247x(62)90004-5|last1=Dreyfus|first1=Stuart|authorlink=Stuart Dreyfus}}</ref> In 1969, Bryson and [https://en.wikipedia.org/wiki/Yu-Chi_Ho Ho] described it as a multi-stage dynamic system optimization method.<ref>{{cite book|url={{google books |plainurl=y |id=8jZBksh-bUMC|page=578}}|title=Artificial Intelligence A Modern Approach|last2=Norvig|first2=Peter|publisher=Prentice Hall|year=2010|isbn=978-0-13-604259-4|page=578|quote=The most popular method for learning in multilayer networks is called Back-propagation.|author-link2=Peter Norvig|first1=Stuart J.|last1=Russell|author-link1=Stuart J. Russell}}</ref><ref name="Bryson1969">{{cite book|url={{google books |plainurl=y |id=1bChDAEACAAJ|page=481}}|title=Applied Optimal Control: Optimization, Estimation and Control|last=Bryson|first=Arthur Earl|publisher=Blaisdell Publishing Company or Xerox College Publishing|year=1969|page=481}}</ref> In 1970, [https://en.wikipedia.org/wiki/Seppo_Linnainmaa Linnainmaa] finally published the general method for [https://en.wikipedia.org/wiki/Automatic_differentiation automatic differentiation] (AD) of discrete connected networks of nested [https://en.wikipedia.org/wiki/Differentiable_function differentiable functions].<ref name="lin1970">[[Seppo Linnainmaa]] (1970). The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, 6–7.</ref><ref name="lin1976">{{cite journal|year=1976|title=Taylor expansion of the accumulated rounding error|url=|journal=BIT Numerical Mathematics|volume=16|issue=2|pages=146–160|doi=10.1007/bf01931367|last1=Linnainmaa|first1=Seppo|authorlink=Seppo Linnainmaa}}</ref> This corresponds to the modern version of backpropagation, which is efficient even when the networks are sparse.<ref name="SCHIDHUB2"/><ref name="scholarpedia2"/><ref name="grie2012">{{Cite journal|last=Griewank|first=Andreas|date=2012|title=Who Invented the Reverse Mode of Differentiation?|url=http://www.math.uiuc.edu/documenta/vol-ismp/52_griewank-andreas-b.pdf|journal=Documenta Mathematica, Extra Volume ISMP|volume=|pages=389–400|via=}}</ref><ref name="grie2008">{{cite book|url={{google books |plainurl=y |id=xoiiLaRxcbEC}}|title=Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, Second Edition|last2=Walther|first2=Andrea|publisher=SIAM|year=2008|isbn=978-0-89871-776-1|first1=Andreas|last1=Griewank}}</ref> In 1973, Dreyfus used backpropagation to adapt the [https://en.wikipedia.org/wiki/Parameter parameters] of controllers in proportion to error gradients.<ref name="dreyfus1973">{{cite journal|year=1973|title=The computational solution of optimal control problems with time lag|url=|journal=IEEE Transactions on Automatic Control|volume=18|issue=4|pages=383–385|doi=10.1109/tac.1973.1100330|last1=Dreyfus|first1=Stuart|authorlink=Stuart Dreyfus}}</ref> In 1974, [https://en.wikipedia.org/wiki/Paul_Werbos Werbos] mentioned the possibility of applying this principle to ANNs,<ref name="werbos1974">[[Paul Werbos]] (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University.</ref> and in 1982 he applied Linnainmaa's AD method to neural networks in the way that is widely used today.<ref name="scholarpedia2"/><ref name="werbos1982">{{Cite book|url=http://werbos.com/Neural/SensitivityIFIPSeptember1981.pdf|title=System modeling and optimization|last=Werbos|first=Paul|authorlink=Paul Werbos|publisher=Springer|year=1982|isbn=|location=|pages=762–770|chapter=Applications of advances in nonlinear sensitivity analysis}}</ref> In 1986, [https://en.wikipedia.org/wiki/David_E._Rumelhart Rumelhart], Hinton and [https://en.wikipedia.org/wiki/Ronald_J._Williams Williams] noted that this method can generate useful internal representations of incoming data in the hidden layers of neural networks.<ref name=":4">{{Cite journal|last=Rumelhart|first=David E.|last2=Hinton|first2=Geoffrey E.|last3=Williams|first3=Ronald J.|title=Learning representations by back-propagating errors|url=http://www.nature.com/articles/Art323533a0|journal=Nature|volume=323|issue=6088|pages=533–536|doi=10.1038/323533a0|year=1986|bibcode=1986Natur.323..533R}}</ref> In 1993, Wan was the first<ref name="SCHIDHUB2"/> to win an international pattern recognition contest through backpropagation.<ref name="wan1993">Eric A. Wan (1993). "Time series prediction by using a connectionist network with internal delay lines." In ''Proceedings of the Santa Fe Institute Studies in the Sciences of Complexity'', '''15''': p. 195. Addison-Wesley Publishing Co.</ref>
The weight updates of backpropagation can be done via [https://en.wikipedia.org/wiki/Stochastic_gradient_descent stochastic gradient descent] using the following equation:
: <math> w_{ij}(t + 1) = w_{ij}(t) - \eta\frac{\partial C}{\partial w_{ij}} +\xi(t) </math>
where <math> \eta </math> is the learning rate, <math> {C} </math> is the cost (loss) function and <math>\xi(t)</math> is a stochastic term. The choice of the cost function depends on factors such as the learning type (supervised, unsupervised, reinforcement, etc.) and the [https://en.wikipedia.org/wiki/Activation_function activation function]. For example, when performing supervised learning on a [https://en.wikipedia.org/wiki/Multiclass_classification multiclass classification] problem, common choices for the activation function and cost function are the [https://en.wikipedia.org/wiki/Softmax_activation_function softmax] function and the [https://en.wikipedia.org/wiki/Cross_entropy cross entropy] function, respectively. The softmax function is defined as <math> p_j = \frac{\exp(x_j)}{\sum_k \exp(x_k)} </math> where <math> p_j </math> represents the class probability (the output of unit <math> {j} </math>), and <math> x_j </math> and <math> x_k </math> represent the total input to units <math> {j} </math> and <math> k </math> of the same level, respectively. Cross entropy is defined as <math> {C} = -\sum_j d_j \log(p_j) </math> where <math> d_j </math> represents the target probability for output unit <math> {j} </math> and <math> p_j </math> is the probability output for <math> {j} </math> after applying the activation function.<ref>{{Cite journal|last=Hinton|first=G.|last2=Deng|first2=L.|last3=Yu|first3=D.|last4=Dahl|first4=G. E.|last5=Mohamed|first5=A. r|last6=Jaitly|first6=N.|last7=Senior|first7=A.|last8=Vanhoucke|first8=V.|last9=Nguyen|first9=P.|date=November 2012|title=Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups|url=http://ieeexplore.ieee.org/document/6296526/|journal=IEEE Signal Processing Magazine|volume=29|issue=6|pages=82–97|doi=10.1109/msp.2012.2205597|issn=1053-5888|bibcode=2012ISPM...29...82H}}</ref>
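The softmax and cross-entropy definitions, together with a gradient step that moves against <math>\partial C / \partial w</math> so that <math>C</math> decreases, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not a reference implementation: the function names are hypothetical, the stochastic term <math>\xi(t)</math> is modelled as optional zero-mean Gaussian noise (the text does not specify its distribution), and for simplicity the trainable quantities are the unit inputs <math>x_j</math> themselves. It uses the standard identity that, for softmax outputs with cross-entropy loss and targets summing to one, <math>\partial C / \partial x_j = p_j - d_j</math>.

```python
import math
import random


def softmax(x):
    """p_j = exp(x_j) / sum_k exp(x_k), computed in a numerically stable way."""
    m = max(x)  # subtracting the maximum leaves the result unchanged
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(d, p):
    """C = -sum_j d_j * log(p_j); terms with d_j == 0 contribute nothing."""
    return -sum(dj * math.log(pj) for dj, pj in zip(d, p) if dj > 0)


def sgd_step(w, grad, eta, noise_std=0.0, rng=random):
    """One update w(t+1) = w(t) - eta * dC/dw + xi(t).

    xi(t) is modelled here as Gaussian noise with standard deviation
    noise_std; this is an assumption made for the illustration only.
    """
    return [
        wi - eta * gi + (rng.gauss(0.0, noise_std) if noise_std > 0 else 0.0)
        for wi, gi in zip(w, grad)
    ]


def demo(steps=100, eta=0.5):
    """Fit three unit inputs to a one-hot target; return (initial, final) loss."""
    x = [0.0, 0.0, 0.0]
    d = [0.0, 1.0, 0.0]
    initial = cross_entropy(d, softmax(x))
    for _ in range(steps):
        p = softmax(x)
        grad = [pj - dj for pj, dj in zip(p, d)]  # dC/dx_j = p_j - d_j
        x = sgd_step(x, grad, eta)
    return initial, cross_entropy(d, softmax(x))
```

Calling <code>demo()</code> returns the loss before and after training; with the noise term switched off the run is deterministic and the final loss is lower than the initial one. In a full network the same gradient <math>p_j - d_j</math> would be propagated further back to the weights by the chain rule.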
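Such networks can be used to output object [https://en.wikipedia.org/wiki/Minimum_bounding_box bounding boxes] in the form of a binary mask. They are also used for multi-scale regression to increase localization precision. DNN-based regression can learn features that capture geometric information, in addition to serving as a good classifier. This removes the need to explicitly model parts and their relations, which helps to broaden the variety of objects that can be learned. The model consists of multiple layers, each of which has a [https://en.wikipedia.org/wiki/Rectified_linear_unit rectified linear unit] as its activation function for non-linear transformation. Some layers are convolutional, while others are fully connected. Every convolutional layer has an additional max pooling. The network is trained to [https://en.wikipedia.org/wiki/Minimum_mean_square_error minimize] the [https://en.wikipedia.org/wiki/L2_norm ''L''<sup>2</sup> error] of the predicted mask over the entire training set, in which the bounding boxes are represented as masks.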
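To make the training target concrete, the sketch below turns a bounding box into a binary mask and measures the squared (''L''<sup>2</sup>) error between a predicted mask and that target. It is only an illustration: the box convention <code>(top, left, bottom, right)</code> with exclusive bottom and right edges, the mask resolution and the function names are all assumptions, since the text does not specify how the masks are encoded.

```python
def box_to_mask(height, width, box):
    """Rasterise a bounding box into a binary mask (list of rows).

    box = (top, left, bottom, right); bottom and right are exclusive.
    This convention is assumed for the example, not taken from the source.
    """
    top, left, bottom, right = box
    return [
        [1.0 if top <= r < bottom and left <= c < right else 0.0
         for c in range(width)]
        for r in range(height)
    ]


def l2_error(pred, target):
    """Sum of squared differences between two masks of equal shape."""
    return sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )


def dataset_l2_error(pred_masks, target_masks):
    """Total L2 error over a set of (prediction, target) mask pairs."""
    return sum(l2_error(p, t) for p, t in zip(pred_masks, target_masks))
```

A network output that matches the target mask exactly gives an error of zero, and each fully wrong pixel adds one to the sum. In training, this quantity, accumulated over the training set, would be the cost <math>C</math> that backpropagation differentiates.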
Alternatives to backpropagation include [https://en.wikipedia.org/wiki/Extreme_Learning_Machines extreme learning machines],<ref>{{cite journal|last2=Zhu|first2=Qin-Yu|last3=Siew|first3=Chee-Kheong|year=2006|title=Extreme learning machine: theory and applications|url=|journal=Neurocomputing|volume=70|issue=1|pages=489–501|doi=10.1016/j.neucom.2005.12.126|last1=Huang|first1=Guang-Bin}}</ref> "no-prop" networks,<ref>{{cite journal|year=2013|title=The no-prop algorithm: A new learning algorithm for multilayer neural networks|url=|journal=Neural Networks|volume=37|issue=|pages=182–188|doi=10.1016/j.neunet.2012.09.020|last1=Widrow|first1=Bernard|display-authors=etal}}</ref> training without backtracking,<ref>{{cite arXiv|eprint=1507.07680|first=Yann|last=Ollivier|first2=Guillaume|last2=Charpiat|title=Training recurrent networks without backtracking|year=2015|class=cs.NE}}</ref> "weightless" networks,<ref>ESANN. 2009</ref><ref name="RBMTRAIN">{{Cite journal|last=Hinton|first=G. E.|date=2010|title=A Practical Guide to Training Restricted Boltzmann Machines|url=https://www.researchgate.net/publication/221166159_A_brief_introduction_to_Weightless_Neural_Systems|journal=Tech. Rep. UTML TR 2010-003,|volume=|pages=|via=}}</ref> and [https://en.wikipedia.org/wiki/Holographic_associative_memory non-connectionist neural networks].
===Learning paradigms===