=== Convolutional networks ===
As of 2011, the state of the art in deep learning feedforward networks alternated between convolutional layers and max-pooling layers,<ref name=":6" /><ref name="martines2013">{{cite journal|last2=Bengio|first2=Y.|last3=Yannakakis|first3=G. N.|year=2013|title=Learning Deep Physiological Models of Affect|journal=IEEE Computational Intelligence|volume=8|issue=2|pages=20–33|doi=10.1109/mci.2013.2247823|last1=Martines|first1=H.}}</ref> topped by several fully or sparsely connected layers followed by a final classification layer. Learning is usually done without unsupervised pre-training.
Such supervised deep learning methods were the first to achieve human-competitive performance on certain tasks.<ref name=":92"/>
ANNs were able to guarantee shift invariance to deal with small and large natural objects in large cluttered scenes, only when invariance extended beyond shift, to all ANN-learned concepts, such as location, type (object class label), scale, lighting and others. This was realized in Developmental Networks (DNs)<ref name="Weng2011">J. Weng, "[http://www.cse.msu.edu/~weng/research/WhyPass-Weng-NI-2011.pdf Why Have We Passed 'Neural Networks Do not Abstract Well'?]," ''Natural Intelligence: the INNS Magazine'', vol. 1, no. 1, pp. 13–22, 2011.</ref> whose embodiments are Where-What Networks, WWN-1 (2008)<ref name="Weng08">Z. Ji, J. Weng, and D. Prokhorov, "[http://www.cse.msu.edu/~weng/research/ICDL08_0077.pdf Where-What Network 1: Where and What Assist Each Other Through Top-down Connections]," ''Proc. 7th International Conference on Development and Learning (ICDL'08)'', Monterey, CA, Aug. 9–12, pp. 1–6, 2008.</ref> through WWN-7 (2013).<ref name="Weng13">X. Wu, G. Guo, and J. Weng, "[http://www.cse.msu.edu/~weng/research/WWN7-Wu-ICBM-2013.pdf Skull-closed Autonomous Development: WWN-7 Dealing with Scales]," ''Proc. International Conference on Brain-Mind'', July 27–28, East Lansing, Michigan, pp. 1–9, 2013.</ref>
== Models ==
An ''artificial neural network'' is a network of simple elements called ''[[artificial neurons]]'', which receive input, change their internal state (''activation'') according to that input, and produce output depending on the input and activation. The ''network'' forms by connecting the output of certain neurons to the input of other neurons, forming a [[Directed graph|directed]], [[weighted graph]]. The weights as well as the [[Activation function|functions that compute the activation]] can be modified by a process called ''learning'', which is governed by a ''[[learning rule]]''.<ref name=Zell1994ch5.2>{{cite book |last=Zell |first=Andreas |year=1994 |title=Simulation Neuronaler Netze |trans-title=Simulation of Neural Networks |language=German |edition=1st |publisher=Addison-Wesley |chapter=chapter 5.2 |isbn=3-89319-554-8}}</ref>
=== Components of an artificial neural network ===
==== Propagation function ====
The ''propagation function'' computes the ''input'' <math>p_j(t)</math> to the neuron <math>j</math> from the outputs <math>o_i(t)</math> of predecessor neurons and typically has the form<ref name=Zell1994ch5.2 />

: <math> p_j(t) = \sum_{i} o_i(t) w_{ij} </math>
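As a concrete illustration, the following minimal Python sketch (using NumPy; the specific numbers are invented for the example) computes this weighted sum for a single neuron <math>j</math> with three predecessors:

<syntaxhighlight lang="python">
import numpy as np

# Outputs o_i(t) of three predecessor neurons at time t (example values).
o = np.array([0.5, -1.2, 0.8])
# Weights w_ij from each predecessor i to neuron j (example values).
w_j = np.array([0.4, 0.1, -0.7])

# Propagation function: p_j(t) = sum_i o_i(t) * w_ij
p_j = o @ w_j
print(p_j)  # 0.5*0.4 + (-1.2)*0.1 + 0.8*(-0.7) = -0.48
</syntaxhighlight>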
==== Learning rule ====
The ''learning rule'' is a rule or an algorithm which modifies the parameters of the neural network, in order for a given input to the network to produce a favored output. This ''learning'' process typically amounts to modifying the weights and thresholds of the variables within the network.<ref name=Zell1994ch5.2 />
=== Neural networks as functions ===
Neural network models can be viewed as simple mathematical models defining a function <math>\textstyle f : X \rightarrow Y </math> or a distribution over <math>\textstyle X</math> or both <math>\textstyle X</math> and <math>\textstyle Y</math>. Sometimes models are intimately associated with a particular learning rule. A common use of the phrase "ANN model" is really the definition of a ''class'' of such functions (where members of the class are obtained by varying parameters, connection weights, or specifics of the architecture such as the number of neurons or their connectivity).
Mathematically, a neuron's network function <math>\textstyle f(x)</math> is defined as a composition of other functions <math>\textstyle g_i(x)</math>, which can further be decomposed into other functions. This can be conveniently represented as a network structure, with arrows depicting the dependencies between functions. A widely used type of composition is the ''nonlinear weighted sum'', where <math>\textstyle f (x) = K \left(\sum_i w_i g_i(x)\right) </math>, where <math>\textstyle K</math> (commonly referred to as the [[activation function]]<ref>{{cite web|url=http://www.cse.unsw.edu.au/~billw/mldict.html#activnfn|title=The Machine Learning Dictionary}}</ref>) is some predefined function, such as the [[hyperbolic function#Standard analytic expressions|hyperbolic tangent]], [[sigmoid function]], [[softmax function]] or [[ReLU|rectifier function]]. The important characteristic of the activation function is that it provides a smooth transition as input values change, i.e. a small change in input produces a small change in output. The following refers to a collection of functions <math>\textstyle g_i</math> as a [[Vector (mathematics and physics)|vector]] <math>\textstyle g = (g_1, g_2, \ldots, g_n)</math>.
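The nonlinear weighted sum can be written directly as code. This is a sketch only, with <math>\textstyle K = \tanh</math> and the component functions <math>\textstyle g_i</math> chosen arbitrarily for illustration:

<syntaxhighlight lang="python">
import numpy as np

def f(x, w, g, K=np.tanh):
    """Nonlinear weighted sum: f(x) = K(sum_i w_i * g_i(x))."""
    return K(sum(w_i * g_i(x) for w_i, g_i in zip(w, g)))

g = [np.sin, np.cos]   # the collection g = (g_1, g_2), chosen arbitrarily
w = [0.6, -0.3]        # connection weights
print(f(1.0, w, g))    # tanh(0.6*sin(1) - 0.3*cos(1))
</syntaxhighlight>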
[[File:Ann dependency (graph).svg|thumb|150px|ANN dependency graph]]

This figure depicts such a decomposition of <math>\textstyle f</math>, with dependencies between variables indicated by arrows. These can be interpreted in two ways.

The first view is the functional view: the input <math>\textstyle x</math> is transformed into a 3-dimensional vector <math>\textstyle h</math>, which is then transformed into a 2-dimensional vector <math>\textstyle g</math>, which is finally transformed into <math>\textstyle f</math>. This view is most commonly encountered in the context of [[Mathematical optimization|optimization]].

The second view is the probabilistic view: the [[random variable]] <math>\textstyle F = f(G) </math> depends upon the random variable <math>\textstyle G = g(H)</math>, which depends upon <math>\textstyle H=h(X)</math>, which depends upon the random variable <math>\textstyle X</math>. This view is most commonly encountered in the context of [[graphical models]].

The two views are largely equivalent. In either case, for this particular architecture, the components of individual layers are independent of each other (e.g., the components of <math>\textstyle g</math> are independent of each other given their input <math>\textstyle h</math>). This naturally enables a degree of parallelism in the implementation.

[[File:Recurrent ann dependency graph.png|thumb|120px|Two separate depictions of the recurrent ANN dependency graph]]
Networks such as the previous one are commonly called [[feedforward neural network|feedforward]], because their graph is a [[directed acyclic graph]]. Networks with [[Cycle (graph theory)|cycles]] are commonly called [[Recurrent neural network|recurrent]]. Such networks are commonly depicted in the manner shown at the top of the figure, where <math>\textstyle f</math> is shown as being dependent upon itself. However, an implied temporal dependence is not shown.
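The implied temporal dependence becomes explicit when the cycle is unrolled over time, as in this sketch (the weight matrices, state size and inputs are arbitrary placeholders, not part of any standard formulation):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
W = 0.5 * rng.normal(size=(3, 3))   # recurrent weights (state -> state)
U = 0.5 * rng.normal(size=(3, 2))   # input weights (input -> state)

h = np.zeros(3)                     # initial state
for x_t in [np.array([1.0, 0.0]), np.array([0.0, 1.0])]:
    # The cycle "f depends on itself" unrolls into one update per time step.
    h = np.tanh(W @ h + U @ x_t)
print(h)
</syntaxhighlight>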
=== Learning ===
{{See also|Mathematical optimization|Estimation theory|Machine learning}}
The possibility of learning has attracted the most interest in neural networks. Given a specific ''task'' to solve, and a class of functions <math>\textstyle F</math>, learning means using a set of observations to find <math>\textstyle f^{*} \in F</math> which solves the task in some optimal sense.
This entails defining a cost function <math>\textstyle C : F \rightarrow \mathbb{R}</math> such that, for the optimal solution <math>\textstyle f^*</math>, <math>\textstyle C(f^*) \leq C(f)</math> <math>\textstyle \forall f \in F</math>{{snd}} i.e., no solution has a cost less than the cost of the optimal solution (see [[mathematical optimization]]).

The cost function <math>\textstyle C</math> is an important concept in learning, as it is a measure of how far away a particular solution is from an optimal solution to the problem to be solved. Learning algorithms search through the solution space to find a function that has the smallest possible cost.

For applications where the solution is data dependent, the cost must necessarily be a function of the observations, otherwise the model would not relate to the data. It is frequently defined as a [[statistic]] to which only approximations can be made. As a simple example, consider the problem of finding the model <math>\textstyle f</math> which minimizes <math>\textstyle C=E\left[(f(x) - y)^2\right]</math> for data pairs <math>\textstyle (x,y)</math> drawn from some distribution <math>\textstyle \mathcal{D}</math>. In practical situations we would only have <math>\textstyle N</math> samples from <math>\textstyle \mathcal{D}</math> and thus, for the above example, we would only minimize <math>\textstyle \hat{C}=\frac{1}{N}\sum_{i=1}^N (f(x_i)-y_i)^2</math>. Thus, the cost is minimized over a sample of the data rather than the entire distribution.
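The gap between the true cost <math>\textstyle C</math> and its sample estimate <math>\textstyle \hat{C}</math> can be made concrete with a short sketch; the distribution <math>\textstyle \mathcal{D}</math> below is an invented example:

<syntaxhighlight lang="python">
import numpy as np

def empirical_cost(f, xs, ys):
    """C_hat = (1/N) * sum_i (f(x_i) - y_i)^2 over the N observed pairs."""
    return np.mean((f(xs) - ys) ** 2)

# N samples from an assumed distribution D: y = 2x plus Gaussian noise.
rng = np.random.default_rng(0)
xs = rng.uniform(-1, 1, size=1000)
ys = 2 * xs + rng.normal(scale=0.1, size=1000)

print(empirical_cost(lambda x: 2 * x, xs, ys))  # near the noise variance 0.01
print(empirical_cost(lambda x: 0 * x, xs, ys))  # a clearly worse model
</syntaxhighlight>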
When <math>\textstyle N \rightarrow \infty</math>, some form of [[online machine learning]] must be used, where the cost is reduced as each new example is seen. While online machine learning is often used when <math>\textstyle \mathcal{D}</math> is fixed, it is most useful in the case where the distribution changes slowly over time. In neural network methods, some form of online machine learning is frequently used for finite datasets.

==== Choosing a cost function ====
While it is possible to define an [[ad hoc]] cost function, frequently a particular cost (function) is used, either because it has desirable properties (such as [[Convex function|convexity]]) or because it arises naturally from a particular formulation of the problem (e.g., in a probabilistic formulation the [[posterior probability]] of the model can be used as an inverse cost). Ultimately, the cost function depends on the task.
==== Backpropagation ====
{{Main|Backpropagation}}
A [[Deep neural network|DNN]] can be [[Discriminative model|discriminatively]] trained with the standard backpropagation algorithm. Backpropagation is a method to calculate the [[gradient]] of the [[loss function]] (which produces the cost associated with a given state) with respect to the weights in an ANN.
The basics of continuous backpropagation<ref name="SCHIDHUB2"/><ref name="scholarpedia2">{{cite journal|year=2015|title=Deep Learning|url=http://www.scholarpedia.org/article/Deep_Learning|journal=Scholarpedia|volume=10|issue=11|page=32832|doi=10.4249/scholarpedia.32832|last1=Schmidhuber|first1=Jürgen|authorlink=Jürgen Schmidhuber|bibcode=2015SchpJ..1032832S}}</ref><ref name=":5">{{Cite journal|last=Dreyfus|first=Stuart E.|date=1990-09-01|title=Artificial neural networks, back propagation, and the Kelley-Bryson gradient procedure|url=http://arc.aiaa.org/doi/10.2514/3.25422|journal=Journal of Guidance, Control, and Dynamics|volume=13|issue=5|pages=926–928|doi=10.2514/3.25422|issn=0731-5090|bibcode=1990JGCD...13..926D}}</ref><ref name="mizutani2000">Eiji Mizutani, [[Stuart Dreyfus]], Kenichi Nishio (2000). On derivation of MLP backpropagation from the Kelley-Bryson optimal-control gradient formula and its application. Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN 2000), Como Italy, July 2000. [http://queue.ieor.berkeley.edu/People/Faculty/dreyfus-pubs/ijcnn2k.pdf Online]</ref> were derived in the context of [[control theory]] by [[Henry J. Kelley|Kelley]]<ref name="kelley1960">{{cite journal|year=1960|title=Gradient theory of optimal flight paths|url=http://arc.aiaa.org/doi/abs/10.2514/8.5282?journalCode=arsj|journal=Ars Journal|volume=30|issue=10|pages=947–954|doi=10.2514/8.5282|last1=Kelley|first1=Henry J.|authorlink=Henry J. Kelley}}</ref> in 1960 and by [[Arthur E. Bryson|Bryson]] in 1961,<ref name="bryson1961">[[Arthur E. Bryson]] (1961, April). A gradient method for optimizing multi-stage allocation processes. In Proceedings of the Harvard Univ. Symposium on digital computers and their applications.</ref> using principles of [[dynamic programming]]. In 1962, [[Stuart Dreyfus|Dreyfus]] published a simpler derivation based only on the [[chain rule]].<ref name="dreyfus1962">{{cite journal|year=1962|title=The numerical solution of variational problems|url=https://www.researchgate.net/publication/256244271_The_numerical_solution_of_variational_problems|journal=Journal of Mathematical Analysis and Applications|volume=5|issue=1|pages=30–45|doi=10.1016/0022-247x(62)90004-5|last1=Dreyfus|first1=Stuart|authorlink=Stuart Dreyfus}}</ref> Bryson and [[Yu-Chi Ho|Ho]] described it as a multi-stage dynamic system optimization method in 1969.<ref>{{cite book|url={{google books |plainurl=y |id=8jZBksh-bUMC|page=578}}|title=Artificial Intelligence A Modern Approach|last2=Norvig|first2=Peter|publisher=Prentice Hall|year=2010|isbn=978-0-13-604259-4|page=578|quote=The most popular method for learning in multilayer networks is called Back-propagation.|author-link2=Peter Norvig|first1=Stuart J.|last1=Russell|author-link1=Stuart J. Russell}}</ref><ref name="Bryson1969">{{cite book|url={{google books |plainurl=y |id=1bChDAEACAAJ|page=481}}|title=Applied Optimal Control: Optimization, Estimation and Control|last=Bryson|first=Arthur Earl|publisher=Blaisdell Publishing Company or Xerox College Publishing|year=1969|page=481}}</ref> In 1970, [[Seppo Linnainmaa|Linnainmaa]] finally published the general method for [[automatic differentiation]] (AD) of discrete connected networks of nested [[Differentiable function|differentiable]] functions.<ref name="lin1970">[[Seppo Linnainmaa]] (1970). The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, 6–7.</ref><ref name="lin1976">{{cite journal|year=1976|title=Taylor expansion of the accumulated rounding error|journal=BIT Numerical Mathematics|volume=16|issue=2|pages=146–160|doi=10.1007/bf01931367|last1=Linnainmaa|first1=Seppo|authorlink=Seppo Linnainmaa}}</ref> This corresponds to the modern version of backpropagation which is efficient even when the networks are sparse.<ref name="SCHIDHUB2"/><ref name="scholarpedia2"/><ref name="grie2012">{{Cite journal|last=Griewank|first=Andreas|date=2012|title=Who Invented the Reverse Mode of Differentiation?|url=http://www.math.uiuc.edu/documenta/vol-ismp/52_griewank-andreas-b.pdf|journal=Documenta Matematica, Extra Volume ISMP|pages=389–400}}</ref><ref name="grie2008">{{cite book|url={{google books |plainurl=y |id=xoiiLaRxcbEC}}|title=Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, Second Edition|last2=Walther|first2=Andrea|publisher=SIAM|year=2008|isbn=978-0-89871-776-1|first1=Andreas|last1=Griewank}}</ref> In 1973, Dreyfus used backpropagation to adapt [[parameter]]s of controllers in proportion to error gradients.<ref name="dreyfus1973">{{cite journal|year=1973|title=The computational solution of optimal control problems with time lag|journal=IEEE Transactions on Automatic Control|volume=18|issue=4|pages=383–385|doi=10.1109/tac.1973.1100330|last1=Dreyfus|first1=Stuart|authorlink=Stuart Dreyfus}}</ref> In 1974, [[Paul Werbos|Werbos]] mentioned the possibility of applying this principle to ANNs,<ref name="werbos1974">[[Paul Werbos]] (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University.</ref> and in 1982, he applied Linnainmaa's AD method to neural networks in the way that is widely used today.<ref name="scholarpedia2"/><ref name="werbos1982">{{Cite book|url=http://werbos.com/Neural/SensitivityIFIPSeptember1981.pdf|title=System modeling and optimization|last=Werbos|first=Paul|authorlink=Paul Werbos|publisher=Springer|year=1982|pages=762–770|chapter=Applications of advances in nonlinear sensitivity analysis}}</ref> In 1986, [[David E. Rumelhart|Rumelhart]], Hinton and [[Ronald J. Williams|Williams]] noted that this method can generate useful internal representations of incoming data in hidden layers of neural networks.<ref name=":4">{{Cite journal|last=Rumelhart|first=David E.|last2=Hinton|first2=Geoffrey E.|last3=Williams|first3=Ronald J.|title=Learning representations by back-propagating errors|url=http://www.nature.com/articles/Art323533a0|journal=Nature|volume=323|issue=6088|pages=533–536|doi=10.1038/323533a0|year=1986|bibcode=1986Natur.323..533R}}</ref> In 1993, Wan was the first<ref name="SCHIDHUB2"/> to win an international pattern recognition contest through backpropagation.<ref name="wan1993">Eric A. Wan (1993). "Time series prediction by using a connectionist network with internal delay lines." In ''Proceedings of the Santa Fe Institute Studies in the Sciences of Complexity'', '''15''': p. 195. Addison-Wesley Publishing Co.</ref>
The weight updates of backpropagation can be done via [[stochastic gradient descent]] using the following equation:

: <math> w_{ij}(t + 1) = w_{ij}(t) - \eta\frac{\partial C}{\partial w_{ij}} + \xi(t) </math>
where <math> \eta </math> is the learning rate, <math> C </math> is the cost (loss) function and <math>\xi(t)</math> is a stochastic term. The choice of the cost function depends on factors such as the learning type (supervised, unsupervised, [[Reinforcement learning|reinforcement]], etc.) and the [[activation function]]. For example, when performing supervised learning on a [[multiclass classification]] problem, common choices for the activation function and cost function are the [[Softmax activation function|softmax]] function and [[cross entropy]] function, respectively. The softmax function is defined as <math> p_j = \frac{\exp(x_j)}{\sum_k \exp(x_k)} </math> where <math> p_j </math> represents the class probability (output of the unit <math> j </math>) and <math> x_j </math> and <math> x_k </math> represent the total input to units <math> j </math> and <math> k </math> of the same level respectively. Cross entropy is defined as <math> C = -\sum_j d_j \log(p_j) </math> where <math> d_j </math> represents the target probability for output unit <math> j </math> and <math> p_j </math> is the probability output for <math> j </math> after applying the activation function.<ref>{{Cite journal|last=Hinton|first=G.|last2=Deng|first2=L.|last3=Yu|first3=D.|last4=Dahl|first4=G. E.|last5=Mohamed|first5=A. r|last6=Jaitly|first6=N.|last7=Senior|first7=A.|last8=Vanhoucke|first8=V.|last9=Nguyen|first9=P.|date=November 2012|title=Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups|url=http://ieeexplore.ieee.org/document/6296526/|journal=IEEE Signal Processing Magazine|volume=29|issue=6|pages=82–97|doi=10.1109/msp.2012.2205597|issn=1053-5888|bibcode=2012ISPM...29...82H}}</ref>
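Both definitions translate directly into code. The sketch below illustrates only the two formulas, plus the well-known simplification that for this pairing with a one-hot target, <math> \partial C/\partial x_j = p_j - d_j </math>:

<syntaxhighlight lang="python">
import numpy as np

def softmax(x):
    """p_j = exp(x_j) / sum_k exp(x_k); shifting by max(x) avoids overflow."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cross_entropy(d, p):
    """C = -sum_j d_j * log(p_j)."""
    return -np.sum(d * np.log(p))

x = np.array([2.0, 1.0, 0.1])   # total inputs to the output units
d = np.array([1.0, 0.0, 0.0])   # target probabilities (one-hot)
p = softmax(x)
print(p, cross_entropy(d, p))
print(p - d)                    # gradient of C with respect to x
</syntaxhighlight>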
These can be used to output object [[Minimum bounding box|bounding boxes]] in the form of a binary mask. They are also used for multi-scale regression to increase localization precision. DNN-based regression can learn features that capture geometric information in addition to serving as a good classifier. They remove the requirement to explicitly model parts and their relations. This helps to broaden the variety of objects that can be learned. The model consists of multiple layers, each of which has a [[rectified linear unit]] as its activation function for non-linear transformation. Some layers are convolutional, while others are fully connected. Every convolutional layer has an additional max pooling. The network is trained to [[Minimum mean square error|minimize]] [[L2 norm|''L''<sup>2</sup> error]] for predicting the mask ranging over the entire training set containing bounding boxes represented as masks.
Alternatives to backpropagation include [[Extreme Learning Machines]],<ref>{{cite journal|last2=Zhu|first2=Qin-Yu|last3=Siew|first3=Chee-Kheong|year=2006|title=Extreme learning machine: theory and applications|journal=Neurocomputing|volume=70|issue=1|pages=489–501|doi=10.1016/j.neucom.2005.12.126|last1=Huang|first1=Guang-Bin}}</ref> "No-prop" networks,<ref>{{cite journal|year=2013|title=The no-prop algorithm: A new learning algorithm for multilayer neural networks|journal=Neural Networks|volume=37|pages=182–188|doi=10.1016/j.neunet.2012.09.020|last1=Widrow|first1=Bernard|display-authors=etal}}</ref> training without backtracking,<ref>{{cite arXiv|eprint=1507.07680|first=Yann|last=Ollivier|first2=Guillaume|last2=Charpiat|title=Training recurrent networks without backtracking|year=2015|class=cs.NE}}</ref> "weightless" networks,<ref>ESANN. 2009</ref><ref name="RBMTRAIN">{{Cite journal|last=Hinton|first=G. E.|date=2010|title=A Practical Guide to Training Restricted Boltzmann Machines|url=https://www.researchgate.net/publication/221166159_A_brief_introduction_to_Weightless_Neural_Systems|journal=Tech. Rep. UTML TR 2010-003}}</ref> and [[Holographic associative memory|non-connectionist neural networks]].
=== Learning paradigms ===
The three major learning paradigms each correspond to a particular learning task. These are [[supervised learning]], [[unsupervised learning]] and [[reinforcement learning]].

==== Supervised learning ====
[[Supervised learning]] uses a set of example pairs <math> (x, y), x \in X, y \in Y</math> and the aim is to find a function <math> f : X \rightarrow Y </math> in the allowed class of functions that matches the examples. In other words, we wish to infer the mapping implied by the data; the cost function is related to the mismatch between our mapping and the data and it implicitly contains prior knowledge about the problem domain.<ref>{{Cite journal|last=Ojha|first=Varun Kumar|last2=Abraham|first2=Ajith|last3=Snášel|first3=Václav|date=2017-04-01|title=Metaheuristic design of feedforward neural networks: A review of two decades of research|url=http://www.sciencedirect.com/science/article/pii/S0952197617300234|journal=Engineering Applications of Artificial Intelligence|volume=60|pages=97–116|doi=10.1016/j.engappai.2017.01.013}}</ref>

A commonly used cost is the [[mean-squared error]], which tries to minimize the average squared error between the network's output, <math> f(x)</math>, and the target value <math> y</math> over all the example pairs. Minimizing this cost using [[gradient descent]] for the class of neural networks called [[multilayer perceptron]]s (MLPs) produces the [[Backpropagation|backpropagation algorithm]] for training neural networks.

Tasks that fall within the paradigm of supervised learning are [[pattern recognition]] (also known as classification) and [[Regression analysis|regression]] (also known as function approximation). The supervised learning paradigm is also applicable to sequential data (e.g., for handwriting, speech and gesture recognition). This can be thought of as learning with a "teacher", in the form of a function that provides continuous feedback on the quality of solutions obtained thus far.
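The following self-contained sketch puts these pieces together: gradient descent on the mean-squared error for a small MLP with one hidden layer. The layer sizes, learning rate and toy dataset are arbitrary choices made for illustration, not prescribed values:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))   # example inputs x
Y = np.sin(3 * X)                       # target values y (a toy mapping)

W1 = rng.normal(size=(1, 16)); b1 = np.zeros(16)
W2 = rng.normal(size=(16, 1)); b2 = np.zeros(1)
eta = 0.05                              # learning rate

for step in range(2000):
    # Forward pass: tanh hidden layer, linear output.
    H = np.tanh(X @ W1 + b1)
    out = H @ W2 + b2
    err = out - Y                       # dC/d(out), up to a constant factor

    # Backward pass (chain rule), then gradient-descent weight updates.
    dW2 = H.T @ err / len(X); db2 = err.mean(axis=0)
    dH = (err @ W2.T) * (1 - H ** 2)    # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ dH / len(X); db1 = dH.mean(axis=0)
    W2 -= eta * dW2; b2 -= eta * db2
    W1 -= eta * dW1; b1 -= eta * db1

print(np.mean(err ** 2))                # mean-squared error after training
</syntaxhighlight>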
==== Unsupervised learning ====
In [[unsupervised learning]], some data <math>\textstyle x</math> is given together with a cost function to be minimized, which can be any function of the data <math>\textstyle x</math> and the network's output <math>\textstyle f</math>.

The cost function is dependent on the task (the model domain) and any ''[[A priori and a posteriori|a priori]]'' assumptions (the implicit properties of the model, its parameters and the observed variables).

As a trivial example, consider the model <math>\textstyle f(x) = a</math> where <math>\textstyle a</math> is a constant and the cost <math>\textstyle C=E[(x - f(x))^2]</math>. Minimizing this cost produces a value of <math>\textstyle a</math> that is equal to the mean of the data. The cost function can be much more complicated. Its form depends on the application: for example, in [[Data compression|compression]] it could be related to the [[mutual information]] between <math>\textstyle x</math> and <math>\textstyle f(x)</math>, whereas in statistical modeling it could be related to the [[posterior probability]] of the model given the data (note that in both of those examples those quantities would be maximized rather than minimized).
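A few lines of code confirm the trivial example: over a sampled dataset, the constant that minimizes the squared-error cost is the sample mean. The data distribution here is invented for the demonstration:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)    # observed data

# Model f(x) = a; scan candidate constants, evaluating C = mean((x - a)^2).
candidates = np.linspace(0.0, 10.0, 1001)
costs = [np.mean((x - a) ** 2) for a in candidates]
print(candidates[np.argmin(costs)], x.mean())      # both close to 5.0
</syntaxhighlight>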
Tasks that fall within the paradigm of unsupervised learning are in general [[Approximation|estimation]] problems; the applications include [[Data clustering|clustering]], the estimation of [[statistical distributions]], [[Data compression|compression]] and [[Bayesian spam filtering|filtering]].
==== Reinforcement learning ====
{{See also|Stochastic control}}
In [[reinforcement learning]], data <math>\textstyle x</math> are usually not given, but generated by an agent's interactions with the environment. At each point in time <math>\textstyle t</math>, the agent performs an action <math>\textstyle y_t</math> and the environment generates an observation <math>\textstyle x_t</math> and an instantaneous cost <math>\textstyle c_t</math>, according to some (usually unknown) dynamics. The aim is to discover a policy for selecting actions that minimizes some measure of a long-term cost, e.g., the expected cumulative cost. The environment's dynamics and the long-term cost for each policy are usually unknown, but can be estimated.
More formally, the environment is modeled as a [[Markov decision process]] (MDP) with states <math>\textstyle {s_1,...,s_n}\in S </math> and actions <math>\textstyle {a_1,...,a_m} \in A</math> with the following probability distributions: the instantaneous cost distribution <math>\textstyle P(c_t|s_t)</math>, the observation distribution <math>\textstyle P(x_t|s_t)</math> and the transition distribution <math>\textstyle P(s_{t+1}|s_t, a_t)</math>, while a policy is defined as the conditional distribution over actions given the observations. Taken together, the two then define a [[Markov chain]] (MC). The aim is to discover the policy (i.e., the MC) that minimizes the cost.
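As a toy illustration of this formalization, the sketch below estimates the expected cumulative cost of two fixed policies in an invented two-state, two-action MDP; all transition probabilities and costs are made up for the example:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
P = {  # transition distribution P(s_{t+1} | s_t, a_t): 2 states, 2 actions
    (0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
    (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9],
}
cost = [1.0, 0.0]  # expected instantaneous cost c_t in each state

def expected_cumulative_cost(policy, horizon=50, episodes=500):
    """Monte-Carlo estimate of the long-term cost of a state -> action map."""
    total = 0.0
    for _ in range(episodes):
        s = 0
        for _ in range(horizon):
            total += cost[s]
            s = rng.choice(2, p=P[(s, policy[s])])
    return total / episodes

print(expected_cumulative_cost({0: 0, 1: 0}))  # tends to stay in the costly state
print(expected_cumulative_cost({0: 1, 1: 1}))  # escapes toward the cheap state
</syntaxhighlight>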
ANNs are frequently used in reinforcement learning as part of the overall algorithm.<ref>{{cite conference| author = Dominic, S. |author2=Das, R. |author3=Whitley, D. |author4=Anderson, C. |date=July 1991 | title = Genetic reinforcement learning for neural networks | conference = IJCNN-91-Seattle International Joint Conference on Neural Networks | booktitle = IJCNN-91-Seattle International Joint Conference on Neural Networks | publisher = IEEE | location = Seattle, Washington, USA | doi = 10.1109/IJCNN.1991.155315 | accessdate = | isbn = 0-7803-0164-1 }}</ref><ref>{{cite journal |last=Hoskins |first=J.C. |author2=Himmelblau, D.M. |title=Process control via artificial neural networks and reinforcement learning |journal=Computers & Chemical Engineering |year=1992 |volume=16 |pages=241–251 |doi=10.1016/0098-1354(92)80045-B |issue=4}}</ref> [[Dynamic programming]] was coupled with ANNs (giving neurodynamic programming) by [[Dimitri Bertsekas|Bertsekas]] and Tsitsiklis<ref>{{cite book|url=https://papers.nips.cc/paper/4741-deep-neural-networks-segment-neuronal-membranes-in-electron-microscopy-images|title=Neuro-dynamic programming|first=D.P.|first2=J.N.|publisher=Athena Scientific|year=1996|isbn=1-886529-10-8|location=|page=512|pages=|author=Bertsekas|author2=Tsitsiklis}}</ref> and applied to multi-dimensional nonlinear problems such as those involved in [[vehicle routing]],<ref>{{cite journal |last=Secomandi |first=Nicola |title=Comparing neuro-dynamic programming algorithms for the vehicle routing problem with stochastic demands |journal=Computers & Operations Research |year=2000 |volume=27 |pages=1201–1225 |doi=10.1016/S0305-0548(99)00146-X |issue=11–12}}</ref> [[natural resource management|natural resources management]]<ref>{{cite conference| author = de Rigo, D. |author2=Rizzoli, A. E. |author3=Soncini-Sessa, R. |author4=Weber, E. |author5=Zenesi, P. | year = 2001 | title = Neuro-dynamic programming for the efficient management of reservoir networks | conference = MODSIM 2001, International Congress on Modelling and Simulation | conferenceurl = http://www.mssanz.org.au/MODSIM01/MODSIM01.htm | booktitle = Proceedings of MODSIM 2001, International Congress on Modelling and Simulation | publisher = Modelling and Simulation Society of Australia and New Zealand | location = Canberra, Australia | doi = 10.5281/zenodo.7481 | url = https://zenodo.org/record/7482/files/de_Rigo_etal_MODSIM2001_activelink_authorcopy.pdf | accessdate = 29 July 2012 | isbn = 0-867405252 }}</ref><ref>{{cite conference| author = Damas, M. |author2=Salmeron, M. |author3=Diaz, A. |author4=Ortega, J. |author5=Prieto, A. |author6=Olivares, G.| year = 2000 | title = Genetic algorithms and neuro-dynamic programming: application to water supply networks | conference = 2000 Congress on Evolutionary Computation | booktitle = Proceedings of 2000 Congress on Evolutionary Computation | publisher = IEEE | location = La Jolla, California, USA | doi = 10.1109/CEC.2000.870269 | accessdate = | isbn = 0-7803-6375-2 }}</ref> or [[medicine]]<ref>{{cite journal |last=Deng |first=Geng |author2=Ferris, M.C. 
|title=Neuro-dynamic programming for fractionated radiotherapy planning |journal=Springer Optimization and Its Applications |year=2008 |volume=12 |pages=47–70 |doi=10.1007/978-0-387-73299-2_3|citeseerx=10.1.1.137.8288 |series=Springer Optimization and Its Applications |isbn=978-0-387-73298-5 }}</ref> because of the ability of ANNs to mitigate losses of accuracy even when reducing the discretization grid density for numerically approximating the solution of the original control problems.