This is a learning method specially designed for [[cerebellar model articulation controller]] (CMAC) neural networks. In 2004, a recursive least squares (RLS) algorithm was introduced to train CMAC neural networks online.<ref name=Qin1>Ting Qin, et al. "A learning algorithm of CMAC based on RLS." Neural Processing Letters 19.1 (2004): 49–61.</ref> This algorithm can converge in one step and update all weights in one step with any new input data. Initially, this algorithm had a [[Computational complexity theory|computational complexity]] of ''O''(''N''<sup>3</sup>). Based on [[QR decomposition]], the recursive learning algorithm was simplified to ''O''(''N'').<ref name=Qin2>Ting Qin, et al. "Continuous CMAC-QRLS and its systolic array." Neural Processing Letters 22.1 (2005): 1–16.</ref>
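A minimal sketch of a recursive least squares weight update of the kind used in such online training is shown below. It illustrates the generic RLS recursion for a linear-in-the-weights model (as in CMAC's association layer), not the specific CMAC-QRLS algorithm of the cited papers; the variable names, forgetting factor and toy data are illustrative assumptions.
<syntaxhighlight lang="python">
import numpy as np

def rls_update(w, P, x, d, lam=1.0):
    """One recursive-least-squares step for a linear model y = w @ x.

    w   : current weight vector (n,)
    P   : current inverse correlation matrix (n, n)
    x   : new input/feature vector (n,)
    d   : desired (target) output for this input
    lam : forgetting factor (1.0 = no forgetting)
    """
    Px = P @ x
    k = Px / (lam + x @ Px)          # gain vector
    e = d - w @ x                    # a priori error
    w = w + k * e                    # weight update
    P = (P - np.outer(k, Px)) / lam  # inverse correlation update
    return w, P

# Toy usage: learn a 3-weight linear map from streaming data.
rng = np.random.default_rng(0)
true_w = np.array([0.5, -1.0, 2.0])
w = np.zeros(3)
P = np.eye(3) * 1e3                  # large initial "uncertainty"
for _ in range(200):
    x = rng.normal(size=3)
    d = true_w @ x
    w, P = rls_update(w, P, x, d)
print(np.round(w, 3))                # converges to approximately [0.5, -1.0, 2.0]
</syntaxhighlight>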
{{See also|Machine learning}}
Training a neural network model essentially means selecting one model from the set of allowed models (or, in a [[Bayesian probability|Bayesian]] framework, determining a distribution over the set of allowed models) that minimizes the cost. Numerous algorithms are available for training neural network models; most of them can be viewed as a straightforward application of [[Mathematical optimization|optimization]] theory and [[statistical estimation]].
Most employ some form of [[gradient descent]], using backpropagation to compute the actual gradients. This is done by taking the derivative of the cost function with respect to the network parameters and then changing those parameters in a [[gradient-related]] direction; a minimal sketch of this procedure follows the list below. Backpropagation training algorithms fall into three categories:
* [[Gradient descent|steepest descent]] (with variable learning rate and [[Gradient descent#The momentum method|momentum]], [[Rprop|resilient backpropagation]]);
* quasi-Newton ([[Broyden–Fletcher–Goldfarb–Shanno algorithm|Broyden–Fletcher–Goldfarb–Shanno]], [[Secant method|one-step secant]]);
* [[Levenberg–Marquardt algorithm|Levenberg–Marquardt]] and [[Conjugate gradient method|conjugate gradient]] (Fletcher–Reeves update, Polak–Ribière update, Powell–Beale restart, scaled conjugate gradient).<ref>{{cite conference|author1=M. Forouzanfar |author2=H. R. Dajani |author3=V. Z. Groza |author4=M. Bolic |author5=S. Rajan |last-author-amp=yes |date=July 2010 | title = Comparison of Feed-Forward Neural Network Training Algorithms for Oscillometric Blood Pressure Estimation | conference = 4th Int. Workshop Soft Computing Applications | publisher = IEEE| location = Arad, Romania |url=https://www.researchgate.net/profile/Mohamad_Forouzanfar/publication/224173336_Comparison_of_Feed-Forward_Neural_Network_training_algorithms_for_oscillometric_blood_pressure_estimation/links/00b7d533829c3a7484000000.pdf?ev=pub_int_doc_dl&origin=publication_detail&inViewer=true&msrp=TyT96%2BjWOHJo%2BVhkMF4IzwHPAImSd442n%2BAkEuXj9qBmQSZ495CpxqlaOYon%2BSlEzWQElBGyJmbBCiiUOV8ImeEqPFXiIRivcrWsWmlPBYU%3D }}</ref>
[[Evolutionary methods]],<ref>{{cite conference| author1 = de Rigo, D. | author2 = Castelletti, A. | author3 = Rizzoli, A. E. | author4 = Soncini-Sessa, R. | author5 = Weber, E. |date=January 2005 | title = A selective improvement technique for fastening Neuro-Dynamic Programming in Water Resources Network Management | conference = 16th IFAC World Congress | conferenceurl = http://www.nt.ntnu.no/users/skoge/prost/proceedings/ifac2005/Index.html | booktitle = Proceedings of the 16th IFAC World Congress – IFAC-PapersOnLine | editor = Pavel Zítek | volume = 16 | publisher = IFAC | location = Prague, Czech Republic | url = http://www.nt.ntnu.no/users/skoge/prost/proceedings/ifac2005/Papers/Paper4269.html | accessdate = 30 December 2011 | doi = 10.3182/20050703-6-CZ-1902.02172 | isbn = 978-3-902661-75-3 }}</ref> [[gene expression programming]],<ref>{{cite web|last=Ferreira|first=C.|year=2006|title=Designing Neural Networks Using Gene Expression Programming|url= http://www.gene-expression-programming.com/webpapers/Ferreira-ASCT2006.pdf|publisher= In A. Abraham, B. de Baets, M. Köppen, and B. Nickolay, eds., Applied Soft Computing Technologies: The Challenge of Complexity, pages 517–536, Springer-Verlag}}</ref> [[simulated annealing]],<ref>{{cite conference| author = Da, Y. |author2=Xiurun, G. |date=July 2005 | title = An improved PSO-based ANN with simulated annealing technique | conference = New Aspects in Neurocomputing: 11th European Symposium on Artificial Neural Networks | conferenceurl = http://www.dice.ucl.ac.be/esann/proceedings/electronicproceedings.htm | editor = T. Villmann | publisher = Elsevier | doi = 10.1016/j.neucom.2004.07.002 }}</ref> [[expectation-maximization]], [[non-parametric methods]] and [[particle swarm optimization]]<ref>{{cite conference| author = Wu, J. |author2=Chen, E. |date=May 2009 | title = A Novel Nonparametric Regression Ensemble for Rainfall Forecasting Using Particle Swarm Optimization Technique Coupled with Artificial Neural Network | conference = 6th International Symposium on Neural Networks, ISNN 2009 | conferenceurl = http://www2.mae.cuhk.edu.hk/~isnn2009/ | editors = Wang, H., Shen, Y., Huang, T., Zeng, Z. | publisher = Springer | doi = 10.1007/978-3-642-01513-7-6 | isbn = 978-3-642-01215-0 }}</ref> are other methods for training neural networks.
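The following is a minimal sketch of gradient-descent training with backpropagation for a one-hidden-layer network on a toy regression task. The layer sizes, squared-error cost, learning rate and NumPy implementation are illustrative choices, not taken from any of the cited methods.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                  # inputs
t = np.sin(X[:, :1]) + 0.5 * X[:, 1:]          # targets

W1, b1 = rng.normal(scale=0.5, size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(scale=0.5, size=(8, 1)), np.zeros(1)
lr = 0.1

for epoch in range(500):
    # forward pass
    h = np.tanh(X @ W1 + b1)                   # hidden activations
    y = h @ W2 + b2                            # linear output
    err = y - t                                # derivative of 0.5*||y - t||^2 w.r.t. y

    # backward pass (chain rule)
    dW2 = h.T @ err / len(X)
    db2 = err.mean(axis=0)
    dh = err @ W2.T * (1 - h ** 2)             # back through tanh
    dW1 = X.T @ dh / len(X)
    db1 = dh.mean(axis=0)

    # steepest-descent parameter update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final mean squared error:", float((err ** 2).mean()))
</syntaxhighlight>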
== Variants ==
=== Group method of data handling ===
{{Main|Group method of data handling}}The Group Method of Data Handling (GMDH)<ref name="ivak1968">{{cite journal|year=1968|title=The [[group method of data handling]] – a rival of the method of stochastic approximation|url=|journal=Soviet Automatic Control|volume=13|issue=3|pages=43–55|last1=Ivakhnenko|first1=Alexey Grigorevich|authorlink=Alexey Grigorevich Ivakhnenko}}</ref> features fully automatic structural and parametric model optimization. The node activation functions are [[Andrey Kolmogorov|Kolmogorov]]–Gabor polynomials that permit additions and multiplications. It used a deep feedforward multilayer perceptron with eight layers.<ref name="ivak1971">{{Cite journal|last=Ivakhnenko|first=Alexey|date=1971|title=Polynomial theory of complex systems|url=|journal=IEEE Transactions on Systems, Man and Cybernetics (4)|issue=4|pages=364–378|doi=10.1109/TSMC.1971.4308320|pmid=|access-date=}}</ref> It is a [[supervised learning]] network that grows layer by layer, where each layer is trained by [[regression analysis]]. Useless items are detected using a validation set and pruned through [[Regularization (mathematics)|regularization]]. The size and depth of the resulting network depend on the task.<ref name="kondo2008">{{cite journal|last2=Ueno|first2=J.|date=|year=2008|title=Multi-layered GMDH-type neural network self-selecting optimum neural network architecture and its application to 3-dimensional medical image recognition of blood vessels|url=https://www.researchgate.net/publication/228402366_GMDH-Type_Neural_Network_Self-Selecting_Optimum_Neural_Network_Architecture_and_Its_Application_to_3-Dimensional_Medical_Image_Recognition_of_the_Lungs|journal=International Journal of Innovative Computing, Information and Control|volume=4|issue=1|pages=175–187|via=|last1=Kondo|first1=T.}}</ref>
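Below is a highly simplified sketch of the layer-by-layer growth just described: candidate units are quadratic (Kolmogorov–Gabor) polynomials of pairs of inputs fitted by least squares, and only the candidates that perform best on a held-out validation set survive to feed the next layer. The layer width, stopping rule and selection criterion here are arbitrary illustrative choices, not the full GMDH procedure of the cited works.
<syntaxhighlight lang="python">
import numpy as np
from itertools import combinations

def poly_features(a, b):
    # Kolmogorov–Gabor quadratic terms of a pair of inputs (plus bias).
    return np.stack([np.ones_like(a), a, b, a * b, a * a, b * b], axis=1)

def fit_unit(a, b, y):
    F = poly_features(a, b)
    coef, *_ = np.linalg.lstsq(F, y, rcond=None)
    return coef

def grow_layer(Z_train, y_train, Z_val, y_val, keep=4):
    candidates = []
    for i, j in combinations(range(Z_train.shape[1]), 2):
        coef = fit_unit(Z_train[:, i], Z_train[:, j], y_train)
        val_out = poly_features(Z_val[:, i], Z_val[:, j]) @ coef
        val_err = np.mean((val_out - y_val) ** 2)     # external (validation) criterion
        train_out = poly_features(Z_train[:, i], Z_train[:, j]) @ coef
        candidates.append((val_err, train_out, val_out))
    candidates.sort(key=lambda c: c[0])               # prune useless units
    best = candidates[:keep]
    Z_tr = np.stack([c[1] for c in best], axis=1)
    Z_va = np.stack([c[2] for c in best], axis=1)
    return Z_tr, Z_va, best[0][0]

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = X[:, 0] * X[:, 1] + np.sin(X[:, 2])
Xt, Xv, yt, yv = X[:200], X[200:], y[:200], y[200:]

Zt, Zv, err = Xt, Xv, np.inf
for depth in range(4):                                # grow until validation error stalls
    Zt, Zv, new_err = grow_layer(Zt, yt, Zv, yv)
    if new_err >= err:
        break
    err = new_err
print("validation MSE:", round(float(err), 4))
</syntaxhighlight>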
=== Convolutional neural networks ===
{{main|Convolutional neural network}}A convolutional neural network (CNN) is a class of deep, feed-forward networks, composed of one or more [[convolution]]al layers with fully connected layers (matching those in typical ANNs) on top. It uses tied weights and pooling layers. In particular, max-pooling<ref name="Weng19932"/> is often structured via Fukushima's convolutional architecture.<ref name="FUKU1980">{{cite journal|year=1980|title=Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position|url=|journal=Biol. Cybern.|volume=36|issue=4|pages=193–202|doi=10.1007/bf00344251|pmid=7370364|last1=Fukushima|first1=K.}}</ref> This architecture allows CNNs to take advantage of the 2D structure of input data.
CNNs are suitable for processing visual and other two-dimensional data.<ref name="LECUN1989">LeCun ''et al.'', "Backpropagation Applied to Handwritten Zip Code Recognition," ''Neural Computation'', 1, pp. 541–551, 1989.</ref><ref name="lecun2016slides">[[Yann LeCun]] (2016). Slides on Deep Learning [https://indico.cern.ch/event/510372/ Online]</ref> They have shown superior results in both image and speech applications. They can be trained with standard backpropagation. CNNs are easier to train than other regular, deep, feed-forward neural networks and have many fewer parameters to estimate.<ref name="STANCNN">{{cite web|url=http://ufldl.stanford.edu/tutorial/supervised/ConvolutionalNeuralNetwork/|title=Unsupervised Feature Learning and Deep Learning Tutorial|publisher=}}</ref> Examples of applications in computer vision include [[DeepDream]]<ref name="deepdream">{{cite journal|last2=Liu|first2=Wei|last3=Jia|first3=Yangqing|last4=Sermanet|first4=Pierre|last5=Reed|first5=Scott|last6=Anguelov|first6=Dragomir|last7=Erhan|first7=Dumitru|last8=Vanhoucke|first8=Vincent|last9=Rabinovich|first9=Andrew|date=|year=2014|title=Going Deeper with Convolutions|url=|journal=Computing Research Repository|volume=|pages=1|arxiv=1409.4842|doi=10.1109/CVPR.2015.7298594|via=|first1=Christian|last1=Szegedy|isbn=978-1-4673-6964-0}}</ref> and [[robot navigation]].<ref>{{cite journal | last=Ran | first=Lingyan | last2=Zhang | first2=Yanning | last3=Zhang | first3=Qilin | last4=Yang | first4=Tao | title=Convolutional Neural Network-Based Robot Navigation Using Uncalibrated Spherical Images | journal=Sensors | publisher=MDPI AG | volume=17 | issue=6 | date=2017-06-12 | issn=1424-8220 | doi=10.3390/s17061341 | page=1341 | url=https://qilin-zhang.github.io/_pages/pdfs/sensors-17-01341.pdf}}</ref>
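As an illustration of the convolution and max-pooling operations described above, the following NumPy sketch applies one convolutional filter (with tied weights across all image positions) followed by 2×2 max-pooling to a small single-channel image; the filter values and sizes are arbitrary assumptions.
<syntaxhighlight lang="python">
import numpy as np

def conv2d(img, kernel):
    """Valid 2-D convolution of a single-channel image with one filter.
    The same (tied) weights are applied at every spatial position."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(x, size=2):
    """Non-overlapping max-pooling; keeps the strongest response in each block."""
    H, W = x.shape
    H, W = H - H % size, W - W % size
    x = x[:H, :W].reshape(H // size, size, W // size, size)
    return x.max(axis=(1, 3))

img = np.random.default_rng(0).random((8, 8))           # toy 8x8 "image"
edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)          # crude vertical-edge detector
feature_map = np.maximum(conv2d(img, edge_filter), 0)   # convolution + ReLU
pooled = max_pool(feature_map)                          # 2x2 max-pooling
print(feature_map.shape, pooled.shape)                  # (6, 6) (3, 3)
</syntaxhighlight>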
=== Long short-term memory ===
{{main|Long short-term memory}}Long short-term memory (LSTM) networks are RNNs that avoid the [[vanishing gradient problem]].<ref name=":03">{{Cite journal|last=Hochreiter|first=Sepp|author-link=Sepp Hochreiter|last2=Schmidhuber|first2=Jürgen|author-link2=Jürgen Schmidhuber|date=1997-11-01|title=Long Short-Term Memory|url=http://www.mitpressjournals.org/doi/10.1162/neco.1997.9.8.1735|journal=Neural Computation|volume=9|issue=8|pages=1735–1780|doi=10.1162/neco.1997.9.8.1735|issn=0899-7667|via=}}</ref> LSTM is normally augmented by recurrent gates called forget gates.<ref name=":10">{{Cite web|url=https://www.researchgate.net/publication/220320057_Learning_Precise_Timing_with_LSTM_Recurrent_Networks|title=Learning Precise Timing with LSTM Recurrent Networks (PDF Download Available)|website=ResearchGate|language=en|access-date=2017-06-13|pp=115–143}}</ref> LSTM networks prevent backpropagated errors from vanishing or exploding.<ref name="HOCH19912"/> Instead, errors can flow backwards through an unlimited number of virtual layers of the unfolded network. That is, LSTM can learn "very deep learning" tasks<ref name="SCHIDHUB2" /> that require memories of events that happened thousands or even millions of discrete time steps earlier. Problem-specific LSTM-like topologies can be evolved.<ref>{{Cite journal|last=Bayer|first=Justin|last2=Wierstra|first2=Daan|last3=Togelius|first3=Julian|last4=Schmidhuber|first4=Jürgen|date=2009-09-14|title=Evolving Memory Cell Structures for Sequence Learning|url=https://link.springer.com/chapter/10.1007/978-3-642-04277-5_76|journal=Artificial Neural Networks – ICANN 2009|volume=5769|language=en|publisher=Springer, Berlin, Heidelberg|pages=755–764|doi=10.1007/978-3-642-04277-5_76|series=Lecture Notes in Computer Science|isbn=978-3-642-04276-8}}</ref> LSTM can handle long delays and signals that have a mix of low- and high-frequency components.
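A minimal sketch of a single LSTM cell's forward step (including the forget gate mentioned above) is given below in NumPy. The weight shapes, the absence of peephole connections and the stacked-parameter layout are simplifying assumptions, and the code shows only inference, not the gradient computations.
<syntaxhighlight lang="python">
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One time step of an LSTM cell.
    W, U, b hold the stacked parameters of the input, forget and
    output gates and the cell candidate (4*n rows each)."""
    n = h_prev.size
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0 * n:1 * n])        # input gate
    f = sigmoid(z[1 * n:2 * n])        # forget gate: keeps or clears the cell state
    o = sigmoid(z[2 * n:3 * n])        # output gate
    g = np.tanh(z[3 * n:4 * n])        # candidate cell update
    c = f * c_prev + i * g             # additive cell-state update
    h = o * np.tanh(c)                 # hidden state exposed to the rest of the net
    return h, c

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 5
W = rng.normal(scale=0.1, size=(4 * n_hidden, n_in))
U = rng.normal(scale=0.1, size=(4 * n_hidden, n_hidden))
b = np.zeros(4 * n_hidden)

h = np.zeros(n_hidden); c = np.zeros(n_hidden)
for t in range(10):                    # run the cell over a short input sequence
    x_t = rng.normal(size=n_in)
    h, c = lstm_step(x_t, h, c, W, U, b)
print(np.round(h, 3))
</syntaxhighlight>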
Stacks of LSTM RNNs<ref>{{Cite journal|last=Fernández|first=Santiago|last2=Graves|first2=Alex|last3=Schmidhuber|first3=Jürgen|date=2007|title=Sequence labelling in structured domains with hierarchical recurrent neural networks|url=http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.79.1887|journal=In Proc. 20th Int. Joint Conf. on Artificial Intelligence, IJCAI 2007|pages=774–779}}</ref> trained by Connectionist Temporal Classification (CTC)<ref name=":12">{{Cite journal|last=Graves|first=Alex|last2=Fernández|first2=Santiago|last3=Gomez|first3=Faustino|date=2006|title=Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks|url=http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.75.6306|journal=In Proceedings of the International Conference on Machine Learning, ICML 2006|pages=369–376}}</ref> can find an RNN weight matrix that maximizes the probability of the label sequences in a training set, given the corresponding input sequences. CTC achieves both alignment and recognition.
In 2003, LSTM started to become competitive with traditional speech recognizers.<ref name="graves2003">{{Cite web|url=ftp://ftp.idsia.ch/pub/juergen/bioadit2004.pdf|title=Biologically Plausible Speech Recognition with LSTM Neural Nets|last=Graves|first=Alex|last2=Eck|first2=Douglas|date=2003|website=1st Intl. Workshop on Biologically Inspired Approaches to Advanced Information Technology, Bio-ADIT 2004, Lausanne, Switzerland|pages=175–184|archive-url=|archive-date=|dead-url=|access-date=|last3=Beringer|first3=Nicole|last4=Schmidhuber|first4=Jürgen|authorlink4=Jürgen Schmidhuber}}</ref> In 2007, the combination with CTC achieved the first good results on speech data.<ref name="fernandez2007keyword">{{Cite journal|last=Fernández|first=Santiago|last2=Graves|first2=Alex|last3=Schmidhuber|first3=Jürgen|date=2007|title=An Application of Recurrent Neural Networks to Discriminative Keyword Spotting|url=http://dl.acm.org/citation.cfm?id=1778066.1778092|journal=Proceedings of the 17th International Conference on Artificial Neural Networks|series=ICANN'07|location=Berlin, Heidelberg|publisher=Springer-Verlag|pages=220–229|isbn=3540746935}}</ref> In 2009, a CTC-trained LSTM was the first RNN to win pattern recognition contests, when it won several competitions in connected [[handwriting recognition]].<ref name="SCHIDHUB2" /><ref name="graves20093"/> In 2014, [[Baidu]] used CTC-trained RNNs to break the Switchboard Hub5'00 speech recognition benchmark, without traditional speech processing methods.<ref name="hannun2014">{{cite arxiv|last=Hannun|first=Awni|last2=Case|first2=Carl|last3=Casper|first3=Jared|last4=Catanzaro|first4=Bryan|last5=Diamos|first5=Greg|last6=Elsen|first6=Erich|last7=Prenger|first7=Ryan|last8=Satheesh|first8=Sanjeev|last9=Sengupta|first9=Shubho|date=2014-12-17|title=Deep Speech: Scaling up end-to-end speech recognition|eprint=1412.5567|class=cs.CL}}</ref> LSTM also improved large-vocabulary speech recognition,<ref name="sak2014">{{Cite web|url=https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43905.pdf|title=Long Short-Term Memory recurrent neural network architectures for large scale acoustic modeling|last=Sak|first=Hasim|last2=Senior|first2=Andrew|date=2014|website=|archive-url=|archive-date=|dead-url=|access-date=|last3=Beaufays|first3=Francoise}}</ref><ref name="liwu2015">{{cite arxiv|last=Li|first=Xiangang|last2=Wu|first2=Xihong|date=2014-10-15|title=Constructing Long Short-Term Memory based Deep Recurrent Neural Networks for Large Vocabulary Speech Recognition|eprint=1410.4281|class=cs.CL}}</ref> text-to-speech synthesis,<ref>{{Cite web|url=https://www.researchgate.net/publication/287741874_TTS_synthesis_with_bidirectional_LSTM_based_Recurrent_Neural_Networks|title=TTS synthesis with bidirectional LSTM based Recurrent Neural Networks|last=Fan|first=Y.|last2=Qian|first2=Y.|date=2014|website=ResearchGate|language=en|archive-url=|archive-date=|dead-url=|access-date=2017-06-13|last3=Xie|first3=F.|last4=Soong|first4=F. K.}}</ref> also for [[Google Android]],<ref name="scholarpedia2"/><ref name="zen2015">{{Cite web|url=https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43266.pdf|title=Unidirectional Long Short-Term Memory Recurrent Neural Network with Recurrent Output Layer for Low-Latency Speech Synthesis|last=Zen|first=Heiga|last2=Sak|first2=Hasim|date=2015|website=Google.com|publisher=ICASSP|pages=4470–4474|archive-url=|archive-date=|dead-url=|access-date=}}</ref> and photo-real talking heads.<ref name="fan2015">{{Cite journal|last=Fan|first=Bo|last2=Wang|first2=Lijuan|last3=Soong|first3=Frank K.|last4=Xie|first4=Lei|date=2015|title=Photo-Real Talking Head with Deep Bidirectional LSTM|url=https://www.microsoft.com/en-us/research/wp-content/uploads/2015/04/icassp2015_fanbo_1009.pdf|journal=Proceedings of ICASSP|volume=|pages=|via=}}</ref> In 2015, Google's speech recognition experienced a 49% improvement through CTC-trained LSTM.<ref name="sak2015">{{Cite web|url=http://googleresearch.blogspot.ch/2015/09/google-voice-search-faster-and-more.html|title=Google voice search: faster and more accurate|last=Sak|first=Haşim|last2=Senior|first2=Andrew|date=September 2015|website=|archive-url=|archive-date=|dead-url=|access-date=|last3=Rao|first3=Kanishka|last4=Beaufays|first4=Françoise|last5=Schalkwyk|first5=Johan}}</ref>
LSTM became popular in [[Natural language processing|natural language processing]]. Unlike previous models based on [[Hidden Markov model|HMMs]] and similar concepts, LSTM can learn to recognise [[context-sensitive languages]].<ref name="gers2001">{{cite journal|last2=Schmidhuber|first2=Jürgen|year=2001|title=LSTM Recurrent Networks Learn Simple Context Free and Context Sensitive Languages|url=|journal=IEEE Transactions on Neural Networks|volume=12|issue=6|pages=1333–1340|doi=10.1109/72.963769|last1=Gers|first1=Felix A.|authorlink2=Jürgen Schmidhuber}}</ref> LSTM improved machine translation,<ref>{{cite web | last=Huang | first=Jie | last2=Zhou | first2=Wengang | last3=Zhang | first3=Qilin | last4=Li | first4=Houqiang | last5=Li | first5=Weiping | title=Video-based Sign Language Recognition without Temporal Segmentation | eprint=1801.10111 | date=2018-01-30 | url=https://arxiv.org/pdf/1801.10111.pdf}}</ref><ref name="NIPS2014">{{Cite journal|last=Sutskever|first=L.|last2=Vinyals|first2=O.|last3=Le|first3=Q.|date=2014|title=Sequence to Sequence Learning with Neural Networks|url=https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf|journal=NIPS'14 Proceedings of the 27th International Conference on Neural Information Processing Systems |volume=2 |pages=3104–3112 |bibcode=2014arXiv1409.3215S |arxiv=1409.3215 |class=cs.CL}}</ref> [[language modeling]]<ref name="vinyals2016">{{cite arxiv|last=Jozefowicz|first=Rafal|last2=Vinyals|first2=Oriol|last3=Schuster|first3=Mike|last4=Shazeer|first4=Noam|last5=Wu|first5=Yonghui|date=2016-02-07|title=Exploring the Limits of Language Modeling|eprint=1602.02410|class=cs.CL}}</ref> and multilingual language processing.<ref name="gillick2015">{{cite arxiv|last=Gillick|first=Dan|last2=Brunk|first2=Cliff|last3=Vinyals|first3=Oriol|last4=Subramanya|first4=Amarnag|date=2015-11-30|title=Multilingual Language Processing From Bytes|eprint=1512.00103|class=cs.CL}}</ref> LSTM combined with CNNs improved automatic image captioning.<ref name="vinyals2015">{{cite arxiv|last=Vinyals|first=Oriol|last2=Toshev|first2=Alexander|last3=Bengio|first3=Samy|last4=Erhan|first4=Dumitru|date=2014-11-17|title=Show and Tell: A Neural Image Caption Generator|eprint=1411.4555|class=cs.CV}}</ref>
=== Reservoir computing ===
{{Main|Reservoir computing}}Deep reservoir computing and deep echo state networks (deepESNs)<ref>{{Cite journal|last=Gallicchio|first=Claudio|last2=Micheli|first2=Alessio|last3=Pedrelli|first3=Luca|title=Deep reservoir computing: A critical experimental analysis|url=http://www.sciencedirect.com/science/article/pii/S0925231217307567|journal=Neurocomputing|volume=268|pages=87|doi=10.1016/j.neucom.2016.12.089|year=2017}}</ref><ref>{{Cite journal|last=Gallicchio|first=Claudio|last2=Micheli|first2=Alessio|date=|title=Echo State Property of Deep Reservoir Computing Networks|url=https://link.springer.com/article/10.1007/s12559-017-9461-9|journal=Cognitive Computation|language=en|volume=9|issue=3|pages=337–350|doi=10.1007/s12559-017-9461-9|issn=1866-9956|via=|year=2017}}</ref> provide a framework for efficiently trained models for the hierarchical processing of temporal data, while enabling the investigation of the inherent role of layered composition in RNNs.
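The efficiency of reservoir computing comes from training only a linear readout on top of a fixed random recurrent "reservoir". The sketch below builds a single-layer echo state network in NumPy with a ridge-regression readout; the reservoir size, spectral-radius scaling and ridge penalty are illustrative, and a deepESN would stack several such reservoirs.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res = 1, 100

# Fixed random input and reservoir weights (never trained).
W_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # keep spectral radius below 1

def run_reservoir(u_seq):
    """Collect reservoir states for an input sequence."""
    x = np.zeros(n_res)
    states = []
    for u in u_seq:
        x = np.tanh(W_in @ np.atleast_1d(u) + W @ x)
        states.append(x.copy())
    return np.array(states)

# Toy task: predict the next value of a sine wave.
t = np.linspace(0, 20 * np.pi, 2000)
u = np.sin(t)
X = run_reservoir(u[:-1])
y = u[1:]

ridge = 1e-6                                      # only this readout is trained
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_res), X.T @ y)
pred = X @ W_out
print("readout MSE:", float(np.mean((pred - y) ** 2)))
</syntaxhighlight>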
=== Deep belief networks ===
{{main|Deep belief network}}
[[File:Restricted_Boltzmann_machine.svg|thumb|A [[restricted Boltzmann machine]] (RBM) with fully connected visible and hidden units. Note there are no hidden-hidden or visible-visible connections.]]
A deep belief network (DBN) is a probabilistic, [[generative model]] made up of multiple layers of hidden units. It can be considered a [[Function composition|composition]] of simple learning modules that make up each layer.<ref name="SCHOLARDBNS">{{cite journal|year=2009|title=Deep belief networks|url=|journal=Scholarpedia|volume=4|issue=5|page=5947|doi=10.4249/scholarpedia.5947|last1=Hinton|first1=G.E.|bibcode=2009SchpJ...4.5947H}}</ref>

A DBN can be used to generatively pre-train a DNN by using the learned DBN weights as the initial DNN weights. Backpropagation or other discriminative algorithms can then tune these weights. This is particularly helpful when training data are limited, because poorly initialized weights can significantly hinder model performance. These pre-trained weights end up in a region of the weight space that is closer to the optimal weights than randomly chosen initial weights. This allows for both improved modeling and faster convergence of the fine-tuning phase.<ref>{{Cite journal|last=Larochelle|first=Hugo|last2=Erhan|first2=Dumitru|last3=Courville|first3=Aaron|last4=Bergstra|first4=James|last5=Bengio|first5=Yoshua|date=2007|title=An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation|url=http://doi.acm.org/10.1145/1273496.1273556|journal=Proceedings of the 24th International Conference on Machine Learning|series=ICML '07|location=New York, NY, USA|publisher=ACM|pages=473–480|doi=10.1145/1273496.1273556|isbn=9781595937933}}</ref>
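A DBN is typically built by greedily training a stack of [[restricted Boltzmann machine]]s, each with a single step of contrastive divergence. The NumPy sketch below shows such a CD-1 update for one binary RBM layer; the layer sizes, learning rate and per-sample update are illustrative simplifications, not the exact recipe of any cited work.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 6, 4, 0.1
W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0):
    """One contrastive-divergence (CD-1) step on a single binary visible vector."""
    global W, b_v, b_h
    # positive phase: hidden probabilities given the data
    ph0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(n_hidden) < ph0).astype(float)
    # negative phase: one Gibbs step back to a "reconstruction"
    pv1 = sigmoid(h0 @ W.T + b_v)
    v1 = (rng.random(n_visible) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b_h)
    # gradient approximation: data statistics minus reconstruction statistics
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b_v += lr * (v0 - v1)
    b_h += lr * (ph0 - ph1)

data = (rng.random((500, n_visible)) < 0.3).astype(float)  # toy binary data
for epoch in range(10):
    for v in data:
        cd1_update(v)
# The learned W would initialize the first layer of a DBN/DNN;
# the hidden activations sigmoid(v @ W + b_h) become the data for the next RBM.
</syntaxhighlight>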
=== Large memory storage and retrieval neural networks ===
Large memory storage and retrieval neural networks (LAMSTAR)<ref name="book2013">{{cite book|url={{google books |plainurl=y |id=W6W6CgAAQBAJ&pg=PP1}}|title=Principles of Artificial Neural Networks|last=Graupe|first=Daniel|publisher=World Scientific|year=2013|isbn=978-981-4522-74-8|location=|pages=1–|ref=harv}}</ref><ref name="GrPatent">{{Patent|US|5920852 A|D. Graupe," Large memory storage and retrieval (LAMSTAR) network, April 1996}}</ref> are fast deep learning neural networks of many layers that can use many filters simultaneously. These filters may be nonlinear, stochastic, logic, [[non-stationary]], or even non-analytical. They are biologically motivated and learn continuously.

A LAMSTAR neural network may serve as a dynamic neural network in spatial or time domains or both. Its speed is provided by [[Hebbian]] link-weights<ref name=book2013a>D. Graupe, "Principles of Artificial Neural Networks. 3rd Edition", World Scientific Publishers, 2013, pp. 203–274.</ref> that integrate the various and usually different filters (preprocessing functions) into its many layers and dynamically rank the significance of the various layers and functions relative to a given learning task. This grossly imitates biological learning, which integrates various preprocessors ([[cochlea]], [[retina]], ''etc.'') and cortexes ([[Auditory cortex|auditory]], [[Visual cortex|visual]], ''etc.'') and their various regions. Its deep learning capability is further enhanced by using inhibition, correlation and its ability to cope with incomplete data, or "lost" neurons or layers even amidst a task. It is fully transparent due to its link weights. The link-weights allow dynamic determination of innovation and redundancy, and facilitate the ranking of layers, of filters or of individual neurons relative to a task.

LAMSTAR has been applied to many domains, including medical<ref>{{Cite journal|last=Nigam|first=Vivek Prakash|last2=Graupe|first2=Daniel|date=2004-01-01|title=A neural-network-based detection of epilepsy|journal=Neurological Research|volume=26|issue=1|pages=55–60|doi=10.1179/016164104773026534|issn=0161-6412|pmid=14977058}}</ref><ref name=":11">{{Cite journal|last=Waxman|first=Jonathan A.|last2=Graupe|first2=Daniel|last3=Carley|first3=David W.|date=2010-04-01|title=Automated Prediction of Apnea and Hypopnea, Using a LAMSTAR Artificial Neural Network|url=http://www.atsjournals.org/doi/abs/10.1164/rccm.200907-1146OC|journal=American Journal of Respiratory and Critical Care Medicine|volume=181|issue=7|pages=727–733|doi=10.1164/rccm.200907-1146oc|issn=1073-449X}}</ref><ref name="GrGrZh">{{cite journal|last2=Graupe|first2=M. H.|last3=Zhong|first3=Y.|last4=Jackson|first4=R. K.|year=2008|title=Blind adaptive filtering for non-invasive extraction of the fetal electrocardiogram and its non-stationarities|url=|journal=Proc. Inst. Mech Eng., UK, Part H: Journal of Engineering in Medicine|volume=222|issue=8|pages=1221–1234|doi=10.1243/09544119jeim417|last1=Graupe|first1=D.}}</ref> and financial predictions,<ref name="book2013b">{{harvnb|Graupe|2013|pp=240–253}}</ref> adaptive filtering of noisy speech in unknown noise,<ref name="GrAbon">{{cite journal|last2=Abon|first2=J.|year=2002|title=A Neural Network for Blind Adaptive Filtering of Unknown Noise from Speech|url=https://www.tib.eu/en/search/id/BLCP:CN019373941/Blind-Adaptive-Filtering-of-Speech-from-Noise-of/|journal=Intelligent Engineering Systems Through Artificial Neural Networks|language=en|publisher=Technische Informationsbibliothek (TIB)|volume=12|issue=|pages=683–688|last1=Graupe|first1=D.|accessdate=2017-06-14}}</ref> still-image recognition,<ref name="book2013c">D. Graupe, "Principles of Artificial Neural Networks. 3rd Edition", World Scientific Publishers, 2013, pp. 253–274.</ref> video image recognition,<ref name="Girado">{{cite journal|last2=Sandin|first2=D. J.|last3=DeFanti|first3=T. A.|year=2003|title=Real-time camera-based face detection using a modified LAMSTAR neural network system|url=|journal=Proc. SPIE 5015, Applications of Artificial Neural Networks in Image Processing VIII|volume=5015|issue=|pages=36|page=|doi=10.1117/12.477405|last1=Girado|first1=J. I.|series=Applications of Artificial Neural Networks in Image Processing VIII|bibcode=2003SPIE.5015...36G}}</ref> software security<ref name="VenkSel">{{cite journal|last2=Selvan|first2=S.|year=2007|title=Intrusion Detection using an Improved Competitive Learning Lamstar Network|url=|journal=International Journal of Computer Science and Network Security|volume=7|issue=2|pages=255–263|last1=Venkatachalam|first1=V}}</ref> and adaptive control of non-linear systems.<ref>{{Cite web|url=https://www.researchgate.net/publication/262316982_Control_of_unstable_nonlinear_and_nonstationary_systems_using_LAMSTAR_neural_networks|title=Control of unstable nonlinear and nonstationary systems using LAMSTAR neural networks|last=Graupe|first=D.|last2=Smollack|first2=M.|date=2007|website=ResearchGate|publisher=Proceedings of 10th IASTED on Intelligent Control, Sect.592,|pages=141–144|language=en|archive-url=|archive-date=|dead-url=|access-date=2017-06-14}}</ref> In 20 comparative studies, LAMSTAR had a much faster learning speed and a somewhat lower error rate than a CNN based on [[ReLU]]-function filters and max pooling.<ref name="book1016">{{cite book|url={{google books |plainurl=y |id=e5hIDQAAQBAJ|page=57}}|title=Deep Learning Neural Networks: Design and Case Studies|last=Graupe|first=Daniel|date=7 July 2016|publisher=World Scientific Publishing Co Inc|year=|isbn=978-981-314-647-1|location=|pages=57–110}}</ref>

These applications demonstrate delving into aspects of the data that are hidden from shallow learning networks and the human senses, such as in predicting the onset of [[sleep apnea]] events,<ref name=":11" /> the electrocardiogram of a fetus as recorded from skin-surface electrodes placed on the mother's abdomen early in pregnancy,<ref name="GrGrZh" /> financial prediction<ref name="book2013" /> and blind filtering of noisy speech.<ref name="GrAbon" />

LAMSTAR was proposed in 1996 ({{US Patent|5920852 A}}) and was further developed by Graupe and Kordylewski from 1997 to 2002.<ref>{{Cite journal|last=Graupe|first=D.|last2=Kordylewski|first2=H.|date=August 1996|title=Network based on SOM (Self-Organizing-Map) modules combined with statistical decision tools|url=http://ieeexplore.ieee.org/document/594203/|journal=Proceedings of the 39th Midwest Symposium on Circuits and Systems|volume=1|pages=471–474 vol.1|doi=10.1109/mwscas.1996.594203|isbn=0-7803-3636-4}}</ref><ref>{{Cite journal|last=Graupe|first=D.|last2=Kordylewski|first2=H.|date=1998-03-01|title=A Large Memory Storage and Retrieval Neural Network for Adaptive Retrieval and Diagnosis|url=http://www.worldscientific.com/doi/abs/10.1142/S0218194098000091|journal=International Journal of Software Engineering and Knowledge Engineering|volume=08|issue=1|pages=115–138|doi=10.1142/s0218194098000091|issn=0218-1940}}</ref><ref name="Kordylew">{{cite journal|last2=Graupe|first2=D|last3=Liu|first3=K.|year=2001|title=A novel large-memory neural network as an aid in medical diagnosis applications|url=|journal=IEEE Transactions on Information Technology in Biomedicine|volume=5|issue=3|pages=202–209|doi=10.1109/4233.945291|last1=Kordylewski|first1=H.}}</ref> A modified version, known as LAMSTAR 2, was developed by Schneider and Graupe in 2008.<ref name="Schn">{{cite journal|last2=Graupe|year=2008|title=A modified LAMSTAR neural network and its applications|url=|journal=International journal of neural systems|volume=18|issue=4|pages=331–337|doi=10.1142/s0129065708001634|last1=Schneider|first1=N.C.}}</ref><ref name="book2013d">{{harvnb|Graupe|2013|p=217}}</ref>
=== Stacked (de-noising) auto-encoders ===
The [[auto encoder]] idea is motivated by the concept of a ''good'' representation. For example, for a [[Linear classifier|classifier]], a good representation can be defined as one that yields a better-performing classifier.

An ''encoder'' is a deterministic mapping <math>f_\theta</math> that transforms an input vector '''''x''''' into a hidden representation '''''y''''', where <math>\theta = \{\boldsymbol{W}, b\}</math>, <math>\boldsymbol{W}</math> is the weight matrix and '''''b''''' is an offset vector (bias). A ''decoder'' maps the hidden representation '''''y''''' back to the reconstructed input '''''z''''' via <math>g_\theta</math>. Auto encoding consists of comparing this reconstructed input with the original and minimizing the error so that the reconstructed value is as close as possible to the original.

In ''stacked denoising auto encoders'', the partially corrupted output is cleaned (de-noised). This idea was introduced in 2010 by Vincent et al.<ref name="ref9">{{cite journal|last2=Larochelle|first2=Hugo|last3=Lajoie|first3=Isabelle|last4=Bengio|first4=Yoshua|last5=Manzagol|first5=Pierre-Antoine|date=2010|title=Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion|url=http://dl.acm.org/citation.cfm?id=1953039|journal=The Journal of Machine Learning Research|volume=11|pages=3371–3408|last1=Vincent|first1=Pascal}}</ref> with a specific approach to ''good'' representation: a ''good'' representation is one that can be obtained [[Robustness (computer science)|robustly]] from a corrupted input and that is useful for recovering the corresponding clean input. Implicit in this definition are the following ideas:
* The higher-level representations are relatively stable and robust to input corruption;
* It is necessary to extract features that are useful for representation of the input distribution.
The algorithm starts with a stochastic mapping of <math>\boldsymbol{x}</math> to <math>\tilde{\boldsymbol{x}}</math> through <math>q_D(\tilde{\boldsymbol{x}}|\boldsymbol{x})</math>; this is the corrupting step. Then the corrupted input <math>\tilde{\boldsymbol{x}}</math> passes through a basic auto-encoder process and is mapped to a hidden representation <math>\boldsymbol{y} = f_\theta(\tilde{\boldsymbol{x}}) = s(\boldsymbol{W}\tilde{\boldsymbol{x}}+b)</math>. From this hidden representation, we can reconstruct <math>\boldsymbol{z} = g_\theta(\boldsymbol{y})</math>. In the last stage, a minimization algorithm runs in order to make '''''z''''' as close as possible to the uncorrupted input <math>\boldsymbol{x}</math>. The reconstruction error <math>L_H(\boldsymbol{x},\boldsymbol{z})</math> might be either the [[cross-entropy]] loss with an affine-sigmoid decoder, or the squared error loss with an [[Affine transformation|affine]] decoder.<ref name="ref9" />
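A minimal sketch of this procedure for a single denoising auto-encoder layer is given below in NumPy; it assumes masking noise for <math>q_D</math>, a sigmoid encoder and decoder with tied weights, squared-error loss and plain gradient descent, all of which are illustrative choices rather than the exact setup of Vincent et al.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, lr, noise = 20, 8, 0.1, 0.3

W = rng.normal(scale=0.1, size=(n_in, n_hid))
b = np.zeros(n_hid)          # encoder bias
b_prime = np.zeros(n_in)     # decoder bias (the tied weights W are reused, transposed)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x):
    global W, b, b_prime
    x_tilde = x * (rng.random(n_in) > noise)      # corrupting step q_D: mask inputs
    y = sigmoid(x_tilde @ W + b)                  # hidden representation
    z = sigmoid(y @ W.T + b_prime)                # reconstruction
    err = z - x                                   # derivative of 0.5*||z - x||^2 w.r.t. z
    dz = err * z * (1 - z)                        # back through the decoder sigmoid
    dy = (dz @ W) * y * (1 - y)                   # back through the encoder sigmoid
    W -= lr * (np.outer(x_tilde, dy) + np.outer(dz, y))  # tied-weight gradient
    b -= lr * dy
    b_prime -= lr * dz

data = (rng.random((1000, n_in)) < 0.2).astype(float)    # toy binary inputs
for x in data:
    train_step(x)
# sigmoid(data @ W + b) now gives level-1 features; a second denoising
# auto-encoder would be trained on these features to build a stack.
</syntaxhighlight>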
In order to make a deep architecture, auto encoders are stacked.<ref name="ballard1987">{{Cite web|url=http://www.aaai.org/Papers/AAAI/1987/AAAI87-050.pdf|title=Modular learning in neural networks|last=Ballard|first=Dana H.|date=1987|website=Proceedings of AAAI|pages=279–284|archive-url=|archive-date=|dead-url=|access-date=}}</ref> Once the encoding function <math>f_\theta</math> of the first denoising auto encoder is learned and used to de-noise the corrupted input, the second level can be trained.<ref name="ref9" />

Once the stacked auto encoder is trained, its output can be used as the input to a [[supervised learning]] algorithm such as a [[support vector machine]] classifier or a multi-class [[logistic regression]].<ref name="ref9" />
=== Deep stacking networks ===
A deep stacking network (DSN)<ref name="ref17">{{cite journal|last2=Yu|first2=Dong|last3=Platt|first3=John|date=2012|title=Scalable stacking and learning for building deep architectures|url=http://research-srv.microsoft.com/pubs/157586/DSN-ICASSP2012.pdf|journal=2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)|pages=2133–2136|last1=Deng|first1=Li}}</ref> (deep convex network) is based on a hierarchy of blocks of simplified neural network modules. It was introduced in 2011 by Deng and Dong.<ref name="ref16">{{cite journal|last2=Yu|first2=Dong|date=2011|title=Deep Convex Net: A Scalable Architecture for Speech Pattern Classification|url=http://www.truebluenegotiations.com/files/deepconvexnetwork-interspeech2011-pub.pdf|journal=Proceedings of the Interspeech|pages=2285–2288|last1=Deng|first1=Li}}</ref> It formulates the learning as a [[convex optimization problem]] with a [[Closed-form expression|closed-form solution]], emphasizing the mechanism's similarity to [[Ensemble learning|stacked generalization]].<ref name="ref18">{{cite journal|date=1992|title=Stacked generalization|journal=Neural Networks|volume=5|issue=2|pages=241–259|doi=10.1016/S0893-6080(05)80023-1|last1=David|first1=Wolpert}}</ref> Each DSN block is a simple module that is easy to train by itself in a [[Supervised learning|supervised]] fashion, without backpropagation through the entire stack of blocks.<ref>{{Cite journal|last=Bengio|first=Y.|date=2009-11-15|title=Learning Deep Architectures for AI|url=http://www.nowpublishers.com/article/Details/MAL-006|journal=Foundations and Trends® in Machine Learning|language=English|volume=2|issue=1|pages=1–127|doi=10.1561/2200000006|issn=1935-8237}}</ref>

Each block consists of a simplified [[multi-layer perceptron]] (MLP) with a single hidden layer. The hidden layer '''''h''''' has logistic [[Sigmoid function|sigmoidal]] [[Artificial neuron|units]], and the output layer has linear units. Connections between these layers are represented by weight matrix '''''U'''''; input-to-hidden-layer connections have weight matrix '''''W'''''. Target vectors '''''t''''' form the columns of matrix '''''T''''', and the input data vectors '''''x''''' form the columns of matrix '''''X'''''. The matrix of hidden units is <math>\boldsymbol{H} = \sigma(\boldsymbol{W}^T\boldsymbol{X})</math>, where <math>\sigma</math> denotes the element-wise [[Logistic function|logistic sigmoid]] function. Modules are trained in order, so lower-layer weights '''''W''''' are known at each stage. Each block estimates the same final label class ''y'', and its estimate is concatenated with the original input '''''X''''' to form the expanded input for the next block. Thus, the input to the first block contains the original data only, while downstream blocks' input adds the output of preceding blocks. Learning the upper-layer weight matrix '''''U''''' given the other weights in the network can then be formulated as a convex optimization problem:
:<math>\min_{\boldsymbol{U}} f = \| \boldsymbol{U}^T \boldsymbol{H} - \boldsymbol{T} \|_F^2 ,</math>
which has a closed-form solution.
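The closed-form solution for the upper-layer weights is ordinary (optionally ridge-regularized) least squares on the hidden-unit outputs. The NumPy sketch below trains one such block; the sizes, the ridge term and the random choice of '''''W''''' stand in for whatever the lower-layer training would provide and are purely illustrative.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, n_classes, n = 10, 50, 3, 500

X = rng.normal(size=(d_in, n))                            # input vectors as columns
T = np.eye(n_classes)[rng.integers(0, n_classes, n)].T    # one-hot targets as columns

W = rng.normal(size=(d_in, d_hidden))                     # lower-layer weights (assumed given)
H = 1.0 / (1.0 + np.exp(-(W.T @ X)))                      # hidden-unit matrix H = sigma(W^T X)

# Convex upper-layer problem  min_U ||U^T H - T||_F^2  solved in closed form.
ridge = 1e-3
U = np.linalg.solve(H @ H.T + ridge * np.eye(d_hidden), H @ T.T)

Y = U.T @ H                                               # block output (class scores)
print("training fit (mean squared error):", float(np.mean((Y - T) ** 2)))
# In a DSN, the concatenation of X and Y would become the expanded input of the next block.
</syntaxhighlight>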
Unlike other deep architectures, such as DBNs, the goal is not to discover the transformed [[Feature (machine learning)|feature]] representation. The structure of the hierarchy of this kind of architecture makes parallel learning straightforward, as a batch-mode optimization problem. In purely [[Discriminative model|discriminative tasks]], DSNs perform better than conventional [[Deep belief network|DBN]]<nowiki/>s.<ref name="ref17" />
=== Tensor deep stacking networks ===
This architecture is a DSN extension. It offers two important improvements: it uses higher-order information from [[covariance]] statistics, and it transforms the [[Convex optimization|non-convex problem]] of a lower layer to a convex sub-problem of an upper layer.<ref name="ref19">{{cite journal|last2=Deng|first2=Li|last3=Yu|first3=Dong|date=2012|title=Tensor deep stacking networks|journal=IEEE Transactions on Pattern Analysis and Machine Intelligence|volume=1–15|issue=8|pages=1944–1957|doi=10.1109/tpami.2012.268|last1=Hutchinson|first1=Brian}}</ref> TDSNs use covariance statistics in a [[bilinear map]]ping from each of two distinct sets of hidden units in the same layer to predictions, via a third-order [[tensor]].
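Such a bilinear, tensor-based prediction can be written as <math>y_k = \sum_{i,j} h^{(1)}_i \, T_{ijk} \, h^{(2)}_j</math> for two hidden-unit vectors <math>h^{(1)}, h^{(2)}</math> and a third-order weight tensor <math>T</math>; the sizes and random values in the NumPy illustration below are arbitrary and it shows only the mapping itself, not TDSN training.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
h1 = rng.random(4)            # first set of hidden units
h2 = rng.random(5)            # second set of hidden units
T = rng.random((4, 5, 3))     # third-order weight tensor

y = np.einsum('i,ijk,j->k', h1, T, h2)   # bilinear mapping to 3 prediction units
print(y.shape)                # (3,)
</syntaxhighlight>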
While parallelization and scalability are not considered seriously in conventional {{H:title|Deep neural networks|DNNs}},<ref name="ref26">{{cite journal|last2=Salakhutdinov|first2=Ruslan|date=2006|title=Reducing the Dimensionality of Data with Neural Networks|journal=Science|volume=313|issue=5786|pages=504–507|doi=10.1126/science.1127647|pmid=16873662|last1=Hinton|first1=Geoffrey|bibcode=2006Sci...313..504H}}</ref><ref name="ref27">{{cite journal|last2=Yu|first2=D.|last3=Deng|first3=L.|last4=Acero|first4=A.|date=2012|title=Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition|journal=IEEE Transactions on Audio, Speech, and Language Processing|volume=20|issue=1|pages=30–42|doi=10.1109/tasl.2011.2134090|last1=Dahl|first1=G.}}</ref><ref name="ref28">{{cite journal|last2=Dahl|first2=George|last3=Hinton|first3=Geoffrey|date=2012|title=Acoustic Modeling Using Deep Belief Networks|journal=IEEE Transactions on Audio, Speech, and Language Processing|volume=20|issue=1|pages=14–22|doi=10.1109/tasl.2011.2109382|last1=Mohamed|first1=Abdel-rahman}}</ref> all learning for {{H:title|Deep stacking network|DSN}}s and {{H:title|Tensor deep stacking network|TDSN}}s is done in batch mode, to allow parallelization.<ref name="ref16" /><ref name="ref17" /> Parallelization allows scaling the design to larger (deeper) architectures and data sets.
The basic architecture is suitable for diverse tasks such as [[Statistical classification|classification]] and [[Regression analysis|regression]].
=== Spike-and-slab RBMs ===
The need for deep learning with [[Real number|real-valued]] inputs, as in Gaussian restricted Boltzmann machines, led to the ''spike-and-slab'' [[Restricted Boltzmann machine|RBM]] (''ss''RBM), which models continuous-valued inputs with binary [[latent variable]]s (spikes) augmented by real-valued ones (slabs).<ref name="ref30">{{cite journal|last2=Bergstra|first2=James|last3=Bengio|first3=Yoshua|date=2011|title=A Spike and Slab Restricted Boltzmann Machine|url=http://machinelearning.wustl.edu/mlpapers/paper_files/AISTATS2011_CourvilleBB11.pdf|journal=JMLR: Workshop and Conference Proceeding|volume=15|pages=233–241|last1=Courville|first1=Aaron}}</ref> Like basic RBMs and their variants, a spike-and-slab RBM is a [[bipartite graph]], and like GRBMs, the visible units (input) are real-valued. The difference is in the hidden layer, where each hidden unit has a binary spike variable and a real-valued slab variable. A spike is a discrete [[probability mass]] at zero, while a slab is a [[Probability density|density]] over a continuous domain;<ref name="ref32">{{cite conference|last1=Courville|first1=Aaron|last2=Bergstra|first2=James|last3=Bengio|first3=Yoshua|chapter=Unsupervised Models of Images by Spike-and-Slab RBMs|title=Proceedings of the 28th International Conference on Machine Learning|volume=10|pages=1–8|date=2011|url=http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Courville_591.pdf}}</ref> their mixture forms a [[Prior probability|prior]].<ref name="ref31">{{cite journal|last2=Beauchamp|first2=J|date=1988|title=Bayesian Variable Selection in Linear Regression|journal=Journal of the American Statistical Association|volume=83|issue=404|pages=1023–1032|doi=10.1080/01621459.1988.10478694|last1=Mitchell|first1=T}}</ref>
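The following NumPy sketch illustrates only the form of the hidden units just described: each hidden unit pairs a binary spike with a real-valued slab, so a sample of its contribution is either exactly zero or a draw from a continuous density. The spike probability and slab variance are arbitrary illustrative numbers, not parameters of a trained ssRBM.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n_hidden, p_spike, slab_std = 5, 0.3, 1.0

spike = (rng.random(n_hidden) < p_spike).astype(float)  # binary spike variables
slab = rng.normal(scale=slab_std, size=n_hidden)        # real-valued slab variables
h = spike * slab                                        # spike-and-slab hidden activity
print(spike, np.round(h, 2))
# Roughly 70% of the entries of h are exactly zero (the point mass, or "spike");
# the rest are real-valued draws from the continuous density (the "slab").
</syntaxhighlight>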
An extension of the ssRBM, called µ-ssRBM, provides extra modeling capacity using additional terms in the [[energy function]]. One of these terms enables the model to form a [[Conditional probability distribution|conditional distribution]] of the spike variables by [[marginalizing out]] the slab variables given an observation.
=== Compound hierarchical-deep models ===
Compound hierarchical-deep models compose deep networks with non-parametric [[Bayesian network|Bayesian models]]. [[Feature (machine learning)|Features]] can be learned using deep architectures such as DBNs,<ref name="hinton2006" /> DBMs,<ref name="ref3">{{cite journal|last1=Hinton|first1=Geoffrey|last2=Salakhutdinov|first2=Ruslan|date=2009|title=Efficient Learning of Deep Boltzmann Machines|url=http://machinelearning.wustl.edu/mlpapers/paper_files/AISTATS09_SalakhutdinovH.pdf|volume=3|pages=448–455}}</ref> deep auto encoders,<ref name="ref15">{{cite journal|last2=Bengio|first2=Yoshua|last3=Louradour|first3=Jerdme|last4=Lamblin|first4=Pascal|date=2009|title=Exploring Strategies for Training Deep Neural Networks|url=http://dl.acm.org/citation.cfm?id=1577070|journal=The Journal of Machine Learning Research|volume=10|pages=1–40|last1=Larochelle|first1=Hugo}}</ref> convolutional variants,<ref name="ref39">{{cite journal|last2=Carpenter|first2=Blake|date=2011|title=Text Detection and Character Recognition in Scene Images with Unsupervised Feature Learning|url=http://www.iapr-tc11.org/archive/icdar2011/fileup/PDF/4520a440.pdf|journal=|volume=|pages=440–445|via=|last1=Coates|first1=Adam}}</ref><ref name="ref40">{{cite journal|last2=Grosse|first2=Roger|date=2009|title=Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations|url=http://portal.acm.org/citation.cfm?doid=1553374.1553453|journal=Proceedings of the 26th Annual International Conference on Machine Learning|pages=1–8|last1=Lee|first1=Honglak}}</ref> ssRBMs,<ref name="ref32" /> deep coding networks,<ref name="ref41">{{cite journal|last2=Zhang|first2=Tong|date=2010|title=Deep Coding Network|url=http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2010_1077.pdf|journal=Advances in Neural . . .|pages=1–9|last1=Lin|first1=Yuanqing}}</ref> DBNs with sparse feature learning,<ref name="ref42">{{cite journal|last2=Boureau|first2=Y-Lan|date=2007|title=Sparse Feature Learning for Deep Belief Networks|url=http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2007_1118.pdf|journal=Advances in Neural Information Processing Systems|volume=23|pages=1–8|last1=Ranzato|first1=Marc Aurelio}}</ref> RNNs,<ref name="ref43">{{cite journal|last2=Lin|first2=Clif|date=2011|title=Parsing Natural Scenes and Natural Language with Recursive Neural Networks|url=http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Socher_125.pdf|journal=Proceedings of the 26th International Conference on Machine Learning|last1=Socher|first1=Richard}}</ref> conditional DBNs,<ref name="ref44">{{cite journal|last2=Hinton|first2=Geoffrey|date=2006|title=Modeling Human Motion Using Binary Latent Variables|url=http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_693.pdf|journal=Advances in Neural Information Processing Systems|last1=Taylor|first1=Graham}}</ref> and de-noising auto encoders.<ref name="ref45">{{cite journal|last2=Larochelle|first2=Hugo|date=2008|title=Extracting and composing robust features with denoising autoencoders|url=http://portal.acm.org/citation.cfm?doid=1390156.1390294|journal=Proceedings of the 25th international conference on Machine learning – ICML '08|pages=1096–1103|last1=Vincent|first1=Pascal}}</ref> This provides a better representation, allowing faster learning and more accurate classification with high-dimensional data. However, these architectures are poor at learning novel classes with few examples, because all network units are involved in representing the input (a '''{{visible anchor|distributed representation}}''') and must be adjusted together (a high [[degree of freedom]]). Limiting the degrees of freedom reduces the number of parameters to learn, facilitating learning of new classes from few examples. [[Hierarchical Bayesian model|''Hierarchical Bayesian (HB)'' models]] allow learning from few examples, for example<ref name="ref34">{{cite journal|last2=Perfors|first2=Amy|last3=Tenenbaum|first3=Joshua|date=2007|title=Learning overhypotheses with hierarchical Bayesian models|journal=Developmental Science|volume=10|issue=3|pages=307–21|doi=10.1111/j.1467-7687.2007.00585.x|pmid=17444972|last1=Kemp|first1=Charles}}</ref><ref name="ref37">{{cite journal|last2=Tenenbaum|first2=Joshua|date=2007|title=Word learning as Bayesian inference|journal=Psychol. Rev.|volume=114|issue=2|pages=245–72|doi=10.1037/0033-295X.114.2.245|pmid=17500627|last1=Xu|first1=Fei}}</ref><ref name="ref46">{{cite journal|last2=Polatkan|first2=Gungor|date=2011|title=The Hierarchical Beta Process for Convolutional Factor Analysis and Deep Learning|url=http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Chen_251.pdf|journal=Machine Learning . . .|last1=Chen|first1=Bo}}</ref><ref name="ref47">{{cite journal|last2=Fergus|first2=Rob|date=2006|title=One-shot learning of object categories|journal=IEEE Transactions on Pattern Analysis and Machine Intelligence|volume=28|issue=4|pages=594–611|doi=10.1109/TPAMI.2006.79|pmid=16566508|last1=Fei-Fei|first1=Li}}</ref><ref name="ref48">{{cite journal|last2=Dunson|first2=David|date=2008|title=The Nested Dirichlet Process|url=http://amstat.tandfonline.com/doi/full/10.1198/016214508000000553|journal=Journal of the American Statistical Association|volume=103|issue=483|pages=1131–1154|doi=10.1198/016214508000000553|last1=Rodriguez|first1=Abel}}</ref> in computer vision, [[statistics]] and cognitive science.
Compound HD architectures aim to integrate characteristics of both HB and deep networks. The compound HDP-DBM architecture combines a ''[[hierarchical Dirichlet process]] (HDP)'' as a hierarchical model with a DBM architecture. It is a full [[generative model]], generalized from abstract concepts flowing through the layers of the model, which is able to synthesize new examples in novel classes that look "reasonably" natural. All the levels are learned jointly by maximizing a joint [[Log probability|log-probability]] [[Score (statistics)|score]].<ref name="ref38">{{cite journal|last2=Joshua|first2=Tenenbaum|date=2012|title=Learning with Hierarchical-Deep Models|journal=IEEE Transactions on Pattern Analysis and Machine Intelligence|volume=35|issue=8|pages=1958–71|doi=10.1109/TPAMI.2012.269|pmid=23787346|last1=Ruslan|first1=Salakhutdinov}}</ref>
In a DBM with three hidden layers, the probability of a visible input '''{{mvar|ν}}''' is:
:<math>p(\boldsymbol{\nu}, \psi) = \frac{1}{Z}\sum_{h} \exp\left(\sum_{ij}W_{ij}^{(1)}\nu_i h_j^1 + \sum_{jl}W_{jl}^{(2)}h_j^1 h_l^2 + \sum_{lm}W_{lm}^{(3)}h_l^2 h_m^3\right),</math>
where <math>\boldsymbol{h} = \{\boldsymbol{h}^{(1)}, \boldsymbol{h}^{(2)}, \boldsymbol{h}^{(3)} \}</math> is the set of hidden units, and <math>\psi = \{\boldsymbol{W}^{(1)}, \boldsymbol{W}^{(2)}, \boldsymbol{W}^{(3)} \} </math> are the model parameters, representing visible-hidden and hidden-hidden symmetric interaction terms.
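For a toy model small enough to enumerate all hidden configurations, the marginal above can be computed directly. The NumPy sketch below does this brute-force sum for binary units; the layer sizes and random weights are arbitrary, and the enumeration is only feasible because the model is tiny.
<syntaxhighlight lang="python">
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n_v, n1, n2, n3 = 3, 2, 2, 2
W1 = rng.normal(scale=0.5, size=(n_v, n1))
W2 = rng.normal(scale=0.5, size=(n1, n2))
W3 = rng.normal(scale=0.5, size=(n2, n3))

def unnorm_p(v):
    """Sum of exponentiated interaction terms over all binary hidden configurations."""
    total = 0.0
    for h1 in product([0, 1], repeat=n1):
        for h2 in product([0, 1], repeat=n2):
            for h3 in product([0, 1], repeat=n3):
                h1a, h2a, h3a = map(np.array, (h1, h2, h3))
                e = v @ W1 @ h1a + h1a @ W2 @ h2a + h2a @ W3 @ h3a
                total += np.exp(e)
    return total

vs = [np.array(v) for v in product([0, 1], repeat=n_v)]
Z = sum(unnorm_p(v) for v in vs)                 # partition function over visible states
probs = [unnorm_p(v) / Z for v in vs]
print(np.round(probs, 3), "sum =", round(sum(probs), 3))   # a valid distribution over v
</syntaxhighlight>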
A learned DBM model is an undirected model that defines the joint distribution <math>P(\nu, h^1, h^2, h^3)</math>. One way to express what has been learned is the [[Discriminative model|conditional model]] <math>P(\nu, h^1, h^2|h^3)</math> and a prior term <math>P(h^3)</math>.
Here <math>P(\nu, h^1, h^2|h^3)</math> represents a conditional DBM model, which can be viewed as a two-layer DBM but with bias terms given by the states of <math>h^3</math>:
:<math>P(\nu, h^1, h^2|h^3) = \frac{1}{Z(\psi, h^3)} \exp\left(\sum_{ij}W_{ij}^{(1)}\nu_i h_j^1 + \sum_{jl}W_{jl}^{(2)}h_j^1 h_l^2 + \sum_{lm}W_{lm}^{(3)}h_l^2 h_m^3\right).</math>
=== Deep predictive coding networks ===
A deep predictive coding network (DPCN) is a [[Predictive modelling|predictive]] coding scheme that uses top-down information to empirically adjust the priors needed for a bottom-up [[inference]] procedure by means of a deep, locally connected, [[generative model]]. This works by extracting sparse [[Feature (machine learning)|features]] from time-varying observations using a linear dynamical model. Then, a pooling strategy is used to learn invariant feature representations. These units compose to form a deep architecture and are trained by [[Greedy algorithm|greedy]] layer-wise [[unsupervised learning]]. The layers constitute a kind of [[Markov chain]] such that the states at any layer depend only on the preceding and succeeding layers.
DPCNs predict the representation of the layer by using a top-down approach that combines the information in the upper layer with temporal dependencies from previous states.<ref name="ref56">{{cite arXiv|eprint=1301.3541|first2=Jose|last2=Principe|title=Deep Predictive Coding Networks|date=2013|last1=Chalasani|first1=Rakesh|class=cs.LG}}</ref>
DPCNs can be extended to form a [[Convolutional neural network|convolutional network]].<ref name="ref56" />
=== Networks with separate memory structures ===
Integrating external memory with ANNs dates to early research in distributed representations<ref name="Hinton, Geoffrey E 19842">{{Cite web|url=http://repository.cmu.edu/cgi/viewcontent.cgi?article=2841&context=compsci|title=Distributed representations|last=Hinton|first=Geoffrey E.|date=1984|website=|archive-url=|archive-date=|dead-url=|access-date=}}</ref> and [[Teuvo Kohonen|Kohonen]]'s [[self-organizing map]]s. For example, in [[sparse distributed memory]] or [[hierarchical temporal memory]], the patterns encoded by neural networks are used as addresses for [[content-addressable memory]], with "neurons" essentially serving as address [[encoder]]s and [[Binary decoder|decoders]]. However, the early controllers of such memories were not differentiable.
==== LSTM-related differentiable memory structures ====
Apart from [[long short-term memory]] (LSTM), other approaches also added differentiable memory to recurrent functions. For example:
* Differentiable push and pop actions for alternative memory networks called neural stack machines<ref name="S. Das, C.L. Giles p. 79">S. Das, C.L. Giles, G.Z. Sun, "Learning Context Free Grammars: Limitations of a Recurrent Neural Network with an External Stack Memory," Proc. 14th Annual Conf. of the Cog. Sci. Soc., p. 79, 1992.</ref><ref name="Mozer, M. C. 1993 pp. 863-870">{{Cite web|url=https://papers.nips.cc/paper/626-a-connectionist-symbol-manipulator-that-discovers-the-structure-of-context-free-languages|title=A connectionist symbol manipulator that discovers the structure of context-free languages|last=Mozer|first=M. C.|last2=Das|first2=S.|date=1993|website=|publisher=NIPS 5|pages=863–870|archive-url=|archive-date=|dead-url=|access-date=}}</ref>
* Memory networks where the control network's external differentiable storage is in the fast weights of another network<ref name="ReferenceC">{{cite journal|year=1992|title=Learning to control fast-weight memories: An alternative to recurrent nets|url=|journal=Neural Computation|volume=4|issue=1|pages=131–139|doi=10.1162/neco.1992.4.1.131|last1=Schmidhuber|first1=J.}}</ref>
* LSTM forget gates<ref name="F. Gers, N. Schraudolph 2002">{{cite journal|last2=Schraudolph|first2=N.|last3=Schmidhuber|first3=J.|date=|year=2002|title=Learning precise timing with LSTM recurrent networks|url=http://jmlr.org/papers/volume3/gers02a/gers02a.pdf|journal=JMLR|volume=3|issue=|pages=115–143|via=|last1=Gers|first1=F.}}</ref>
* Self-referential RNNs with special output units for addressing and rapidly manipulating the RNN's own weights in a differentiable fashion (internal storage)<ref name="J. Schmidhuber pages 191-195">{{Cite conference|author=[[Jürgen Schmidhuber]]|title=An introspective network that can learn to run its own weight change algorithm|booktitle=In Proc. of the Intl. Conf. on Artificial Neural Networks, Brighton|pages=191–195|publisher=IEE|year=1993|url=ftp://ftp.idsia.ch/pub/juergen/iee93self.ps.gz}}</ref><ref name="Hochreiter, Sepp 2001">{{cite journal|last2=Younger|first2=A. Steven|last3=Conwell|first3=Peter R.|date=|year=2001|title=Learning to Learn Using Gradient Descent|url=http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.5.323|journal=ICANN|volume=2130|issue=|pages=87–94|doi=|via=|last1=Hochreiter|first1=Sepp}}</ref>
* Learning to transduce with unbounded memory<ref name="Grefenstette, Edward 1506">Grefenstette, Edward, et al. [https://arxiv.org/pdf/1506.02516.pdf "Learning to Transduce with Unbounded Memory."]{{arxiv|1506.02516}} (2015).</ref>
==== Neural Turing machines ====
{{Main|Neural Turing machine}}Neural Turing machines<ref name="Graves, Alex 14102">Graves, Alex, Greg Wayne, and Ivo Danihelka. "Neural Turing Machines." {{arxiv|1410.5401}} (2014).</ref> couple LSTM networks to external memory resources, with which they can interact by attentional processes. The combined system is analogous to a [[Turing machine]] but is differentiable end-to-end, allowing it to be efficiently trained by [[gradient descent]]. Preliminary results demonstrate that neural Turing machines can infer simple algorithms such as copying, sorting and associative recall from input and output examples.
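The attentional read mentioned above can be illustrated with content-based addressing: the controller emits a key, memory locations are weighted by a softmax over their similarity to the key, and the read result is the weighted sum of memory rows. The NumPy sketch below shows only this differentiable read; the memory size, key and sharpening parameter are illustrative, and a full NTM adds location-based addressing and write operations.
<syntaxhighlight lang="python">
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def content_read(memory, key, beta=5.0):
    """Differentiable content-based read over a memory matrix (rows = locations)."""
    norms = np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8
    similarity = memory @ key / norms          # cosine similarity per location
    w = softmax(beta * similarity)             # attention weights over locations
    return w @ memory, w                       # weighted read vector and the weighting

rng = np.random.default_rng(0)
memory = rng.normal(size=(8, 4))               # 8 locations, 4-dimensional contents
key = memory[3] + 0.1 * rng.normal(size=4)     # a noisy query resembling row 3
read, w = content_read(memory, key)
print(np.argmax(w), np.round(read, 2))         # attention concentrates on location 3
</syntaxhighlight>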
[[Differentiable neural computer]]s (DNC) are an NTM extension. They outperformed neural Turing machines, [[long short-term memory]] systems and memory networks on sequence-processing tasks.<ref name=":02">{{Cite news|url=https://www.wired.co.uk/article/deepmind-ai-tube-london-underground|title=DeepMind's AI learned to ride the London Underground using human-like reason and memory|last=Burgess|first=Matt|newspaper=WIRED UK|language=en-GB|access-date=2016-10-19}}</ref><ref>{{Cite news|url=https://www.pcmag.com/news/348701/deepmind-ai-learns-to-navigate-london-tube|title=DeepMind AI 'Learns' to Navigate London Tube|newspaper=PCMAG|access-date=2016-10-19}}</ref><ref>{{Cite web|url=https://techcrunch.com/2016/10/13/__trashed-2/|title=DeepMind's differentiable neural computer helps you navigate the subway with its memory|last=Mannes|first=John|website=TechCrunch|access-date=2016-10-19}}</ref><ref>{{Cite journal|last=Graves|first=Alex|last2=Wayne|first2=Greg|last3=Reynolds|first3=Malcolm|last4=Harley|first4=Tim|last5=Danihelka|first5=Ivo|last6=Grabska-Barwińska|first6=Agnieszka|last7=Colmenarejo|first7=Sergio Gómez|last8=Grefenstette|first8=Edward|last9=Ramalho|first9=Tiago|date=2016-10-12|title=Hybrid computing using a neural network with dynamic external memory|url=http://www.nature.com/nature/journal/vaop/ncurrent/full/nature20101.html|journal=Nature|language=en|volume=538|issue=7626|doi=10.1038/nature20101|issn=1476-4687|pages=471–476|pmid=27732574|bibcode=2016Natur.538..471G}}</ref><ref>{{Cite web|url=https://deepmind.com/blog/differentiable-neural-computers/|title=Differentiable neural computers {{!}} DeepMind|website=DeepMind|access-date=2016-10-19}}</ref>
==== Semantic hashing ====
Approaches that represent previous experiences directly and [[Instance-based learning|use a similar experience to form a local model]] are often called [[K-nearest neighbor algorithm|nearest neighbour]] or [[K-nearest neighbors algorithm|k-nearest neighbors]] methods.<ref>{{cite journal|last2=Schaal|first2=Stefan|year=1995|title=Memory-based neural networks for robot learning|url=|journal=Neurocomputing|volume=9|issue=3|pages=243–269|doi=10.1016/0925-2312(95)00033-6|last1=Atkeson|first1=Christopher G.}}</ref> Deep learning is useful in semantic hashing,<ref>Salakhutdinov, Ruslan, and Geoffrey Hinton. [http://www.utstat.toronto.edu/~rsalakhu/papers/sdarticle.pdf "Semantic hashing."] International Journal of Approximate Reasoning 50.7 (2009): 969–978.</ref> where a deep [[graphical model]] models the word-count vectors<ref name="Le 2014">{{Cite arXiv|eprint=1405.4053|first=Quoc V.|last=Le|first2=Tomas|last2=Mikolov|title=Distributed representations of sentences and documents|year=2014|class=cs.CL}}</ref> obtained from a large set of documents. Documents are mapped to memory addresses in such a way that semantically similar documents are located at nearby addresses. Documents similar to a query document can then be found by accessing all the addresses that differ by only a few bits from the address of the query document. Unlike [[sparse distributed memory]] that operates on 1000-bit addresses, semantic hashing works on 32- or 64-bit addresses found in a conventional computer architecture.
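Retrieval then amounts to flipping a few bits of the query's code and looking up the resulting addresses. A small sketch of this lookup is given below, with made-up 8-bit codes; a real system would learn 32- or 64-bit codes with a deep graphical model rather than drawing them at random.
<syntaxhighlight lang="python">
import numpy as np
from itertools import combinations
from collections import defaultdict

rng = np.random.default_rng(0)
n_docs, n_bits = 1000, 8
codes = rng.integers(0, 2 ** n_bits, size=n_docs)        # stand-ins for learned binary codes

# Index: address -> documents stored at that address.
index = defaultdict(list)
for doc_id, code in enumerate(codes):
    index[int(code)].append(doc_id)

def neighbours(code, radius=2):
    """All addresses within the given Hamming distance of a code."""
    yield code
    for r in range(1, radius + 1):
        for bits in combinations(range(n_bits), r):
            flipped = code
            for b in bits:
                flipped ^= 1 << b
            yield flipped

query = int(codes[42])
hits = sorted(set(d for addr in neighbours(query) for d in index.get(addr, [])))
print(42 in hits, len(hits))    # the query document and its near-duplicates are retrieved
</syntaxhighlight>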
+
− +
− Memory networks<ref name="Weston, Jason 14102">Weston, Jason, Sumit Chopra, and Antoine Bordes. "Memory networks." {{arxiv|1410.3916}} (2014).</ref><ref>Sukhbaatar, Sainbayar, et al. "End-To-End Memory Networks." {{arxiv|1503.08895}} (2015).</ref> are another extension to neural networks incorporating [[long-term memory]]. The long-term memory can be read and written to, with the goal of using it for prediction. These models have been applied in the context of [[question answering]] (QA) where the long-term memory effectively acts as a (dynamic) knowledge base and the output is a textual response.<ref>Bordes, Antoine, et al. "Large-scale Simple Question Answering with Memory Networks." {{arxiv|1506.02075}} (2015).</ref> A team of electrical and computer engineers from UCLA Samueli School of Engineering has created a physical artificial neural network that can analyze large volumes of data and identify objects at the actual speed of light.<ref>{{Cite news|url=https://www.sciencedaily.com/releases/2018/08/180802130750.htm|title=AI device identifies objects at the speed of light: The 3D-printed artificial neural network can be used in medicine, robotics and security|work=ScienceDaily|access-date=2018-08-08|language=en}}</ref>+
− +
− Deep neural networks can be potentially improved by deepening and parameter reduction, while maintaining trainability. While training extremely deep (e.g., 1 million layers) neural networks might not be practical, [[CPU]]-like architectures such as pointer networks<ref>Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." {{arxiv|1506.03134}} (2015).</ref> and neural random-access machines<ref>Kurach, Karol, Andrychowicz, Marcin and Sutskever, Ilya. "Neural Random-Access Machines." {{arxiv|1511.06392}} (2015).</ref> overcome this limitation by using external [[random-access memory]] and other components that typically belong to a [[computer architecture]] such as [[Processor register|registers]], [[Arithmetic logic unit|ALU]] and [[Pointer (computer programming)|pointers]]. Such systems operate on [[probability distribution]] vectors stored in memory cells and registers. Thus, the model is fully differentiable and trains end-to-end. The key characteristic of these models is that their depth, the size of their short-term memory, and the number of parameters can be altered independently – unlike models like LSTM, whose number of parameters grows quadratically with memory size.+
− +
− Encoder–decoder frameworks are based on neural networks that map highly [[Structured prediction|structured]] input to highly structured output. The approach arose in the context of [[machine translation]],<ref>{{Cite web|url=http://www.aclweb.org/anthology/D13-1176|title=Recurrent continuous translation models|last=Kalchbrenner|first=N.|last2=Blunsom|first2=P.|date=2013|website=|publisher=EMNLP'2013|archive-url=|archive-date=|dead-url=|access-date=}}</ref><ref>{{Cite web|url=https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf|title=Sequence to sequence learning with neural networks|last=Sutskever|first=I.|last2=Vinyals|first2=O.|date=2014|website=|publisher=NIPS'2014|archive-url=|archive-date=|dead-url=|access-date=|last3=Le|first3=Q. V.}}</ref><ref>{{Cite journal|last=Cho|first=K.|last2=van Merrienboer|first2=B.|last3=Gulcehre|first3=C.|last4=Bougares|first4=F.|last5=Schwenk|first5=H.|last6=Bengio|first6=Y.|date=October 2014|title=Learning phrase representations using RNN encoder-decoder for statistical machine translation|journal=Proceedings of the Empiricial Methods in Natural Language Processing|volume=1406|pages=arXiv:1406.1078|via=|arxiv=1406.1078|bibcode=2014arXiv1406.1078C}}</ref> where the input and output are written sentences in two natural languages. In that work, an LSTM RNN or CNN was used as an encoder to summarize a source sentence, and the summary was decoded using a conditional RNN [[language model]] to produce the translation.<ref>Cho, Kyunghyun, Aaron Courville, and Yoshua Bengio. "Describing Multimedia Content using Attention-based Encoder–Decoder Networks." {{arxiv|1507.01053}} (2015).</ref> These systems share building blocks: gated RNNs and CNNs and trained attention mechanisms.+
− +
− Multilayer kernel machines (MKM) are a way of learning highly nonlinear functions by iterative application of weakly nonlinear kernels. They use the [[kernel principal component analysis]] (KPCA),<ref name="ref60">{{cite journal|last2=Smola|first2=Alexander|date=1998|title=Nonlinear component analysis as a kernel eigenvalue problem|journal=Neural computation|volume=(44)|issue=5|pages=1299–1319|doi=10.1162/089976698300017467|last1=Scholkopf|first1=B|citeseerx=10.1.1.53.8911}}</ref> as a method for the [[Unsupervised learning|unsupervised]] greedy layer-wise pre-training step of deep learning.<ref name="ref59">{{cite journal|date=2012|title=Kernel Methods for Deep Learning|url=http://cseweb.ucsd.edu/~yoc002/paper/thesis_youngmincho.pdf|pages=1–9|last1=Cho|first1=Youngmin}}</ref>+
− Layer <math>l+1</math> learns the representation of the previous layer <math>l</math>, extracting the <math>n_l</math> [[Principal component analysis|principal component]] (PC) of the projection layer <math>l</math> output in the feature domain induced by the kernel. For the sake of [[dimensionality reduction]] of the updated representation in each layer, a [[Supervised learning|supervised strategy]] selects the best informative features among features extracted by KPCA. The process is:+
− +
− +
− +
− Some drawbacks accompany the KPCA method as the building cells of an MKM.+
+
− A more straightforward way to use kernel machines for deep learning was developed for spoken language understanding.<ref>{{Cite journal|last=Deng|first=Li|last2=Tur|first2=Gokhan|last3=He|first3=Xiaodong|last4=Hakkani-Tür|first4=Dilek|date=2012-12-01|title=Use of Kernel Deep Convex Networks and End-To-End Learning for Spoken Language Understanding|url=https://www.microsoft.com/en-us/research/publication/use-of-kernel-deep-convex-networks-and-end-to-end-learning-for-spoken-language-understanding/|journal=Microsoft Research|language=en-US}}</ref> The main idea is to use a kernel machine to approximate a shallow neural net with an infinite number of hidden units, then use [[Deep learning#Deep stacking networks|stacking]] to splice the output of the kernel machine and the raw input in building the next, higher level of the kernel machine. The number of levels in the deep convex network is a hyper-parameter of the overall system, to be determined by cross validation.+
+
− +
− {{Main|Neural architecture search}}+
− Neural architecture search (NAS) uses machine learning to automate the design of ANNs. Various approaches to NAS have designed networks that compare well with hand-designed systems. The basic search algorithm is to propose a candidate model, evaluate it against a dataset and use the results as feedback to teach the NAS network.<ref>{{cite arxiv|last=Zoph|first=Barret|last2=Le|first2=Quoc V.|date=2016-11-04|title=Neural Architecture Search with Reinforcement Learning|eprint=1611.01578|class=cs.LG}}</ref>
− +
+
+
+
+
+
− Using ANNs requires an understanding of their characteristics.+
− +
− +
− +
− ANN capabilities fall within the following broad categories:{{Citation needed|date=June 2017}}+
− * [[Function approximation]], or [[regression analysis]], including [[Time series#Prediction and forecasting|time series prediction]], [[fitness approximation]] and modeling.+
− * [[Statistical classification|Classification]], including [[Pattern recognition|pattern]] and sequence recognition, [[novelty detection]] and sequential decision making.+
− * [[Data processing]], including filtering, clustering, [[blind source separation]] and compression.
− * [[Robotics]], including directing manipulators and [[prosthesis|prostheses]].
− * [[Control engineering|Control]], including [[computer numerical control]].
− ==Applications==+
− Because of their ability to reproduce and model nonlinear processes, ANNs have found many applications in a wide range of disciplines.
− Application areas include [[system identification]] and control (vehicle control, trajectory prediction,<ref>{{cite journal|last1=Zissis|first1=Dimitrios|title=A cloud based architecture capable of perceiving and predicting multiple vessel behaviour|journal=Applied Soft Computing|date=October 2015|volume=35|url=http://www.sciencedirect.com/science/article/pii/S1568494615004329|doi=10.1016/j.asoc.2015.07.002|pages=652–661}}</ref> [[process control]], [[natural resource management]]), [[quantum chemistry]],<ref name="Balabin_2009">{{Cite journal|journal=[[J. Chem. Phys.]] |volume = 131 |issue = 7 |page = 074104 |doi=10.1063/1.3206326 |title=Neural network approach to quantum-chemistry data: Accurate prediction of density functional theory energies |year=2009 |author1=Roman M. Balabin |author2=Ekaterina I. Lomakina |pmid=19708729|bibcode = 2009JChPh.131g4104B }}</ref> game-playing and [[decision making]] ([[backgammon]], [[chess]], [[poker]]), [[pattern recognition]] (radar systems, [[Facial recognition system|face identification]], signal classification,<ref>{{cite journal|last=Sengupta|first=Nandini|author2=Sahidullah, Md|author3=Saha, Goutam|title=Lung sound classification using cepstral-based statistical features|journal=Computers in Biology and Medicine|date=August 2016|volume=75|issue=1|pages=118–129|doi=10.1016/j.compbiomed.2016.05.013|url=http://www.sciencedirect.com/science/article/pii/S0010482516301263}}</ref> object recognition and more), sequence recognition (gesture, speech, handwritten and printed text recognition), [[medical diagnosis]], finance<ref>{{cite journal|last1=French|first1=Jordan|title=The time traveller's CAPM|journal=Investment Analysts Journal|volume=46|issue=2|pages=81–96|doi=10.1080/10293523.2016.1255469|url=http://www.tandfonline.com/doi/abs/10.1080/10293523.2016.1255469|year=2016}}</ref> (e.g. [[algorithmic trading|automated trading systems]]), [[data mining]], visualization, [[machine translation]], social network filtering<ref>{{Cite news|url=https://www.wsj.com/articles/facebook-boosts-a-i-to-block-terrorist-propaganda-1497546000|title=Facebook Boosts A.I. to Block Terrorist Propaganda|last=Schechner|first=Sam|date=2017-06-15|work=Wall Street Journal|access-date=2017-06-16|language=en-US|issn=0099-9660}}</ref> and [[e-mail spam]] filtering.+
+
+
− ANNs have been used to diagnose cancers, including [[lung cancer]],<ref>{{cite web|last=Ganesan|first=N|title=Application of Neural Networks in Diagnosing Cancer Disease Using Demographic Data|url=http://www.ijcaonline.org/journal/number26/pxc387783.pdf|publisher=International Journal of Computer Applications}}</ref> [[prostate cancer]], [[colorectal cancer]]<ref>{{cite web|url=http://www.lcc.uma.es/~jja/recidiva/042.pdf|title=Artificial Neural Networks Applied to Outcome Prediction for Colorectal Cancer Patients in Separate Institutions|last=Bottaci|first=Leonardo|publisher=The Lancet}}</ref> and to distinguish highly invasive cancer cell lines from less invasive lines using only cell shape information.<ref>{{cite journal|last2=Lyons|first2=Samanthe M|last3=Castle|first3=Jordan M|last4=Prasad|first4=Ashok|date=2016|title=Measuring systematic changes in invasive cancer cell shape using Zernike moments|url=http://pubs.rsc.org/en/Content/ArticleLanding/2016/IB/C6IB00100A#!divAbstract|journal=Integrative Biology|volume=8|issue=11|pages=1183–1193|doi=10.1039/C6IB00100A|pmid=27735002|last1=Alizadeh|first1=Elaheh}}</ref><ref>{{cite journal|date=2016|title=Changes in cell shape are correlated with metastatic potential in murine|url=http://bio.biologists.org/content/5/3/289|journal=Biology Open|volume=5|issue=3|pages=289–299|doi=10.1242/bio.013409|last1=Lyons|first1=Samanthe}}</ref>+
+
− ANNs have been used to accelerate reliability analysis of infrastructures subject to natural disasters.<ref>{{cite arxiv|last=Nabian|first=Mohammad Amin|last2=Meidani|first2=Hadi|date=2017-08-28|title=Deep Learning for Accelerated Reliability Analysis of Infrastructure Networks|eprint=1708.08551|class=cs.CE}}</ref><ref>{{Cite journal|last=Nabian|first=Mohammad Amin|last2=Meidani|first2=Hadi|date=2018|title=Accelerating Stochastic Assessment of Post-Earthquake Transportation Network Connectivity via Machine-Learning-Based Surrogates|url=https://trid.trb.org/view/1496617|journal=Transportation Research Board 97th Annual Meeting|volume=|pages=|via=}}</ref>
− ANNs have also been used for building black-box models in [[geoscience]]: [[hydrology]],<ref>{{Cite journal|last=null null|date=2000-04-01|title=Artificial Neural Networks in Hydrology. I: Preliminary Concepts|url=http://ascelibrary.org/doi/abs/10.1061/(ASCE)1084-0699(2000)5:2(115)|journal=Journal of Hydrologic Engineering|volume=5|issue=2|pages=115–123|doi=10.1061/(ASCE)1084-0699(2000)5:2(115)}}</ref><ref>{{Cite journal|last=null null|date=2000-04-01|title=Artificial Neural Networks in Hydrology. II: Hydrologic Applications|url=http://ascelibrary.org/doi/abs/10.1061/(ASCE)1084-0699(2000)5:2(124)|journal=Journal of Hydrologic Engineering|volume=5|issue=2|pages=124–137|doi=10.1061/(ASCE)1084-0699(2000)5:2(124)}}</ref> ocean modelling and [[coastal engineering]],<ref>{{Cite journal|last=Peres|first=D. J.|last2=Iuppa|first2=C.|last3=Cavallaro|first3=L.|last4=Cancelliere|first4=A.|last5=Foti|first5=E.|date=2015-10-01|title=Significant wave height record extension by neural networks and reanalysis wind data|url=http://www.sciencedirect.com/science/article/pii/S1463500315001432|journal=Ocean Modelling|volume=94|pages=128–140|doi=10.1016/j.ocemod.2015.08.002|bibcode=2015OcMod..94..128P}}</ref><ref>{{Cite journal|last=Dwarakish|first=G. S.|last2=Rakshith|first2=Shetty|last3=Natesan|first3=Usha|date=2013|title=Review on Applications of Neural Network in Coastal Engineering|url=http://www.ciitresearch.org/dl/index.php/aiml/article/view/AIML072013007|journal=Artificial Intelligent Systems and Machine Learning|language=English|volume=5|issue=7|pages=324–331}}</ref> and [[geomorphology]],<ref>{{Cite journal|last=Ermini|first=Leonardo|last2=Catani|first2=Filippo|last3=Casagli|first3=Nicola|date=2005-03-01|title=Artificial Neural Networks applied to landslide susceptibility assessment|url=http://www.sciencedirect.com/science/article/pii/S0169555X04002272|journal=Geomorphology|series=Geomorphological hazard and human impact in mountain environments|volume=66|issue=1|pages=327–343|doi=10.1016/j.geomorph.2004.09.025|bibcode=2005Geomo..66..327E}}</ref> are just few examples of this kind.+
+
− ===Types of models===+
− Many types of models are used, defined at different levels of abstraction and modeling different aspects of neural systems. They range from models of the short-term behavior of [[biological neuron models|individual neurons]],<ref>{{cite journal | author=Forrest MD |title=Simulation of alcohol action upon a detailed Purkinje neuron model and a simpler surrogate model that runs >400 times faster |journal= BMC Neuroscience | volume=16 |issue=27 | date=April 2015 |doi=10.1186/s12868-015-0162-6 |url=http://www.biomedcentral.com/1471-2202/16/27 }}</ref> models of how the dynamics of neural circuitry arise from interactions between individual neurons and finally to models of how behavior can arise from abstract neural modules that represent complete subsystems. These include models of the long-term, and short-term plasticity, of neural systems and their relations to learning and memory from the individual neuron to the system level.+
− +
− ===Computational power===+
− The [[multilayer perceptron]] is a universal function approximator, as proven by the [[universal approximation theorem]]. However, the proof is not constructive regarding the number of neurons required, the network topology, the weights and the learning parameters.+
+
− A specific recurrent architecture with rational valued weights (as opposed to full precision [[real number]]-valued weights) has the full power of a [[Universal Turing Machine|universal Turing machine]],<ref>{{Cite journal| title = Turing computability with neural nets | url = http://www.math.rutgers.edu/~sontag/FTPDIR/aml-turing.pdf | year = 1991 | journal = Appl. Math. Lett. | pages = 77–80 | volume = 4 | issue = 6 | last1 = Siegelmann | first1 = H.T. | last2 = Sontag | first2 = E.D. | doi = 10.1016/0893-9659(91)90080-F }}</ref> using a finite number of neurons and standard linear connections. Further, the use of irrational values for weights results in a machine with [[Hypercomputation|super-Turing]] power.<ref>{{cite journal |last1=Balcázar |first1=José |title=Computational Power of Neural Networks: A Kolmogorov Complexity Characterization |journal=Information Theory, IEEE Transactions on |date=Jul 1997 |volume=43 |issue=4 |pages=1175–1183 |doi=10.1109/18.605580 |url=http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=605580&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D605580 |accessdate=3 November 2014|citeseerx=10.1.1.411.7782 }}</ref>+
+
− ===Capacity===+
− Models' "capacity" property roughly corresponds to their ability to model any given function. It is related to the amount of information that can be stored in the network and to the notion of complexity.{{citation needed|date=February 2017}}+
− ===Convergence===+
− Models may not consistently converge on a single solution, firstly because many local minima may exist, depending on the cost function and the model. Secondly, the optimization method used might not guarantee to converge when it begins far from any local minimum. Thirdly, for sufficiently large data or parameters, some methods become impractical. However, for [[cerebellar model articulation controller|CMAC]] neural network, a recursive least squares algorithm was introduced to train it, and this algorithm can be guaranteed to converge in one step.<ref name="Qin1"/>+
−
− ===Generalization and statistics===
− Applications whose goal is to create a system that generalizes well to unseen examples, face the possibility of over-training. This arises in convoluted or over-specified systems when the capacity of the network significantly exceeds the needed free parameters. Two approaches address over-training. The first is to use [[cross-validation (statistics)|cross-validation]] and similar techniques to check for the presence of over-training and optimally select hyperparameters to minimize the generalization error. The second is to use some form of ''[[regularization (mathematics)|regularization]]''. This concept emerges in a probabilistic (Bayesian) framework, where regularization can be performed by selecting a larger prior probability over simpler models; but also in statistical learning theory, where the goal is to minimize over two quantities: the 'empirical risk' and the 'structural risk', which roughly corresponds to the error over the training set and the predicted error in unseen data due to overfitting.
−
− [[File:Synapse deployment.jpg|thumb|right|Confidence analysis of a neural network]]
− Supervised neural networks that use a [[mean squared error]] (MSE) cost function can use formal statistical methods to determine the confidence of the trained model. The MSE on a validation set can be used as an estimate for variance. This value can then be used to calculate the [[confidence interval]] of the output of the network, assuming a [[normal distribution]]. A confidence analysis made this way is statistically valid as long as the output [[probability distribution]] stays the same and the network is not modified.
−
− By assigning a [[softmax activation function]], a generalization of the [[logistic function]], on the output layer of the neural network (or a softmax component in a component-based neural network) for categorical target variables, the outputs can be interpreted as posterior probabilities. This is very useful in classification as it gives a certainty measure on classifications.
−
− The softmax activation function is:
− <section end="theory" />
− +
− +
− A common criticism of neural networks, particularly in robotics, is that they require too much training for real-world operation.{{Citation needed|date=November 2014}} Potential solutions include randomly shuffling training examples, by using a numerical optimization algorithm that does not take too large steps when changing the network connections following an example and by grouping examples in so-called mini-batches. Improving the training efficiency and convergence capability has always been an ongoing research area for neural network. For example, by introducing a recursive least squares algorithm for [[cerebellar model articulation controller|CMAC]] neural network, the training process only takes one step to converge.<ref name="Qin1"/>+
+
+
− A fundamental objection is that they do not reflect how real neurons function. Back propagation is a critical part of most artificial neural networks, although no such mechanism exists in biological neural networks.<ref>{{cite journal | last1 = Crick | first1 = Francis | year = 1989 | title = The recent excitement about neural networks | journal = Nature | volume = 337 | issue = 6203 | pages = 129–132 | doi = 10.1038/337129a0 | url = http://europepmc.org/abstract/med/2911347 | pmid=2911347| bibcode = 1989Natur.337..129C }}</ref> How information is coded by real neurons is not known. [[Sensory neuron|Sensor neurons]] fire [[action potential]]s more frequently with sensor activation and [[muscle cell]]s pull more strongly when their associated [[motor neuron]]s receive action potentials more frequently.<ref>{{cite journal | last1 = Adrian | first1 = Edward D. | year = 1926 | title = The impulses produced by sensory nerve endings | journal = The Journal of Physiology | volume = 61 | issue = 1 | pages = 49–72 | doi = 10.1113/jphysiol.1926.sp002273 | pmid = 16993776 | pmc = 1514809 | url = http://onlinelibrary.wiley.com/doi/10.1113/jphysiol.1926.sp002273/full }}</ref> Other than the case of relaying information from a sensor neuron to a motor neuron, almost nothing of the principles of how information is handled by biological neural networks is known.+
− The motivation behind ANNs is not necessarily to strictly replicate neural function, but to use biological neural networks as an inspiration. A central claim of ANNs is therefore that it embodies some new and powerful general principle for processing information. Unfortunately, these general principles are ill-defined. It is often claimed that they are [[Emergent properties|emergent]] from the network itself. This allows simple statistical association (the basic function of artificial neural networks) to be described as learning or recognition. [[Alexander Dewdney]] commented that, as a result, artificial neural networks have a "something-for-nothing quality, one that imparts a peculiar aura of laziness and a distinct lack of curiosity about just how good these computing systems are. No human hand (or mind) intervenes; solutions are found as if by magic; and no one, it seems, has learned anything".<ref>{{cite book|url={{google books |plainurl=y |id=KcHaAAAAMAAJ|page=82}}|title=Yes, we have no neutrons: an eye-opening tour through the twists and turns of bad science|last=Dewdney|first=A. K.|date=1 April 1997|publisher=Wiley|year=|isbn=978-0-471-10806-1|location=|pages=82}}</ref>+
+
− Biological brains use both shallow and deep circuits as reported by brain anatomy,<ref name="VanEssen1991">D. J. Felleman and D. C. Van Essen, "[http://cercor.oxfordjournals.org/content/1/1/1.1.full.pdf+html Distributed hierarchical processing in the primate cerebral cortex]," ''Cerebral Cortex'', 1, pp. 1–47, 1991.</ref> displaying a wide variety of invariance. Weng<ref name="Weng2012">J. Weng, "[https://www.amazon.com/Natural-Artificial-Intelligence-Introduction-Computational/dp/0985875720 Natural and Artificial Intelligence: Introduction to Computational Brain-Mind]," BMI Press, {{ISBN|978-0985875725}}, 2012.</ref> argued that the brain self-wires largely according to signal statistics and therefore, a serial cascade cannot catch all major statistical dependencies.+
− ===Hardware issues===+
− Large and effective neural networks require considerable computing resources.<ref name=":0">{{cite journal|last1=Edwards|first1=Chris|title=Growing pains for deep learning|journal=Communications of the ACM|date=25 June 2015|volume=58|issue=7|pages=14–16|doi=10.1145/2771283}}</ref> While the brain has hardware tailored to the task of processing signals through a [[Graph (discrete mathematics)|graph]] of neurons, simulating even a simplified neuron on [[von Neumann architecture]] may compel a neural network designer to fill many millions of [[database]] rows for its connections{{snd}} which can consume vast amounts of [[Random-access memory|memory]] and storage. Furthermore, the designer often needs to transmit signals through many of these connections and their associated neurons{{snd}} which must often be matched with enormous [[Central processing unit|CPU]] processing power and time.
− [[Jürgen Schmidhuber|Schmidhuber]] notes that the resurgence of neural networks in the twenty-first century is largely attributable to advances in hardware: from 1991 to 2015, computing power, especially as delivered by [[General-purpose computing on graphics processing units|GPGPUs]] (on [[Graphics processing unit|GPUs]]), has increased around a million-fold, making the standard backpropagation algorithm feasible for training networks that are several layers deeper than before.<ref>{{cite journal |last=Schmidhuber |first=Jürgen |title=Deep learning in neural networks: An overview |journal=Neural Networks |volume=61 |year=2015 |pages=85–117 |arxiv=1404.7828 |doi=10.1016/j.neunet.2014.09.003|pmid=25462637 }}</ref> The use of parallel GPUs can reduce training times from months to days.{{r|:0}}+
+
+
− [[Neuromorphic engineering]] addresses the hardware difficulty directly, by constructing non-von-Neumann chips to directly implement neural networks in circuitry. Another chip optimized for neural network processing is called a [[Tensor Processing Unit]], or TPU.<ref>{{cite news |url=https://www.wired.com/2016/05/google-tpu-custom-chips/ |author=Cade Metz |newspaper=Wired |date=May 18, 2016 |title=Google Built Its Very Own Chips to Power Its AI Bots}}</ref>+
− +
− ===Practical counterexamples to criticisms===
− Arguments against Dewdney's position are that neural networks have been successfully used to solve many complex and diverse tasks, ranging from autonomously flying aircraft<ref>[http://www.nasa.gov/centers/dryden/news/NewsReleases/2003/03-49.html NASA – Dryden Flight Research Center – News Room: News Releases: NASA NEURAL NETWORK PROJECT PASSES MILESTONE]. Nasa.gov. Retrieved on 2013-11-20.</ref> to detecting credit card fraud to mastering the game of [[Go (game)|Go]].
−
− Technology writer Roger Bridgman commented:
−
−
− In spite of his emphatic declaration that science is not technology, Dewdney seems here to pillory neural nets as bad science when most of those devising them are just trying to be good engineers. An unreadable table that a useful machine could read would still be well worth having.<ref>[http://members.fortunecity.com/templarseries/popper.html Roger Bridgman's defence of neural networks]</ref>
− +
− Although it is true that analyzing what has been learned by an artificial neural network is difficult, it is much easier to do so than to analyze what has been learned by a biological neural network. Furthermore, researchers involved in exploring learning algorithms for neural networks are gradually uncovering general principles that allow a learning machine to be successful. For example, local vs non-local learning and shallow vs deep architecture.<ref>{{cite web|url=http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/4|title=Scaling Learning Algorithms towards {AI} – LISA – Publications – Aigaion 2.0|publisher=}}</ref>
− +
+
+
+
强化学习范式中的任务是控制问题,【游戏】和其他序列决策任务。
==== 收敛递归学习算法(Convergent recursive learning algorithm) ====
这是一种特别为【小脑模型关节控制器】(CMAC)神经网络设计的学习方法。2004年，一种递推最小二乘法被引入用于在线训练【CMAC】神经网络。这个算法可以一步收敛，并根据任何新输入的数据在一步内更新所有权重。最初，这个算法有''O''(''N''<sup>3</sup>)的【计算复杂度】。基于【QR分解】，这种递推学习算法被简化为''O''(''N'')。
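The recursive-least-squares idea behind this one-step update can be illustrated with a generic RLS regressor; the sketch below is a minimal NumPy version under simple assumptions (a plain linear model, a toy target), not the cited CMAC-specific algorithm, and all names are illustrative.
<syntaxhighlight lang="python">
import numpy as np

def rls_update(w, P, x, d, lam=1.0):
    """One recursive-least-squares step: update the weights w and the
    inverse correlation matrix P for input vector x and desired output d."""
    x = x.reshape(-1, 1)
    Px = P @ x
    k = Px / (lam + x.T @ Px)          # gain vector
    e = d - float(w.T @ x)             # a-priori error
    w = w + k * e                      # weight update
    P = (P - k @ Px.T) / lam           # inverse correlation update
    return w, P

# toy usage: fit y = 2*x1 - 3*x2 online, one sample at a time
rng = np.random.default_rng(0)
w = np.zeros((2, 1))
P = np.eye(2) * 1e3                    # large initial P -> weak prior
for _ in range(200):
    x = rng.normal(size=2)
    d = 2 * x[0] - 3 * x[1]
    w, P = rls_update(w, P, x, d)
print(w.ravel())                       # approximately [ 2. -3.]
</syntaxhighlight>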
===学习算法===
训练一个神经网络模型本质上意味着从一组允许的模型中选择一个最小化损失函数的模型（或者，在贝叶斯框架中，确定允许模型集合上的一个分布）。可以使用多种算法训练神经网络模型；它们中的大多数可以被看成【优化】理论和【统计性估计】的直接应用。
大多采用【梯度下降】的某种形式,使用反向传播计算实际梯度。这通过简单的对网络参数取损失函数梯度然后向【梯度相关】方向改变这些参数完成。
反向传播训练算法有这三类:
* 【最速下降】（带可变学习速率和【动量】，【弹性反向传播】）；
* 拟牛顿（Broyden-Fletcher-Goldfarb-Shanno，【单步割线】）；
* 【Levenberg-Marquardt】和【共轭梯度】(Fletcher-Reeves 更新, Polak-Ribiére 更新, Powell-Beale 重启,标度共轭梯度)。
【进化法】,【基因表达式编程】,【模拟退火】,【期望最大化】,【非参数方法】和【粒子群算法】是训练神经网络的其他方法。
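As a concrete illustration of the gradient-descent-with-momentum variant listed above, here is a minimal NumPy sketch; the learning rate, momentum coefficient and toy quadratic objective are illustrative choices only.
<syntaxhighlight lang="python">
import numpy as np

def sgd_momentum(grad_fn, w, lr=0.1, mu=0.9, steps=100):
    """Plain gradient descent with a momentum (velocity) term."""
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        v = mu * v - lr * g        # accumulate a decaying sum of gradients
        w = w + v                  # move along the velocity
    return w

# toy usage: minimise f(w) = ||w - 3||^2, whose gradient is 2*(w - 3)
w_star = sgd_momentum(lambda w: 2 * (w - 3.0), np.zeros(4))
print(w_star)                      # approximately [3. 3. 3. 3.]
</syntaxhighlight>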
== 变体 ==
=== 数据处理的群方法(Group method of data handling) ===
数据处理的群方法(GMDH) 突出了全自动结构和参数化模型优化。结点激活函数是允许加法和乘法操作的【Kolmogorov】-Gabor多项式。它使用八层的深度前馈多层感知机,是一个逐层增长的【监督学习】网络,其中每层使用【回归分析】训练。使用验证集检测无用的项,通过【正则化】消除。结果网络的尺寸和深度取决于任务。
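A rough sketch of the layer-by-layer GMDH idea under simple assumptions: each candidate unit is a quadratic (Kolmogorov–Gabor style) polynomial of two inputs fitted by least squares, candidates are ranked on a validation split, and the best outputs feed the next layer. The function names, the number of kept units and the stopping rule are simplifications for illustration, not the full GMDH procedure.
<syntaxhighlight lang="python">
import numpy as np
from itertools import combinations

def poly_features(a, b):
    """Kolmogorov-Gabor style quadratic terms of two inputs."""
    return np.column_stack([np.ones_like(a), a, b, a * b, a**2, b**2])

def gmdh_layer(X_tr, y_tr, X_va, y_va, keep=4):
    """Fit one layer: a quadratic model per input pair, ranked by validation
    error; the best `keep` model outputs feed the next layer."""
    candidates = []
    for i, j in combinations(range(X_tr.shape[1]), 2):
        F = poly_features(X_tr[:, i], X_tr[:, j])
        coef, *_ = np.linalg.lstsq(F, y_tr, rcond=None)
        pred_va = poly_features(X_va[:, i], X_va[:, j]) @ coef
        candidates.append((np.mean((pred_va - y_va) ** 2), i, j, coef))
    candidates.sort(key=lambda c: c[0])
    best = candidates[:keep]
    new_tr = np.column_stack([poly_features(X_tr[:, i], X_tr[:, j]) @ c
                              for _, i, j, c in best])
    new_va = np.column_stack([poly_features(X_va[:, i], X_va[:, j]) @ c
                              for _, i, j, c in best])
    return new_tr, new_va, best[0][0]   # also report best validation MSE

# toy usage: two stacked layers on a nonlinear target
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = X[:, 0] * X[:, 1] + np.sin(X[:, 2])
Xtr, Xva, ytr, yva = X[:300], X[300:], y[:300], y[300:]
l1_tr, l1_va, e1 = gmdh_layer(Xtr, ytr, Xva, yva)
l2_tr, l2_va, e2 = gmdh_layer(l1_tr, ytr, l1_va, yva)
print(e1, e2)   # validation error typically drops, or growth is stopped
</syntaxhighlight>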
=== 卷积神经网络(Convolutional neural networks) ===
卷积神经网络 (CNN) 是一类深度前馈网络，由一或多层【卷积】层和位于其上的全连接层（与典型ANN中的匹配）组成。它使用共享（绑定）权重和池化层。特别地，最大池化通常通过Fukushima的卷积结构组织。这种结构允许CNN利用输入数据的2D结构。
CNN适合处理视觉和其他二维数据,它们在图像和语音应用中展示出了优秀的结果。它们可以被标准反向传播训练。CNN比其他普通的深度前馈神经网络更容易训练且有更少的需要估计的参数。计算机视觉中应用的例子包括【DeepDream】和【机器人导航】。
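The shared-weight convolution plus max-pooling pipeline described above can be shown with plain NumPy; this is a minimal forward pass only (no training), and the edge-detection kernel is just an illustrative filter.
<syntaxhighlight lang="python">
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation) with one shared kernel,
    i.e. the same weights are applied at every image location."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling."""
    H, W = x.shape
    H, W = H - H % size, W - W % size
    x = x[:H, :W].reshape(H // size, size, W // size, size)
    return x.max(axis=(1, 3))

# toy usage: a vertical-edge detector followed by 2x2 max pooling
img = np.zeros((8, 8)); img[:, 4:] = 1.0
edge_kernel = np.array([[1., 0., -1.]] * 3)
feature_map = np.maximum(conv2d(img, edge_kernel), 0)   # ReLU
print(max_pool(feature_map).shape)                      # (3, 3)
</syntaxhighlight>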
===长短期记忆( Long short-term memory) ===
长短期记忆 (LSTM) 网络是避免了【梯度消失问题】的循环网络。LSTM通常由被称为“遗忘门”的循环门增强。LSTM网络避免了反向传播误差的消失或爆炸。误差可以通过在空间中展开的LSTM里无限数量的虚层反向回流。也就是说，LSTM可以学习“非常深的学习”任务，这些任务需要记住上千甚至上百万个离散时间步之前的事件。针对特定问题的类LSTM拓扑结构可以通过进化得到，LSTM能处理长延迟和混合高低频成分的信号。
大量LSTM RNN使用联结主义时间分类(CTC)训练：给定相应的输入序列，CTC寻找一个最大化训练集中标记序列概率的RNN权重矩阵。CTC同时实现了对齐和识别。
2003年，LSTM开始在传统语音识别器中具有竞争力。2007年，与CTC的结合在语音数据上取得了第一个良好的结果。2009年，一个CTC训练的LSTM赢得了几项连笔【手写识别】比赛，成为第一个赢得模式识别比赛的RNN。2014年，【百度】使用CTC训练的RNN刷新了Switchboard Hub5'00语音识别基准数据集上的表现，而没有使用传统语音处理方法。LSTM也改进了大词汇量语音识别、文本到语音合成（用于谷歌安卓），以及逼真的“会说话的头像”。2015年，谷歌的语音识别通过CTC训练的LSTM获得了49%的性能提升。
LSTM在【自然语言处理】中变得受欢迎。不像之前基于【隐式马尔科夫模型】和相似概念的模型,LSTM可以学习识别【上下文有关语言】。LSTM提高了机器翻译,【语言建模】和多语言语言处理。与CNN结合的LSTM提高了自动图像字幕标记。
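For reference, a single LSTM time step with input, forget and output gates can be written in a few lines of NumPy. This is a minimal forward-pass sketch with randomly initialised weights, not a trained model, and the stacked gate layout in `W` is one common convention rather than the only one.
<syntaxhighlight lang="python">
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step.  W maps [h_prev, x] to the four gate
    pre-activations (input, forget, output, candidate); b is their bias."""
    z = W @ np.concatenate([h_prev, x]) + b
    H = h_prev.size
    i = sigmoid(z[0 * H:1 * H])          # input gate
    f = sigmoid(z[1 * H:2 * H])          # forget gate: lets state/errors persist
    o = sigmoid(z[2 * H:3 * H])          # output gate
    g = np.tanh(z[3 * H:4 * H])          # candidate cell update
    c = f * c_prev + i * g               # additive cell-state update
    h = o * np.tanh(c)                   # hidden output
    return h, c

# toy usage: run a random LSTM over a sequence of 100 time steps
rng = np.random.default_rng(0)
H, D = 8, 3
W = rng.normal(scale=0.1, size=(4 * H, H + D))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for t in range(100):
    h, c = lstm_step(rng.normal(size=D), h, c, W, b)
print(h.shape, c.shape)    # (8,) (8,)
</syntaxhighlight>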
=== 深度储蓄池计算(Deep reservoir computing) ===
深度储蓄池计算和深度回声状态网络 (deepESNs)为高效训练的分层处理时序数据的模型提供了一个框架,同时使RNN的层次化构成的内在作用能够探查。
=== 深度置信网络(Deep belief networks) ===
【File:Restricted_Boltzmann_machine.svg|thumb|一个带有全连接可见和隐藏单元的【受限玻尔兹曼机】 (RBM) 。注意没有隐藏-隐藏和可见-可见连接。】
一个深度置信网络(DBN)是一个概率的【生成模型】,它由多层隐藏层组成。可以被认为是一个组成每一层的简单学习模块的【组合】。
一个DBN可以通过将学习到的DBN权重用作初始的DNN权重，来生成式地预训练一个DNN。
然后，反向传播或其他判别式算法就可以微调这些权重。这在训练数据有限时特别有用，因为初始化很差的权重会显著阻碍模型表现。这些预训练得到的权重所处的权重空间区域，比随机选择的权重更接近最优权重。这既能提高模型表现，又能加快微调阶段的收敛。
===大内存和检索神经网络===
大内存和检索神经网络(LAMSTAR)是多层快速深度学习神经网络,可以同时使用许多滤波。这些滤波可能非线性,随机,逻辑,【非固定】甚至非解析。它们是生物学动机的并且可以连续学习。
LAMSTAR神经网络可以作为在空间或时间或二者兼具的域内的动力神经网络。它的速度由【赫布(Hebbian)】连接权重提供,它整合多种并且通常不同的滤波(预处理函数)到它的与给定学习任务相关的很多层和函数中。这很大程度模拟了整合多种预处理器(【耳蜗】,【视网膜】等)和皮层(听觉,视觉等)和它们的多个域的生物学习。通过使用抑制,相关,它的深度学习能力大大增强,甚至当在任务中时,处理不完整数据的能力或“丢失的”神经元或层的能力也显著增强。由于它的连接权重,它是完全透明的。这些连接权重允许动态地决定更新和去除,并且帮助任务相关的层,滤波或单独神经元的排列。
LAMSTAR被应用于多个领域,包括医药和金融预测,在未知噪音下嘈杂语音的适应性滤波,静态图像识别,视频图像识别,软件安全和非线性系统的适应性控制。LAMSTAR比基于【ReLU】函数滤波和最大池化的CNN在20个对比研究中有明显更快的学习速度,和稍低的错误率。
这些应用展示了如何深入挖掘对浅层学习网络和人类感官而言隐藏着的数据特征，例如预测【睡眠呼吸中止症】、怀孕早期通过放在母亲腹部皮肤表面的电极记录的胎儿心电图、金融预测或嘈杂语音的盲过滤等案例。
LAMSTAR在1996被提议(【US Patent|5920852 A】),然后从1997到2002被Graupe和Kordylewski深入开发。一个更改的版本称为LAMSTAR2,被Schneider 和 Graupe在2008开发。
=== 叠加(去噪)自动编码器(Stacked (de-noising) auto-encoders) ===
【自动编码器】的想法由“好的”表示的概念启发。例如对于一个【分类器】,一个好的表示可以被定义为一个产生了更好表现的分类器。
【编码器】是一个确定映射 <math>f_\theta</math> ,它将输入向量''''' x'''''转化为隐藏表示 '''''y''''', 其中 <math>\theta = \{\boldsymbol{W}, b\}</math>, <math>\boldsymbol{W}</math> 是 权重矩阵, '''b''' 是一个补偿向量(偏置)。 【解码器】反映射隐藏表示 '''y'''到重建的输入 '''''z''''' 通过 <math>g_\theta</math>。整个自动编码的过程是把这个重建输入与原始的作比较,尽量最小化误差使得重建值和原始尽可能的靠近 。
在叠加去噪自动编码器中，部分损坏(corrupted)的输出被清理（去噪）。这个想法由Vincent等人在2010年提出，采用了一种特殊的“好的表示”的方法：一个好的表示是可以从损坏的输入中【鲁棒地】得到、并且对恢复相应干净输入有用的表示。这个定义隐含了下面的想法：
* 更高层的表征相对而言对输入的损坏更稳定、更鲁棒；
* 选出对输入分布表征有用的特征是必要的。
这个算法通过<math>q_D(\tilde{\boldsymbol{x}}|\boldsymbol{x})</math>从 <math>\boldsymbol{x}</math> 到<math>\tilde{\boldsymbol{x}}</math> 的随机映射开始,这是【corrupting】步。然后【corrupted】输入 <math>\tilde{\boldsymbol{x}}</math> 传过基本自动编码过程,并被映射到隐含表示<math>\boldsymbol{y} = f_\theta(\tilde{\boldsymbol{x}}) = s(\boldsymbol{W}\tilde{\boldsymbol{x}}+b)</math>。从这个隐含表示中,我们可以重建<math>\boldsymbol{z} = g_\theta(\boldsymbol{y})</math>。在最后一步,一个最小化算法运行以使 '''''z'''''尽可能和【uncorrupted】输入<math>\boldsymbol{x}</math>近。重建误差<math>L_H(\boldsymbol{x},\boldsymbol{z})</math>可以是带有双弯曲仿射解码器的【交叉熵】损失,或者【仿射】解码器的平方误差。
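A minimal sketch of one denoising-autoencoder pass under simple assumptions: masking noise for the corruption step <math>q_D</math>, a sigmoid encoder, a tied-weight sigmoid decoder, and a squared reconstruction error. Names such as `dae_forward` and the sizes are illustrative only.
<syntaxhighlight lang="python">
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def corrupt(x, rate, rng):
    """q_D(x_tilde | x): randomly zero out a fraction of the inputs."""
    return x * (rng.random(x.shape) >= rate)

def dae_forward(x, W, b, b_prime, rate, rng):
    """Corrupt x, encode it to y = s(W x_tilde + b), decode it to z, and
    return the reconstruction error against the *uncorrupted* x."""
    x_tilde = corrupt(x, rate, rng)
    y = sigmoid(W @ x_tilde + b)              # hidden representation
    z = sigmoid(W.T @ y + b_prime)            # tied-weight decoder
    loss = np.mean((z - x) ** 2)              # squared reconstruction error
    return y, z, loss

# toy usage
rng = np.random.default_rng(0)
d, h = 20, 8
W = rng.normal(scale=0.1, size=(h, d))
b, b_prime = np.zeros(h), np.zeros(d)
x = rng.random(d)
y, z, loss = dae_forward(x, W, b, b_prime, rate=0.3, rng=rng)
print(y.shape, z.shape, round(loss, 3))
</syntaxhighlight>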
为了做出一个深度结构,自动编码器栈。一旦第一个去噪自动编码器的编码函数<math>f_\theta</math>被学习并且用于改善输入(差的输入),第二级可以被训练。
一旦叠加自动编码器被训练,它的输出可以被用作【监督学习】算法,如【支持向量机】分类器或一个多分类【逻辑回归】的输入。
===深度叠加网络( Deep stacking networks )===
深度叠加网络 (DSN)(深度凸网络)是基于多块的简化神经网络模块的层级。在2011被Deng和Dong引入。它用带【闭型解】的【凸优化】表达学习,强调机制与【层叠泛化】的相似。 每个DSN块是一个容易被【监督】式自我训练的简单模块,不需要整个块的反向传播。
每块由一个简化的带单隐层的【多层感知机】(MLP)组成。隐藏层 '''''h''''' 有逻辑【双弯曲的】【单元】,输出层有线性单元。这些层之间的连接用权重矩阵'''''U;'''''表示,输入到隐藏层连接有权重矩阵 '''''W'''''。目标向量'''''t''''' 形成矩阵 '''''T'''''的列, 输入数据向量 '''''x'''''形成矩阵 '''''X.''''' 的列。隐藏单元的矩阵是<math>\boldsymbol{H} = \sigma(\boldsymbol{W}^T\boldsymbol{X})</math>. 。模块按顺序训练,因此底层的权重 '''''W''''' 在每一阶段已知。函数执行对应元素的【逻辑双弯曲】操作。每块估计同一个最终标记类 ''y'',这个估计被原始输入'''''X''''' 串级起来,形成下一个块的扩展输入。因此第一块的输入只包含原始输入,而下游的块输入加上了前驱块的输出。然后学习上层权重矩阵 '''''U''''' ,给定网络中其他权重可以被表达为一个凸优化问题:
: <math>\min_{U^T} f = ||\boldsymbol{U}^T \boldsymbol{H} - \boldsymbol{T}||^2_F,</math>
,它有闭型解。
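The convex sub-problem for the upper weights '''U''' shown above is an ordinary least-squares problem; a minimal NumPy sketch follows, where the lower-layer weights are random purely for illustration (in a real DSN they would come from training or from the previous block).
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n, d, h, c = 200, 10, 32, 3            # samples, inputs, hidden units, classes

X = rng.normal(size=(d, n))            # columns are input vectors
T = rng.normal(size=(c, n))            # columns are target vectors
W = rng.normal(scale=0.3, size=(d, h)) # lower-layer weights (fixed here)

H = 1.0 / (1.0 + np.exp(-(W.T @ X)))   # hidden activations, sigma(W^T X)

# Closed-form solution of  min_U || U^T H - T ||_F^2  (a least-squares problem)
U = np.linalg.lstsq(H.T, T.T, rcond=None)[0]   # solves H^T U = T^T for U

print(np.linalg.norm(U.T @ H - T))     # residual of the convex sub-problem
</syntaxhighlight>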
不像其他如DBN的深度结构,它的目标不是找到转化的【特征】表示。这种层级的结构使并行学习更简单了,正如批处理模式优化问题。在完全【判别任务】中,DSN比传统的【深度置信网络】(DBN)表现更好。
=== 张量深度叠加网络(Tensor deep stacking networks) ===
这个结构是 DSN 的延伸.。它提供了两个重要的改善:使用来自【协方差】统计的更高序的信息,并且将低层【非凸问题】转化为一个更高层的凸子问题。TDSN在【双线性映射】中,通过一个第三序的【张量】,从预测同一层的两个不同隐藏单元集合使用协方差统计。
在传统DNN中,并行性和可扩展性不被认为是严重的。DSN和TDSN中所有的学习使用批处理模式, 允许并行化。并行化允许放大这种设计到更大(更深)的结构和数据集。
基本结构适用于多种任务如【分类】和【回归】。
=== 钉板受限玻尔兹曼机(Spike-and-slab RBMs) ===
深度学习有带【实值】输入的需要,如在高斯受限玻尔兹曼机中一样,引出了“钉板”【受限玻尔兹曼机】,它模拟带严格【二进制】【潜变量】的连续值输入。与基本【RBM】和它的变体一样,钉板【RBM】是【二分图】,好像GRBM一样,可见单元(输入)是实值的。
区别在于隐藏层：每个隐藏单元都有一个二进制的尖峰(spike)变量和一个实值的平板(slab)变量。spike是位于零处的一个离散【概率质量】，slab是连续域上的一个【概率密度】，它们的混合构成了【先验】。
ss【RBM】的一个扩展是µ-ss【RBM】，它通过【能量函数】中的附加项提供了额外的建模能力。这些项之一使模型可以在给定观测值的情况下，通过【边际化出】slab变量来构造spike变量的【条件分布】。
=== 混合层级深度模型(Compound hierarchical-deep models) ===
混合层级深度模型构成了带非参数【贝叶斯模型】的深度网络。【特征】可以使用像DBN,深度自动编码器,卷积变体,ssRAM,深度编码网络,带稀疏特征学习的DBN,RNN,条件DBN,去噪自动编码器的深度结构学习 。这提供了更好的表示,允许更快的学习和高维数据下更精确的分类。然而,这些结果在学习带少示例的异常类时表现很差,因为所有的网络单元都参与表示输入(分布式表征)并且必须一起被调整(高【自由度】)。限制自由度减少了要学习的参数数量,使从新的例子中的新的类学习更容易。【层次贝叶斯模型】允许从少量示例中学习,例如计算机视觉,【统计学】 和认知科学。
混合HD结构的目的是整合HB和深度网络的特征。混合HDP-DBM结构是把作为层级模型的【层级狄利克雷过程】与DBM结构合并。这是完全的【生成模型】，从流经模型各层的抽象概念中生成，能够合成在新类别中看起来“合理的”自然的新例子。所有的层级通过最大化一个共同的【对数概率分数】被共同学习。
在有三层隐藏层的DBM中,可见输入'''{{mvar|ν}}'''的概率是 :
: <math>p(\boldsymbol{\nu}, \psi) = \frac{1}{Z}\sum_h e^{\sum_{ij}W_{ij}^{(1)}\nu_i h_j^1 + \sum_{jl}W_{jl}^{(2)}h_j^{1}h_l^{2}+\sum_{lm}W_{lm}^{(3)}h_l^{2}h_m^{3}},</math>
其中 <math>\boldsymbol{h} = \{\boldsymbol{h}^{(1)}, \boldsymbol{h}^{(2)}, \boldsymbol{h}^{(3)} \}</math> 是隐藏单元的集合, <math>\psi = \{\boldsymbol{W}^{(1)}, \boldsymbol{W}^{(2)}, \boldsymbol{W}^{(3)} \} </math> 是模型参数, 代表可见-隐藏和隐藏-隐藏对称交互作用项。
一个学习后的DBM模型是一个定义联合分布的无向模型<math>P(\nu, h^1, h^2, h^3)</math>. 一种表达学到的东西的方式是【条件模型】 <math>P(\nu, h^1, h^2|h^3)</math> 和一个先验项 <math>P(h^3)</math>.
这里 <math>P(\nu, h^1, h^2|h^3)</math> 代表一个条件DBM网络,它可以被看成两层DBM,但带有<math>h^3</math>状态给出的偏置项 :
: <math>P(\nu, h^1, h^2|h^3) = \frac{1}{Z(\psi, h^3)}e^{\sum_{ij}W_{ij}^{(1)}\nu_i h_j^1 + \sum_{jl}W_{jl}^{(2)}h_j^{1}h_l^{2}+\sum_{lm}W_{lm}^{(3)}h_l^{2}h_m^{3}}.</math>
=== 深度预测编码网络(Deep predictive coding networks) ===
深度预测编码网络 (DPCN)是一个【预测】编码体系,它使用自顶向下信息,经验为主地调整自底向上【推理】过程需要的先验,通过一个深度局部连接的【生成模型】 。这通过使用线性动态模型,从不同时间的观测值提取稀疏【特征】工作。然后一个池化策略被用于学习不变的特征表示。这些单元组成一种【贪心】按层间【无监督学习】训练的深度结构 。这些层构成一种【马尔科夫链】因而任何层的状态只依赖前面和后面的层。
DPCN通过使用自顶向下方法用顶层的信息和过去状态的空间依赖预测层的表征。
DPCN可以被扩展形成一个【卷积网络】。
=== 带单独记忆结构的网络(Networks with separate memory structures) ===
使用ANN整合外部记忆可以追溯到关于分布表征和【Kohonen】的【自组织映射】的早期研究。例如, 在【稀疏分布式记忆】或【层级空间记忆】中,神经网络编码的模式被用于【可寻址内容的记忆】的地址,使用“神经元”本质上作为地址 【编码器】和【解码器】。 然而早期这种记忆的控制器不可微。
====LSTM相关的可微记忆结构(LSTM-related differentiable memory structures) ====
除了【长短期记忆】(LSTM), 其他方法也在循环函数中加入可微记忆,例如:
* 交替记忆网络的可微的推和弹动作,称为神经栈机器
* 记忆网络，其中控制网络的外部可微存储位于另一个网络的快速权重中。
* LSTM遗忘门
* 带用于寻址和在可微样式(内部存储)快速操作RNN自身权重的特殊输出单元的自我参照的RNN。
* 学习带无界记忆的转换。
===== 神经图灵机(Neural Turing machines) =====
神经图灵机将LSTM网络与外部记忆资源结合,这样他们可以通过注意过程相互影响。这种组合系统和【图灵机】相似但是端到端可微,允许使用【梯度下降】有效训练 。初步结果表明神经图灵机可以推断简单算法,如复制,排序和从输入输出例子的联想回忆。
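One ingredient of the attentional interaction described above is content-based addressing. Below is a minimal sketch, assuming cosine-similarity scores sharpened by a scalar β and normalised with a softmax; other NTM components (location-based shifts, write heads, the LSTM controller) are deliberately omitted, and all names are illustrative.
<syntaxhighlight lang="python">
import numpy as np

def content_addressing(memory, key, beta):
    """NTM-style content-based read weights: a softmax over the cosine
    similarity between a key vector and every memory row, sharpened by beta."""
    sim = memory @ key / (np.linalg.norm(memory, axis=1)
                          * np.linalg.norm(key) + 1e-8)
    e = np.exp(beta * sim)
    w = e / e.sum()                      # attention weights over memory rows
    return w, w @ memory                 # weights and the blended read vector

# toy usage: 16 memory slots of width 8
rng = np.random.default_rng(0)
M = rng.normal(size=(16, 8))
key = M[3] + 0.05 * rng.normal(size=8)   # noisy copy of slot 3
w, read = content_addressing(M, key, beta=5.0)
print(w.argmax())                        # attention concentrates on slot 3
</syntaxhighlight>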
【可微神经计算机】(DNC)是一个NTM的延伸。他们在序列处理任务中表现超过神经图灵机,【长短期记忆】系统和记忆网络。
==== 语义哈希(Semantic hashing )====
直接表示过去经验、【使用相似经验形成局部模型】的方法通常称为【最近邻】或【k最近邻】方法。深度学习在语义哈希中十分有用，其中一个深度【图模型】对从一个大的文档集中获取的字数向量建模。文档被映射到内存地址，使得语义相似的文档位于临近的地址。与查询文档相似的文档可以通过访问所有与查询文档地址只差几位的地址找到。不像在1000位地址上操作的【稀疏分布记忆】，语义哈希在常见计算机结构的32或64位地址上工作。
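A minimal sketch of the retrieval mechanics only: documents get short binary address codes and neighbours are found within a small Hamming radius. The random projection producing the codes is merely a stand-in for the learned deep model described above, and the corpus is synthetic.
<syntaxhighlight lang="python">
import numpy as np

def binary_codes(doc_vectors, projection):
    """Map word-count vectors to short binary address codes.  A real system
    would learn this mapping; a random projection only shows the mechanics."""
    return (doc_vectors @ projection > 0).astype(np.uint8)

def hamming_neighbours(codes, query_code, radius=4):
    """Indices of documents whose address differs in at most `radius` bits."""
    dist = np.count_nonzero(codes != query_code, axis=1)
    return np.where(dist <= radius)[0]

rng = np.random.default_rng(0)
docs = rng.poisson(1.0, size=(1000, 500)).astype(float)   # fake word counts
proj = rng.normal(size=(500, 32))                         # 32-bit addresses
codes = binary_codes(docs, proj)

query = docs[42] + rng.poisson(0.05, size=500)            # perturbed doc 42
query_code = binary_codes(query[None, :], proj)[0]
dist42 = np.count_nonzero(codes[42] != query_code)
hits = hamming_neighbours(codes, query_code)
print(dist42, len(hits))   # doc 42's address typically differs by only a few bits
</syntaxhighlight>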
==== 记忆网络(Memory networks) ====
记忆网络是神经网络结合【长期记忆】的另一个扩展。长期记忆可以被读写，目的是用来预测。这些模型已被应用于【问题回答】(QA)，其中长期记忆有效地作为（动态）知识库，输出是文本回应。一个来自UCLA萨穆埃利工程学院的电子和计算机工程师团队制造了一种物理的人工神经网络，它可以以实际光速分析大量数据并识别物体。
==== 指针网络(Pointer networks) ====
深度神经网络可能通过在维持可训练性的同时,加深和减少参数改进。当训练十分深(例如一百万层)神经网络可能不可行,类【CPU】结构如指针网络和神经随机访问机器通过使用外部【随机访问内存】和其他属于【计算机组成】的组件,如【寄存器】,【ALU】和【指针】解决了这个限制。这种系统在储存在记忆单元和寄存器中的【概率分布】向量上操作。这样,模型是全可微并且端到端训练的。这些模型的关键特点是它们的深度,它们短期记忆的大小和参数的数量可以独立切换——不像类似LSTM的模型,它们的参数数量随内存大小二次增长。
==== 编码解码网络(Encoder–decoder networks )====
编码解码框架是基于从高度【结构化】输入到高度结构化输出的映射的神经网络。这种方法在【机器翻译】的背景下被提出,它的输入和输出是使用两种自然语言写成的句子。在这个工作中,LSTM RNN或CNN被用作编码机,来总结源语句,这个总结被条件RNN【语言模型】解码来产生翻译。这些系统共享建立的模块:门限RNN,CNN,和训练的注意机制。
=== 多层核机器(Multilayer kernel machine) ===
多层核机器 (MKM) 是通过迭代应用弱非线性核来学习高度非线性函数的方法。它们使用【核主成分分析】 (KPCA)，作为深度学习中【无监督】贪心逐层预训练步骤的一种方法。
第 <math>l+1</math> 层学习前一层 <math>l</math> 的表征，提取层 <math>l</math> 的输出在核所诱导的特征域中的投影的 <math>n_l</math> 个【主成分】(PC)。为了对每层更新后的表征进行【降维】，【监督策略】从KPCA提取的特征中选择信息量最大的特征。过程是：
* 根据特征与类标签的【互信息】，对 <math>n_l</math> 个特征排序；
* 对 ''K'' 和 <math>m_l \in\{1, \ldots, n_l\}</math> 的不同取值，在【验证集】上只使用 <math>m_l</math> 个信息量最大的特征，计算【k最近邻】(K-NN) 分类器的分类错误率；
* 使分类器达到最低错误率的 <math>m_l</math> 值决定保留特征的数量。
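The three-step selection procedure above can be sketched with scikit-learn building blocks (KernelPCA, mutual_info_classif and KNeighborsClassifier). The synthetic dataset, the RBF kernel parameters and the candidate values of ''K'' and <math>m_l</math> are illustrative choices, not values from the cited work.
<syntaxhighlight lang="python">
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import KernelPCA
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

# One "layer": project onto the leading kernel principal components ...
kpca = KernelPCA(n_components=10, kernel="rbf", gamma=0.05).fit(X_tr)
F_tr, F_va = kpca.transform(X_tr), kpca.transform(X_va)

# ... rank the extracted features by mutual information with the labels ...
order = np.argsort(mutual_info_classif(F_tr, y_tr, random_state=0))[::-1]

# ... and pick (K, m_l) by the validation error of a K-NN classifier.
best = min(
    ((1.0 - KNeighborsClassifier(n_neighbors=K)
                .fit(F_tr[:, order[:m]], y_tr)
                .score(F_va[:, order[:m]], y_va), K, m)
     for K in (1, 3, 5) for m in range(1, 11)),
)
print("validation error %.3f with K=%d and m_l=%d" % best)
</syntaxhighlight>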
作为MKM构建单元的KPCA方法存在一些缺点。
人们开发了一种将核机器用于深度学习的更直接的方法，用于口语理解。其主要思想是使用核机器来近似具有无限多隐藏单元的浅层神经网络，然后使用【叠加】将核机器的输出与原始输入拼接起来，以构建下一个更高层的核机器。深度凸网络的层数是整个系统的超参数，通过交叉验证确定。
== 神经结构搜索(Neural architecture search) ==
神经结构搜索 (NAS)使用机器学习自动化ANN的设计。多种NAS的方法设计出了与手工设计系统很好媲美的网络。基本搜索算法是提议候选模型,使用数据集评价它并使用结果作为反馈教给NAS网络。
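A minimal sketch of the propose–evaluate–feedback loop only: the cited work trains an RNN controller with reinforcement learning, whereas this toy version substitutes plain random search over small MLP architectures (scikit-learn's MLPClassifier) to keep the example short; the candidate sizes are illustrative.
<syntaxhighlight lang="python">
import random
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

rng = random.Random(0)
best_score, best_arch = -1.0, None
for _ in range(10):
    # propose a candidate architecture ...
    arch = tuple(rng.choice([8, 16, 32, 64]) for _ in range(rng.randint(1, 3)))
    # ... evaluate it against held-out data ...
    model = MLPClassifier(hidden_layer_sizes=arch, max_iter=300,
                          random_state=0).fit(X_tr, y_tr)
    score = model.score(X_va, y_va)
    # ... and feed the result back into the search (here: keep the best).
    if score > best_score:
        best_score, best_arch = score, arch
print(best_arch, round(best_score, 3))
</syntaxhighlight>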
==使用 ==
使用ANN需要理解它们的特征。
* 模型的选择: 这取决于数据的表示和应用。过复杂的模型减慢学习。
* 学习算法: 学习算法之间存在多种交易。在特定数据集上训练时,只要有正确的【超参数】,几乎任何算法都会有效。但是,在不可见的数据上训练时,选择和调整算法需要许多试验。
* 鲁棒性: 如果适当地选择了模型,损失函数和学习算法,产生的ANN会是鲁棒的。
ANN的能力大致可归入以下几个宽泛的类别：
* 【函数逼近】或【回归分析】，包括【时间序列预测】、【适应度逼近】和建模。
* 【分类】，包括【模式】和序列识别、【新奇检测】和序列决策。
* 【数据处理】，包括滤波、聚类、【盲源分离】和压缩。
* 【机器人】，包括引导机械臂和【假体】。
* 【控制】，包括【计算机数控】。
==应用==
由于他们重现和模拟非线性过程的能力,人工神经网络在广泛的领域建立了很多应用。
应用领域包括【系统识别】和控制(车辆控制,弹道预测,【过程控制】,【自然资源管理】),量子化学,玩游戏和【决策】(西洋双陆棋,国际象棋,扑克),【模式识别】(雷达系统,【人脸识别】,信号分类,物体识别和其他),序列识别(姿态,语音,手写和印刷文本),【医疗诊断】,金融(例如【自动交易系统】),【数据挖掘】,可视化,【机器翻译】,社交网络滤波和【垃圾邮件】滤波。
ANN被用于诊断癌症,包括【肺癌】,【前列腺癌】,【结肠直肠癌】和只使用细胞形状信息区分高度浸润性癌细胞系和较少浸润性系。
ANN被用于加速基础设施遭受自然灾害的可靠性分析。
ANN也被用于在【地球科学】中建立黑箱模型,【水文学】,海洋建模和【海岸工程】只是其中很少的几个例子。
===模型的类型===
许多类型的模型被使用,在不同级定义的抽象概念并建模神经系统的不同方面。他们包括从【个体神经元】短期行为的模型,神经环路动力学如何从个体神经元交互中产生的模型,到行为如何从代表完整子系统的抽象神经模块中产生的模型。这些包括神经系统和它们与从个体神经元到系统层面学习、记忆的关系的长期,短期可塑性模型。
==理论性质(Theoretical properties)==
===计算能力(Computational power)===
【多层感知机】是一个通用函数逼近器, 被【通用逼近理论】证明。然而,考虑到所需神经元的数量,网络拓扑,权重和学习参数,证明是没有建设性的。
一种特殊的带有理值权(与全精度【实数】值权相对)的循环结构具有一个【通用图灵机】的完整能力,通过使用有限数量的神经元和标准线性连接。另外,无理值权导致机器带有【超图灵】能力。
===能力(Capacity)===
模型的 "能力" 性质大概对应于它们建模任意给定函数的能力。这与能被储存在网络的信息量和复杂性的概念相关。
===收敛(Convergence)===
模型可能不一致收敛于一个单独解,首先由于可能存在许多局部最小值,取决于损失函数和模型。其次,当从距离任何局部最小值较远处开始时,使用的优化方法可能不保证收敛。再次,对于足够大的数据或参数,一些方法变得不可行。然而,对于【CMAC】神经网络,引入递推最小二乘算法训练它,这个算法可以保证一步收敛。
===泛化和统计(Generalization and statistics)===
目的是建立对未见例子泛化较好系统的应用,面临过度训练的可能。这当网络能力显著超过所需的自由参数时,在复杂的或过特殊的系统出现。有两种处理过度训练的方法。第一种是使用【交叉验证】和相似的技术检查过度训练的存在并最佳选择超参数最小化泛化误差。第二种是使用某种形式的【正则化】。这个概念在概率的(贝叶斯)框架中产生,其中正则化可以通过选择更大的对更简单模型的先验概率实现;但是在统计学习理论中,目标是最小化两个数量:‘经验风险’和‘结构风险’,它们大概对应于训练集上的误差和在未见数据中由于过拟合的预测误差。
【File:Synapse deployment.jpg|thumb|right|一个神经网络的置信度分析】
使用【均方误差】(MSE)损失函数的监督神经网络可以使用正式的统计方法来确定训练好的模型的置信度。在验证集上的MSE可以被用作方差的估计。这个值接着可以被用于计算网络输出的【置信区间】,假定【正态分布】的情况下。这样的置信度分析只要输出【概率分布】保持相同,网络没有被改变,就是统计学有效的。
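A minimal sketch of the interval construction described above, assuming Gaussian output errors and using the validation-set MSE as the variance estimate; the prediction values and the MSE are illustrative numbers.
<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import norm

def prediction_interval(y_pred, val_mse, level=0.95):
    """Approximate confidence interval for network outputs, using the
    validation-set MSE as a variance estimate and assuming Gaussian errors."""
    z = norm.ppf(0.5 + level / 2.0)          # e.g. about 1.96 for 95%
    half_width = z * np.sqrt(val_mse)
    return y_pred - half_width, y_pred + half_width

# toy usage: predictions with a validation MSE of 0.04 (std about 0.2)
y_pred = np.array([1.3, 2.7, 0.9])
lo, hi = prediction_interval(y_pred, val_mse=0.04)
print(np.round(lo, 2), np.round(hi, 2))      # roughly +/- 0.39 around each value
</syntaxhighlight>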
通过在神经网络的输出层（或基于组件的神经网络中的softmax组件）对类别型目标变量使用【柔性最大值传输(softmax)激活函数】（【逻辑函数】的一种推广），输出可以被解释为后验概率。这在分类中十分有用，因为它为分类结果提供了确定性度量。
柔性最大值传输函数的激活函数是:
:<math>y_i=\frac{e^{x_i}}{\sum_{j=1}^c e^{x_j}}</math>
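The formula above in code, as a numerically stable sketch (subtracting the maximum cancels in the ratio but avoids overflow in the exponential):
<syntaxhighlight lang="python">
import numpy as np

def softmax(x):
    """Numerically stable softmax over a vector of raw network outputs x_i."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])          # raw network outputs x_i
p = softmax(scores)
print(p, p.sum())                           # about [0.659 0.242 0.099], sums to 1
</syntaxhighlight>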
==批评==
===训练问题(Training issues)===
一个对神经网络通常的批评,特别是在机器人领域,是它们需要太多训练才能在真实世界中操作。潜在的解决方法包括随机混排训练例子,在根据一个例子改变网络连接时,通过使用不走过大步的数值优化算法和分组例子成微型批次。提高训练效率和收敛能力一直是神经网络前进的研究领域。例如通过在【CMAC】神经网络中引入递推最小二乘算法, 训练过程只需要一步收敛。
===理论问题(Theoretical issues)===
没有神经网络解决了计算困难的问题例如【八皇后】问题,【旅行商问题】或【整数因子分解】对于大整数的问题。
一个根本的缺点是它们不反映真实神经元如何运作。反向传播是多数人工神经网络的关键部分，尽管生物神经网络中并不存在这种机制。真实神经元如何编码信息仍然未知。【感觉神经元】在感受器被激活时更频繁地发放【动作电位】，而【肌细胞】在其相关联的【运动神经元】更频繁地接收到动作电位时收缩得更强烈。除了从感觉神经元向运动神经元传递信息的情形之外，人们对生物神经网络处理信息的原则几乎一无所知。
ANN背后的动机不一定是严格复制神经功能，而是把生物神经网络作为一种启发。因此ANN的一个核心主张是，它体现了某种新的、强大的处理信息的通用原则。不幸的是，这些通用原则定义模糊。人们常常声称它们是从网络自身【突现】的。这使得简单的统计关联（人工神经网络的基本功能）可以被描述成学习或识别。【Alexander Dewdney】因此评论道，人工神经网络有一种“不劳而获的特质，它带来一种特有的懒散气氛，以及对这些计算系统到底有多好明显缺乏好奇心。没有人手（或人脑）的干预；解好像通过魔法一样被找到；而且似乎没有人学到任何东西”。
正如大脑解剖记录的那样,生物的大脑使用浅的和深的环路,显示出广泛的不变性。Weng 反驳说大脑自己的线路主要根据信号统计,因此连续串联不能捕获所有主要统计依赖。
===硬件问题(Hardware issues)===
大而有效的神经网络需要相当大的计算资源。大脑有为信号处理任务定制的硬件,通过神经元的【图】,在【冯诺依曼结构】中模拟简化的神经元可能迫使神经网络设计者填充数百万的【数据库】行为了它的连接{{snd}}它可以消耗大量【内存】和存储。另外,设计者通常需要在许多这种连接和它们相关的神经元间传输信号{{snd}}这必须总是与巨大的【CPU】处理能力和时间相匹配。
【Schmidhuber】指出，二十一世纪神经网络的复兴主要归功于硬件的进步：从1991到2015年，计算能力，特别是由【GPGPU】（在GPU上）提供的计算能力，增长了大约一百万倍，使得标准反向传播算法对于训练比从前深几层的网络变得可行。并行GPU的使用可以将训练时间从几个月缩短到几天。
【神经形态工程】通过构造非冯诺依曼芯片、直接用电路实现神经网络，来直接应对硬件上的困难。另一种为神经网络处理而优化的芯片称为【张量处理单元】(TPU)。
===对批评的实际反例===
反驳 Dewdney观点的争论是神经网络成功地用于解决许多复杂且多变的任务,范围从自动飞行飞机到检测信用卡诈骗到掌握【Go】游戏。
科技作者Roger Bridgman评论道:
<blockquote>例如，神经网络之所以成为众矢之的，不仅因为它们被炒作上了天（什么东西没有被炒作过？），还因为你可以在不理解其工作原理的情况下创造出一个成功的网络：捕获其行为的那一串数字很可能是“一个不透明的、无法读懂的表格……作为科学资源毫无价值”。
尽管他着重声明科学不是技术，Dewdney在这里似乎把神经网络当作坏科学来抨击，而大多数设计它们的人只是在努力做好的工程师。一个有用的机器能够读取的无法读懂的表格仍然很值得拥有。
</blockquote>
尽管分析一个人工神经网络学到了什么很困难,这样做比分析一个生物的神经网络容易的多。此外,参与探索神经网络学习算法的研究者正渐渐找出使学习机器成功的通用准则。例如局部还是非局部学习,浅还是深度结构。
===混合方法(Hybrid approaches)===
混合模型(结合了神经网络和符号化方法)的拥护者声称这种混合可以更好地捕获人类大脑的机制
==类型(Types)==
<!-- Split to [[Types of artificial neural networks]] -->
{{Main|Types of artificial neural networks}}
人工神经网络有很多类型。最简单的静态类型有一个或多个静态部分,包括一些单元,一些层,单元权重和【拓扑学】。动态类型允许这些中的一个或多个在学习过程中变化。后者更复杂,但是可以缩短学习时长并且产生更好的结果。一些类型允许/需要被操作“监督”,而另一些操作独立。一些类型的操作完全在硬件层面,而其他的完全在软件而且在通用计算机上运行。
==图片==