{{Short description|Machine learning technique}}
{{Machine learning|Problems}}

'''Unsupervised learning''' is a type of algorithm that learns patterns from untagged data. The hope is that through mimicry, which is an important mode of learning in people, the machine is forced to build a compact internal representation of its world and then generate imaginative content from it. In contrast to [[supervised learning]] where data is tagged by an expert, e.g. as a "ball" or "fish", unsupervised methods exhibit self-organization that captures patterns as probability densities <ref name=Hinton99a /> or a combination of neural feature preferences. The other levels in the supervision spectrum are [[reinforcement learning]] where the machine is given only a numerical performance score as guidance, and [[semi-supervised learning]] where a smaller portion of the data is tagged.

Two broad methods in unsupervised learning are [[neural networks]] and [[Probabilistic methods in artificial intelligence|probabilistic methods]].

== Neural networks ==

=== Tasks vs. methods ===

[[File:Task-guidance.png|thumb|300px|Tendency for a task to employ Supervised vs. Unsupervised methods. Task names straddling circle boundaries is intentional. It shows that the classical division of imaginative tasks (left) employing unsupervised methods is blurred in today's learning schemes.]]

[[Neural network]] tasks are often categorized as discriminative (recognition) or generative (imagination). Often but not always, discriminative tasks use supervised methods and generative tasks use unsupervised (see [[Venn diagram]]); however, the separation is very hazy. For example, object recognition favors supervised learning but unsupervised learning can also cluster objects into groups. Furthermore, as progress marches onward, some tasks employ both methods, and some tasks swing from one to another. For example, image recognition started off as heavily supervised, but became hybrid by employing unsupervised pre-training, and then moved towards supervision again with the advent of dropout, ReLU, and adaptive learning rates.

=== Training ===
During the learning phase, an unsupervised network tries to mimic the data it's given and uses the error in its mimicked output to correct itself (i.e., correct its weights and biases). This resembles the mimicry behavior of children as they learn a language. Sometimes the error is expressed as a low probability that the erroneous output occurs, or it might be expressed as an unstable high energy state in the network.

In contrast to supervised methods' dominant use of [[backpropagation]], unsupervised learning also employs other methods including: Hopfield learning rule, Boltzmann learning rule, Contrastive Divergence, Wake Sleep, Variational Inference, Maximum Likelihood, Maximum A Posteriori, Gibbs Sampling, and backpropagating reconstruction errors or hidden state reparameterizations. See the table below for more details.
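As an illustration of one of these schemes, backpropagating reconstruction errors, the following is a minimal NumPy sketch of a one-hidden-layer autoencoder (not taken from any particular library); the data, layer sizes and learning rate are placeholder assumptions.

<syntaxhighlight lang="python">
# Minimal sketch: backpropagating reconstruction errors with a one-hidden-layer
# autoencoder. All sizes, data and the learning rate are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 8))            # 100 unlabeled samples, 8 features

n_in, n_hidden = 8, 3
W1 = rng.normal(0, 0.1, (n_in, n_hidden))   # encoder weights
W2 = rng.normal(0, 0.1, (n_hidden, n_in))   # decoder weights
lr = 0.1

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

for epoch in range(500):
    H = sigmoid(X @ W1)             # compact internal representation
    X_hat = H @ W2                  # the network's "mimicked" output
    err = X_hat - X                 # reconstruction error: the only teaching signal
    # gradients of the mean squared reconstruction error
    dW2 = H.T @ err / len(X)
    dH = err @ W2.T * H * (1 - H)
    dW1 = X.T @ dH / len(X)
    W1 -= lr * dW1
    W2 -= lr * dW2

print("final reconstruction MSE:", np.mean((sigmoid(X @ W1) @ W2 - X) ** 2))
</syntaxhighlight>

The only teaching signal is the mismatch between the input and its reconstruction, which is what makes the scheme unsupervised.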

=== Energy ===
An energy function is a macroscopic measure of a network's activation state. In Boltzmann machines, it plays the role of the Cost function. This analogy with physics is inspired by Ludwig Boltzmann's analysis of a gas' macroscopic energy from the microscopic probabilities of particle motion p <math>\propto</math> e<sup>−E/kT</sup>, where k is the Boltzmann constant and T is temperature. In the RBM network the relation is p = e<sup>−E</sup> / Z,<ref name="Hinton2010" /> where p & E vary over every possible activation pattern and Z = <math> \sum_{AllPatterns} </math> e<sup>−E(pattern)</sup>. To be more precise, p(a) = e<sup>−E(a)</sup> / Z, where a is an activation pattern of all neurons (visible and hidden). Hence, early neural networks bear the name Boltzmann Machine. Paul Smolensky calls −E the Harmony. A network seeks low energy which is high Harmony.
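As a toy illustration of these formulas, the sketch below enumerates every activation pattern of a three-unit network with made-up symmetric weights and biases, computes each pattern's energy, and normalizes e<sup>−E</sup> by Z to obtain p(a).

<syntaxhighlight lang="python">
# Toy illustration of p(a) = exp(-E(a)) / Z over all binary activation patterns.
# The 3-unit network, its symmetric weights w and biases theta are made-up values.
import itertools
import numpy as np

w = np.array([[ 0.0, 1.0, -0.5],
              [ 1.0, 0.0,  2.0],
              [-0.5, 2.0,  0.0]])     # symmetric, zero diagonal
theta = np.array([0.1, -0.2, 0.3])

def energy(s):
    # E = -1/2 * sum_ij w_ij s_i s_j + sum_i theta_i s_i (the Gibbs energy used later in the comparison table)
    return -0.5 * s @ w @ s + theta @ s

patterns = [np.array(p) for p in itertools.product([0, 1], repeat=3)]
Z = sum(np.exp(-energy(s)) for s in patterns)          # partition function
for s in patterns:
    print(s, "E =", round(float(energy(s)), 3), "p =", round(float(np.exp(-energy(s)) / Z), 3))
</syntaxhighlight>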

=== Networks ===
This table shows connection diagrams of various unsupervised networks, the details of which will be given in the section Comparison of Networks. Circles are neurons and edges between them are connection weights. As network design changes, features are added on to enable new capabilities or removed to make learning faster. For instance, neurons change between deterministic (Hopfield) and stochastic (Boltzmann) to allow robust output, weights are removed within a layer (RBM) to hasten learning, or connections are allowed to become asymmetric (Helmholtz).

{| class="wikitable"
|-
! Hopfield !! Boltzmann !! RBM !! Stacked Boltzmann !! Helmholtz !! Autoencoder !! VAE
|-
| [[File:Hopfield-net-vector.svg |thumb|A network based on magnetic domains in iron with a single self-connected layer. It can be used as a content addressable memory.]]
|| [[File:Boltzmannexamplev1.png |thumb|Network is separated into 2 layers (hidden vs. visible), but still using symmetric 2-way weights. Following Boltzmann's thermodynamics, individual probabilities give rise to macroscopic energies.]]
|| [[File:Restricted Boltzmann machine.svg|thumb|Restricted Boltzmann Machine. This is a Boltzmann machine where lateral connections within a layer are prohibited to make analysis tractable.]]
|| [[File:Stacked-boltzmann.png|thumb| This network has multiple RBM's to encode a hierarchy of hidden features. After a single RBM is trained, another blue hidden layer (see left RBM) is added, and the top 2 layers are trained as a red & blue RBM. Thus the middle layers of a stacked RBM act as hidden or visible, depending on the training phase they are in.]]
|| [[File:Helmholtz Machine.png |thumb|Instead of the bidirectional symmetric connection of the stacked Boltzmann machines, we have separate one-way connections to form a loop. It does both generation and discrimination.]]
|| [[File:Autoencoder_schema.png |thumb|A feed forward network that aims to find a good middle layer representation of its input world. This network is deterministic, so it's not as robust as its successor the VAE.]]
|| [[File:VAE blocks.png |thumb|Applies Variational Inference to the Autoencoder. The middle layer is a set of means & variances for Gaussian distributions. The stochastic nature allows for more robust imagination than the deterministic autoencoder. ]]
|}

Of the networks bearing people's names, only Hopfield worked directly with neural networks. Boltzmann and Helmholtz came before artificial neural networks, but their work in physics and physiology inspired the analytical methods that were used.

=== History ===

{| class="wikitable"
|-
| 1969 || Perceptrons by Minsky & Papert shows that a perceptron without hidden layers fails on XOR
|-
| 1970s || (approximate dates) AI winter I
|-
| 1974 || Ising magnetic model proposed by WA Little for cognition
|-
| 1980 || Fukushima introduces the neocognitron, which is later called a convolutional neural network. It is mostly used in SL, but deserves a mention here.
|-
| 1982 || Ising variant Hopfield net described as CAMs and classifiers by John Hopfield.
|-
| 1983 || Ising variant Boltzmann machine with probabilistic neurons described by Hinton & Sejnowski following Sherrington & Kirkpatrick's 1975 work.
|-
| 1986 || Paul Smolensky publishes Harmony Theory, which is an RBM with practically the same Boltzmann energy function. Smolensky did not give a practical training scheme; Hinton did in the mid-2000s.
|-
| 1995 || Schmidhuber introduces the LSTM neuron for languages.
|-
| 1995 || Dayan & Hinton introduce the Helmholtz machine
|-
| 1995-2005 || (approximate dates) AI winter II
|-
| 2013 || Kingma, Rezende, & co. introduced Variational Autoencoders as a Bayesian graphical probability network, with neural nets as components.
|}

=== Specific Networks ===

Here, we highlight some characteristics of select networks. The details of each are given in the comparison table below.

{{glossary}}
{{term |1=[[Hopfield Network]]}}
{{defn |1=Ferromagnetism inspired Hopfield networks. A neuron corresponds to an iron domain with binary magnetic moments Up and Down, and neural connections correspond to the domains' influence on each other. Symmetric connections enable a global energy formulation. During inference the network updates each state using the standard activation step function. Symmetric weights and the right energy functions guarantee convergence to a stable activation pattern. Asymmetric weights are difficult to analyze. Hopfield nets are used as Content Addressable Memories (CAM).}}

{{term |1=[[Boltzmann Machine]]}}
{{defn |1=These are stochastic Hopfield nets. Their state value is sampled from a Bernoulli distribution as follows: suppose a binary neuron fires with the Bernoulli probability p(1) = 1/3 and rests with p(0) = 2/3. One samples from it by taking a uniformly distributed random number y and plugging it into the inverted cumulative distribution function, which in this case is the step function thresholded at 2/3. The inverse function = { 0 if x <= 2/3, 1 if x > 2/3 } (a short sketch of this sampling step appears after the glossary). }}

{{term |1=Sigmoid Belief Net}}
{{defn |1=Introduced by Radford Neal in 1992, this network applies ideas from probabilistic graphical models to neural networks. A key difference is that nodes in graphical models have pre-assigned meanings, whereas Belief Net neurons' features are determined after training. The network is a sparsely connected directed acyclic graph composed of binary stochastic neurons. The learning rule comes from Maximum Likelihood on p(X): Δw<sub>ij</sub> <math>\propto</math> s<sub>j</sub> * (s<sub>i</sub> - p<sub>i</sub>), where p<sub>i</sub> = 1 / ( 1 + e<sup>weighted inputs into neuron i</sup> ). s<sub>j</sub>'s are activations from an unbiased sample of the posterior distribution and this is problematic due to the Explaining Away problem raised by Judea Pearl. [[Variational Bayesian methods]] use a surrogate posterior and blatantly disregard this complexity. }}

{{term |1= [[Deep Belief Network]] }}
{{defn |1=Introduced by Hinton, this network is a hybrid of RBM and Sigmoid Belief Network. The top two layers form an RBM and the layers below form a sigmoid belief network. One trains it by the stacked RBM method and then throws away the recognition weights below the top RBM. As of 2009, 3-4 layers seems to be the optimal depth.<ref name=HintonMlss2009/> }}

{{term |1=[[Helmholtz machine]]}}
{{defn |1=These are early inspirations for the Variational Auto Encoders. It is two networks combined into one: forward weights operate recognition and backward weights implement imagination. It is perhaps the first network to do both. Helmholtz did not work in machine learning but he inspired the view of "statistical inference engine whose function is to infer probable causes of sensory input" (3). The stochastic binary neurons output a probability that their state is 0 or 1. The data input is normally not considered a layer, but in the Helmholtz machine generation mode, the data layer receives input from the middle layer and has separate weights for this purpose, so it is considered a layer. Hence this network has 3 layers.}}

{{term |1=[[Variational autoencoder]]}}
{{defn |1=These are inspired by Helmholtz machines and combine probability networks with neural networks. An Autoencoder is a 3-layer CAM network, where the middle layer is supposed to be some internal representation of input patterns. The encoder neural network is a probability distribution q<sub>φ</sub>(z given x) and the decoder network is p<sub>θ</sub>(x given z). The weights are named phi & theta rather than W and V as in Helmholtz, a cosmetic difference. These 2 networks here can be fully connected, or use another NN scheme.
}}
{{glossary end}}
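The inverse-CDF sampling described in the Boltzmann Machine entry above can be written in a few lines; the firing probability p(1) = 1/3 is the example value from that entry.

<syntaxhighlight lang="python">
# Sampling the stochastic binary neuron from the Boltzmann Machine entry above:
# fire with p(1) = 1/3 by pushing a uniform sample through the inverted CDF,
# i.e. a step function thresholded at p(0) = 2/3.
import random

def sample_binary_neuron(p_fire=1.0 / 3.0):
    y = random.random()                      # uniformly distributed in [0, 1)
    return 0 if y <= 1.0 - p_fire else 1     # inverse CDF: 0 if y <= 2/3, else 1

samples = [sample_binary_neuron() for _ in range(10000)]
print("empirical firing rate:", sum(samples) / len(samples))   # approximately 1/3
</syntaxhighlight>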

=== Comparison of networks ===

{| class="wikitable"
|-
! !! Hopfield !! Boltzmann !! RBM !! Stacked RBM !! Helmholtz !! Autoencoder !! VAE
|-
| '''Usage & notables''' || CAM, traveling salesman problem || CAM. The freedom of connections makes this network difficult to analyze. || pattern recognition. used in MNIST digits and speech. || recognition & imagination. trained with unsupervised pre-training and/or supervised fine tuning. || imagination, mimicry || <!--AE--> language: creative writing, translation. vision: enhancing blurry images || generate realistic data
|-
| '''Neuron''' || deterministic binary state. Activation = { 0 (or -1) if x is negative, 1 otherwise } || stochastic binary Hopfield neuron || ← same. (extended to real-valued in mid 2000s) || ← same || ← same || <!--AE--> language: LSTM. vision: local receptive fields. usually real valued relu activation. || middle layer neurons encode means & variances for Gaussians. In run mode (inference), the output of the middle layer are sampled values from the Gaussians.
|-
| '''Connections''' || 1-layer with symmetric weights. No self-connections. || 2-layers. 1-hidden & 1-visible. symmetric weights. || ← same. <br>no lateral connections within a layer. || top layer is undirected, symmetric. other layers are 2-way, asymmetric. || 3-layers: asymmetric weights. 2 networks combined into 1. || <!--AE--> 3-layers. The input is considered a layer even though it has no inbound weights. recurrent layers for NLP. feedforward convolutions for vision. input & output have the same neuron counts. || 3-layers: input, encoder, distribution sampler decoder. the sampler is not considered a layer (e)
|-
| '''Inference & energy''' || Energy is given by Gibbs probability measure :<math>E = -\frac12\sum_{i,j}{w_{ij}{s_i}{s_j}}+\sum_i{\theta_i}{s_i}</math> || ← same || ← same || <!-- --> || minimize KL divergence || inference is only feed-forward. previous UL networks ran forwards AND backwards || minimize loss = reconstruction error + KL divergence
|-
| '''Training''' || Δw<sub>ij</sub> = s<sub>i</sub>*s<sub>j</sub>, for +1/-1 neuron || Δw<sub>ij</sub> = e*(p<sub>ij</sub> - p'<sub>ij</sub>). This is derived from minimizing KLD. e = learning rate, p' = predicted and p = actual distribution.
|| Δw<sub>ij</sub> = e*( < v<sub>i</sub> h<sub>j</sub> ><sub>data</sub> - < v<sub>i</sub> h<sub>j</sub> ><sub>equilibrium</sub> ). This is a form of contrastive divergence w/ Gibbs Sampling. "<>" are expectations. || ← similar. train 1-layer at a time. approximate equilibrium state with a 3-segment pass. no back propagation. || wake-sleep 2 phase training || <!--AE--> back propagate the reconstruction error || reparameterize hidden state for backprop
|-
| '''Strength''' || resembles physical systems so it inherits their equations || ← same. hidden neurons act as internal representation of the external world || faster more practical training scheme than Boltzmann machines || trains quickly. gives hierarchical layer of features || mildly anatomical. analyzable w/ information theory & statistical mechanics || <!--AE--> || <!--VAE-->
|-
| '''Weakness''' || <!--hopfield--> || hard to train due to lateral connections || <!--RBM--> equilibrium requires too many iterations || integer & real-valued neurons are more complicated. || <!--Helmholtz--> || <!--AE--> || <!--VAE-->
|}
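The RBM training rule in the table above, Δw<sub>ij</sub> = e*( < v<sub>i</sub> h<sub>j</sub> ><sub>data</sub> - < v<sub>i</sub> h<sub>j</sub> ><sub>equilibrium</sub> ), is sketched below as CD-1 contrastive divergence, with a single Gibbs step standing in for the equilibrium estimate; the data, layer sizes and learning rate are placeholder assumptions, and bias terms are omitted for brevity.

<syntaxhighlight lang="python">
# Minimal CD-1 sketch of the RBM update from the table above (biases omitted).
# Sizes, data and learning rate are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 6, 4, 0.05
W = rng.normal(0, 0.1, (n_visible, n_hidden))
V = (rng.random((200, n_visible)) > 0.5).astype(float)   # fake binary training data

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

for step in range(100):
    # positive phase: hidden probabilities driven by the data
    ph_data = sigmoid(V @ W)
    h_sample = (rng.random(ph_data.shape) < ph_data).astype(float)
    # negative phase: one Gibbs step back to a reconstruction (stand-in for equilibrium)
    pv_recon = sigmoid(h_sample @ W.T)
    ph_recon = sigmoid(pv_recon @ W)
    # contrastive divergence update: <v h>_data - <v h>_reconstruction
    W += lr * (V.T @ ph_data - pv_recon.T @ ph_recon) / len(V)
</syntaxhighlight>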

'''Hebbian Learning, ART, SOM'''<br>
The classical example of unsupervised learning in the study of neural networks is [[Donald Hebb]]'s principle, that is, neurons that fire together wire together.<ref name="Buhmann" /> In [[Hebbian learning]], the connection is reinforced irrespective of an error, but is exclusively a function of the coincidence between action potentials between the two neurons.<ref name="Comesana" /> A similar version that modifies synaptic weights takes into account the time between the action potentials ([[spike-timing-dependent plasticity]] or STDP). Hebbian Learning has been hypothesized to underlie a range of cognitive functions, such as [[pattern recognition]] and experiential learning.
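A minimal sketch of the Hebbian idea: weights grow with the coincidence of activity (Δw<sub>ij</sub> = s<sub>i</sub>s<sub>j</sub>, as in the Hopfield column of the comparison table above), which turns a small network into a toy content-addressable memory. The stored patterns below are made up for illustration.

<syntaxhighlight lang="python">
# Minimal Hebbian / Hopfield sketch: "fire together, wire together".
# Two made-up +/-1 patterns are stored with the outer-product rule and one is
# recalled from a corrupted cue; this is a toy content-addressable memory.
import numpy as np

patterns = np.array([[ 1, -1,  1, -1,  1, -1],
                     [ 1,  1, -1, -1,  1,  1]])

# Hebbian storage: w_ij accumulates s_i * s_j over patterns, no self-connections
W = sum(np.outer(p, p) for p in patterns).astype(float)
np.fill_diagonal(W, 0.0)

def recall(state, steps=10):
    state = state.copy()
    for _ in range(steps):                      # repeated sign updates
        state = np.where(W @ state >= 0, 1, -1)
    return state

cue = patterns[0].copy()
cue[0] *= -1                                    # flip one bit of the stored pattern
print("recalled:", recall(cue))                 # should match patterns[0]
</syntaxhighlight>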

Among [[Artificial neural network|neural network]] models, the [[self-organizing map]] (SOM) and [[adaptive resonance theory]] (ART) are commonly used in unsupervised learning algorithms. The SOM is a topographic organization in which nearby locations in the map represent inputs with similar properties. The ART model allows the number of clusters to vary with problem size and lets the user control the degree of similarity between members of the same clusters by means of a user-defined constant called the vigilance parameter. ART networks are used for many pattern recognition tasks, such as [[automatic target recognition]] and seismic signal processing.<ref name="Carpenter" />
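A minimal sketch of the SOM idea described above: for each input, the best-matching unit on a one-dimensional map is found and it, together with its map neighbors, is pulled toward the input, so that nearby map locations end up representing similar inputs. The map size, data and learning schedule are arbitrary assumptions.

<syntaxhighlight lang="python">
# Minimal 1-D self-organizing map sketch: nearby map locations learn to
# represent similar inputs. Map size, data and rates are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((500, 2))                 # unlabeled 2-D inputs
n_units = 10
weights = rng.random((n_units, 2))          # one prototype vector per map location

lr, sigma = 0.5, 2.0
for t, x in enumerate(data):
    bmu = np.argmin(np.linalg.norm(weights - x, axis=1))     # best-matching unit
    # neighborhood on the 1-D map: closer map positions move more
    dist = np.abs(np.arange(n_units) - bmu)
    h = np.exp(-(dist ** 2) / (2 * sigma ** 2))
    decay = 1.0 - t / len(data)                               # shrink updates over time
    weights += (lr * decay) * h[:, None] * (x - weights)

print(np.round(weights, 2))   # prototypes ordered roughly along the map
</syntaxhighlight>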

== Probabilistic methods ==
Two of the main methods used in unsupervised learning are [[Principal component analysis|principal component analysis]] and [[cluster analysis]]. [[Cluster analysis]] is used in unsupervised learning to group, or segment, datasets with shared attributes in order to extrapolate algorithmic relationships.<ref name="tds-ul" /> Cluster analysis is a branch of [[machine learning]] that groups the data that has not been [[Labeled data|labelled]], classified or categorized. Instead of responding to feedback, cluster analysis identifies commonalities in the data and reacts based on the presence or absence of such commonalities in each new piece of data. This approach helps detect anomalous data points that do not fit into either group.

A central application of unsupervised learning is in the field of [[density estimation]] in [[statistics]],<ref name="JordanBishop2004" /> though unsupervised learning encompasses many other domains involving summarizing and explaining data features. It can be contrasted with supervised learning by saying that whereas supervised learning intends to infer a [[conditional probability distribution]] conditioned on the label of input data, unsupervised learning intends to infer an [[a priori probability]] distribution.
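As a small illustration of density estimation in the unsupervised sense, the sketch below estimates p(x) from unlabeled samples with a Gaussian kernel density estimate; the data and bandwidth are placeholder assumptions.

<syntaxhighlight lang="python">
# Minimal density-estimation sketch: estimate the prior p(x) from unlabeled
# samples with a Gaussian kernel density estimate. Data and bandwidth are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(1, 1.0, 700)])

def kde(x, data, bandwidth=0.3):
    # average of Gaussian kernels centered on each observed sample
    z = (x - data[:, None]) / bandwidth
    return np.mean(np.exp(-0.5 * z ** 2), axis=0) / (bandwidth * np.sqrt(2 * np.pi))

grid = np.linspace(-4, 4, 9)
print(np.round(kde(grid, samples), 3))   # estimated p(x) on a coarse grid
</syntaxhighlight>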

=== Approaches ===
Some of the most common algorithms used in unsupervised learning include: (1) Clustering, (2) Anomaly detection, (3) Approaches for learning latent variable models. Each approach uses several methods as follows:

* [[Data clustering|Clustering]] methods include: [[hierarchical clustering]],<ref name="Hastie" /> [[k-means]],<ref name="tds-kmeans" /> [[mixture models]], [[DBSCAN]], and [[OPTICS algorithm]]
* [[Anomaly detection]] methods include: [[Local Outlier Factor]], and [[Isolation Forest]]
* Approaches for learning [[latent variable model]]s such as [[Expectation–maximization algorithm]] (EM), [[Method of moments (statistics)|Method of moments]], and [[Blind signal separation]] techniques ([[Principal component analysis]], [[Independent component analysis]], [[Non-negative matrix factorization]], [[Singular value decomposition]])
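To make the clustering bullet above concrete, here is a minimal sketch of k-means (Lloyd's algorithm) on made-up two-dimensional data; the number of clusters and iteration count are arbitrary assumptions.

<syntaxhighlight lang="python">
# Minimal k-means (Lloyd's algorithm) sketch for the clustering bullet above.
# Data, k and the iteration count are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, (50, 2)) for loc in ([0, 0], [3, 3], [0, 3])])

k = 3
centroids = X[rng.choice(len(X), k, replace=False)]
for _ in range(20):
    # assignment step: nearest centroid for every point
    labels = np.argmin(np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2), axis=1)
    # update step: recompute each centroid as the mean of its assigned points
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print(np.round(centroids, 2))   # should sit near (0,0), (3,3) and (0,3)
</syntaxhighlight>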

=== Method of moments ===
One of the statistical approaches for unsupervised learning is the [[Method of moments (statistics)|method of moments]]. In the method of moments, the unknown parameters (of interest) in the model are related to the moments of one or more random variables, and thus, these unknown parameters can be estimated given the moments. The moments are usually estimated from samples empirically. The basic moments are first and second order moments. For a random vector, the first order moment is the [[mean]] vector, and the second order moment is the [[covariance matrix]] (when the mean is zero). Higher order moments are usually represented using [[tensors]], which are the generalization of matrices to higher orders as multi-dimensional arrays.
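A small sketch of the basic moments: the empirical mean vector and covariance matrix of a random vector, estimated from samples drawn from a made-up distribution.

<syntaxhighlight lang="python">
# First and second order moments estimated empirically from samples of a
# random vector; the toy distribution is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)
true_mean = np.array([1.0, -2.0])
true_cov = np.array([[1.0, 0.6],
                     [0.6, 2.0]])
X = rng.multivariate_normal(true_mean, true_cov, size=5000)   # samples, shape (n, d)

mean_hat = X.mean(axis=0)                       # first order moment: mean vector
centered = X - mean_hat
cov_hat = centered.T @ centered / (len(X) - 1)  # second order moment: covariance matrix

print(np.round(mean_hat, 2))
print(np.round(cov_hat, 2))
</syntaxhighlight>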

In particular, the method of moments is shown to be effective in learning the parameters of [[latent variable model]]s. Latent variable models are statistical models where in addition to the observed variables, a set of latent variables also exists which is not observed. A highly practical example of latent variable models in machine learning is [[topic modeling]], which is a statistical model for generating the words (observed variables) in the document based on the topic (latent variable) of the document. In the topic modeling, the words in the document are generated according to different statistical parameters when the topic of the document is changed. It has been shown that the method of moments (tensor decomposition techniques) consistently recovers the parameters of a large class of latent variable models under some assumptions.<ref name="TensorLVMs" />

The [[Expectation–maximization algorithm]] (EM) is also one of the most practical methods for learning latent variable models. However, it can get stuck in local optima, and it is not guaranteed that the algorithm will converge to the true unknown parameters of the model. In contrast, for the method of moments, the global convergence is guaranteed under some conditions.
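A minimal EM sketch for a two-component one-dimensional Gaussian mixture, the kind of latent variable model discussed above; the data and initial guesses are assumptions, and a poor initialization may converge to a different local optimum, as noted above.

<syntaxhighlight lang="python">
# Minimal EM sketch for a two-component 1-D Gaussian mixture (a latent
# variable model). Data and initial parameter guesses are illustrative
# assumptions; a poor initialization may land in a local optimum.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.7, 400), rng.normal(3, 1.2, 600)])

# initial guesses for mixing weights, means and variances
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

def gauss(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

for _ in range(100):
    # E-step: posterior responsibility of each component for each point
    resp = pi[None, :] * gauss(x[:, None], mu[None, :], var[None, :])
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the responsibilities
    nk = resp.sum(axis=0)
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / nk

print(np.round(mu, 2), np.round(np.sqrt(var), 2), np.round(pi, 2))
</syntaxhighlight>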

== See also ==
* [[Automated machine learning]]
* [[Cluster analysis]]
* [[Anomaly detection]]
* [[Expectation–maximization algorithm]]
* [[Generative topographic map]]
* [[Meta-learning (computer science)]]
* [[Multivariate analysis]]
* [[Radial basis function network]]
* [[Weak supervision]]

== References ==
{{Reflist|
refs=
<ref name="Hinton99a" >{{cite book |first1=Geoffrey |last1=Hinton |first2=Terrence |last2=Sejnowski |title=Unsupervised Learning: Foundations of Neural Computation |publisher= MIT Press |year=1999 |isbn=978-0262581684 }}</ref>
<ref name="tds-ul" >{{Cite web|url=https://towardsdatascience.com/unsupervised-machine-learning-clustering-analysis-d40f2b34ae7e|title=Unsupervised Machine Learning: Clustering Analysis|last=Roman|first=Victor|date=2019-04-21|website=Medium|access-date=2019-10-01}}</ref>
<ref name="JordanBishop2004">{{cite book |first1=Michael I. |last1=Jordan |first2=Christopher M. |last2=Bishop |chapter=Neural Networks |editor=Allen B. Tucker |title=Computer Science Handbook, Second Edition (Section VII: Intelligent Systems) |location=Boca Raton, Florida |publisher=Chapman & Hall/CRC Press LLC |year=2004 |isbn=1-58488-360-X }}</ref>
<ref name="Hastie" >{{cite book |first1=Trevor |last1=Hastie |first2=Robert |last2=Tibshirani |first3=Jerome |last3=Friedman |title=The Elements of Statistical Learning: Data mining, Inference, and Prediction |date=2009 |publisher=Springer |location=New York |isbn=978-0-387-84857-0 |pages=485–586}}</ref>
<ref name="tds-kmeans" >{{Cite web|url=https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1|title=Understanding K-means Clustering in Machine Learning|last=Garbade|first=Dr Michael J.|date=2018-09-12|website=Medium|language=en|access-date=2019-10-31}}</ref>
<ref name="TensorLVMs" >{{cite journal |last1=Anandkumar |first1=Animashree |last2=Ge |first2=Rong |last3=Hsu |first3=Daniel |last4=Kakade |first4=Sham |first5= Matus |last5=Telgarsky |date=2014 |title=Tensor Decompositions for Learning Latent Variable Models |url=http://www.jmlr.org/papers/volume15/anandkumar14b/anandkumar14b.pdf |journal=Journal of Machine Learning Research |volume=15 |pages=2773–2832|bibcode=2012arXiv1210.7559A |arxiv=1210.7559 }}</ref>
<ref name="Buhmann" >{{Cite book|last1=Buhmann|first1=J.|last2=Kuhnel|first2=H.|title= &#91;Proceedings 1992&#93; IJCNN International Joint Conference on Neural Networks|volume=4|pages=796–801|publisher=IEEE|doi=10.1109/ijcnn.1992.227220|isbn=0780305590|chapter=Unsupervised and supervised data clustering with competitive neural networks|year=1992|s2cid=62651220}}</ref>
<ref name="Comesana" >{{Cite journal|last1=Comesaña-Campos|first1=Alberto|last2=Bouza-Rodríguez|first2=José Benito|date=June 2016|title=An application of Hebbian learning in the design process decision-making|journal=Journal of Intelligent Manufacturing|volume=27|issue=3|pages=487–506|doi=10.1007/s10845-014-0881-z|s2cid=207171436|issn=0956-5515|url=https://www.semanticscholar.org/paper/4059b77be03fea077350c106e6e9aa9fce23e8c7}}</ref>
<ref name="Carpenter" >{{cite journal|author1=Carpenter, G.A. |author2=Grossberg, S. |name-list-style=amp |year=1988|title=The ART of adaptive pattern recognition by a self-organizing neural network|journal= Computer|volume=21|issue=3 |pages=77–88|url=http://www.cns.bu.edu/Profiles/Grossberg/CarGro1988Computer.pdf|doi=10.1109/2.33|s2cid=14625094 }}</ref>
<ref name="Hinton2010" >{{cite news | author = Hinton, G | date=2010-08-02 | title = A Practical Guide to Training Restricted Boltzmann Machines }}</ref>
<ref name="HintonMlss2009" >{{cite web |people=Hinton, Geoffrey |date=September 2009 |title=Deep Belief Nets |type=video |url=https://videolectures.net/mlss09uk_hinton_dbn }}</ref>
}}

== Further reading ==
* {{cite book |editor1=Bousquet, O. |editor3=Raetsch, G. |editor2=von Luxburg, U. |title=Advanced Lectures on Machine Learning |url=https://archive.org/details/springer_10.1007-b100712 |publisher=Springer-Verlag |year=2004 |isbn=978-3540231226}}
* {{cite book |author1=Duda, Richard O. |author2-link=Peter E. Hart |author2=Hart, Peter E. |author3=Stork, David G. |year=2001 |chapter=Unsupervised Learning and Clustering |title=Pattern classification |edition=2nd |publisher=Wiley |isbn=0-471-05669-3|author1-link=Richard O. Duda |title-link=Pattern classification }}
*{{cite book |first1=Trevor |last1=Hastie |first2=Robert |last2=Tibshirani |title=The Elements of Statistical Learning: Data mining, Inference, and Prediction |year=2009 |publisher=Springer| location=New York |isbn=978-0-387-84857-0 |pages=485–586 |doi=10.1007/978-0-387-84858-7_14}}
* {{cite book |editor1=Hinton, Geoffrey |editor-link=Geoffrey Hinton |editor2=Sejnowski, Terrence J. |editor2-link=Terrence J. Sejnowski |year=1999 |title=Unsupervised Learning: Foundations of Neural Computation |publisher=[[MIT Press]] |isbn=0-262-58168-X}}

{{Differentiable computing}}
{{Authority control}}

{{DEFAULTSORT:Unsupervised Learning}}
[[Category:Unsupervised learning| ]]
[[Category:Machine learning]]



<noinclude>

<small>This page was moved from [[wikipedia:en:Unsupervised learning]]. Its edit history can be viewed at [[无监督学习/edithistory]]</small></noinclude>

[[Category:待整理页面]]