Convolutional neural network


In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of artificial neural network (ANN), most commonly applied to analyze visual imagery.[1] CNNs are also known as Shift Invariant or Space Invariant Artificial Neural Networks (SIANN), based on the shared-weight architecture of the convolution kernels or filters that slide along input features and provide translation-equivariant responses known as feature maps.[2][3] Counter-intuitively, most convolutional neural networks are not invariant to translation, due to the downsampling operation they apply to the input.[4] They have applications in image and video recognition, recommender systems,[5] image classification, image segmentation, medical image analysis, natural language processing,[6] brain–computer interfaces,[7] and financial time series.[8]




CNNs are regularized versions of multilayer perceptrons. Multilayer perceptrons are usually fully connected networks, that is, each neuron in one layer is connected to all neurons in the next layer. The "full connectivity" of these networks makes them prone to overfitting data. Typical ways of regularization, or preventing overfitting, include penalizing parameters during training (such as weight decay) or trimming connectivity (skipped connections, dropout, etc.). CNNs take a different approach towards regularization: they take advantage of the hierarchical pattern in data and assemble patterns of increasing complexity using smaller and simpler patterns embossed in their filters. Therefore, on a scale of connectivity and complexity, CNNs are on the lower extreme.


Convolutional networks were inspired by biological processes[9][10][11][12] in that the connectivity pattern between neurons resembles the organization of the animal visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field.


CNNs use relatively little pre-processing compared to other image classification algorithms. This means that the network learns to optimize the filters (or kernels) through automated learning, whereas in traditional algorithms these filters are hand-engineered. This independence from prior knowledge and human intervention in feature extraction is a major advantage.


Definition

Convolutional neural networks are a specialized type of artificial neural networks that use a mathematical operation called convolution in place of general matrix multiplication in at least one of their layers.[13] They are specifically designed to process pixel data and are used in image recognition and processing.


Architecture

File: Comparison image neural networks.svg
Comparison of the LeNet and AlexNet convolution, pooling and dense layers.
(The AlexNet input size should be 227x227x3 rather than 224x224x3 for the arithmetic to work out. The original paper gives 224x224x3, but Andrej Karpathy, the head of computer vision at Tesla, said it should be 227x227x3 and noted that Alex did not explain why he wrote 224x224x3. The next convolution should be 11x11 with stride 4, giving 55x55x96 instead of 54x54x96. The output width is calculated as [(input width 227 - kernel width 11) / stride 4] + 1 = [(227 - 11) / 4] + 1 = 55. Because the output height equals the output width, the feature map is 55x55.)
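The arithmetic in the caption can be checked with a short sketch. The function name `conv_output_size` and the `padding` parameter are illustrative assumptions; the caption itself uses no padding.

```python
def conv_output_size(input_size: int, kernel_size: int,
                     stride: int = 1, padding: int = 0) -> int:
    """Output length of a convolution along one spatial dimension.

    Uses floor division, so inputs that the stride does not divide
    evenly lose the leftover border positions.
    """
    return (input_size + 2 * padding - kernel_size) // stride + 1


# 227-wide input, 11-wide kernel, stride 4 (the corrected AlexNet numbers)
corrected = conv_output_size(227, 11, stride=4)

# 224-wide input: (224 - 11) / 4 is not an integer, so the result is floored
as_published = conv_output_size(224, 11, stride=4)

print(corrected, as_published)  # 55 54
```

With 224 the stride does not divide the remaining width evenly, which is why 227 is the size that makes the numbers consistent.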

A convolutional neural network consists of an input layer, hidden layers and an output layer. In any feed-forward neural network, any middle layers are called hidden because their inputs and outputs are masked by the activation function and final convolution. In a convolutional neural network, the hidden layers include layers that perform convolutions. Typically this includes a layer that performs a dot product of the convolution kernel with the layer's input matrix. This product is usually the Frobenius inner product, and its activation function is commonly ReLU. As the convolution kernel slides along the input matrix for the layer, the convolution operation generates a feature map, which in turn contributes to the input of the next layer. This is followed by other layers such as pooling layers, fully connected layers, and normalization layers.
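As a rough illustration of the sliding-kernel step described above, the following pure-Python sketch computes one feature map from a single-channel input. It is a minimal example under stated assumptions, not a reference implementation: the names `conv2d_valid` and `relu` are hypothetical, no padding is applied, and the kernel is not flipped (the cross-correlation form that deep learning libraries commonly call convolution).

```python
def relu(x):
    """Rectified linear unit: negative responses are clipped to zero."""
    return max(0.0, x)


def conv2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image` (lists of lists) without padding.

    At each position the Frobenius inner product (elementwise multiply,
    then sum) of the kernel and the covered patch is passed through ReLU.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            for a in range(kh):
                for b in range(kw):
                    total += image[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(relu(total))
        feature_map.append(row)
    return feature_map


# A 4x4 image with a vertical edge between columns 1 and 2
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked 2x2 kernel that responds to dark-to-bright vertical edges
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
```

The feature map is non-zero only in the column where the kernel straddles the edge, which is the sense in which a filter "detects" a pattern.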


Convolutional layers

In a CNN, the input is a tensor with a shape: (number of inputs) x (input height) x (input width) x (input channels). After passing through a convolutional layer, the image becomes abstracted to a feature map, also called an activation map, with shape: (number of inputs) x (feature map height) x (feature map width) x (feature map channels).
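The shape bookkeeping can be sketched as follows. The helper `feature_map_shape` is hypothetical, it assumes square kernels and the (inputs, height, width, channels) ordering used in the text, and the example numbers (a 32x32 single-channel image, six 5x5 filters) are chosen for illustration only.

```python
def feature_map_shape(input_shape, kernel_size, num_filters, stride=1, padding=0):
    """Shape of the feature map produced by one convolutional layer.

    `input_shape` is (number of inputs, height, width, channels).
    The number of output channels equals the number of filters.
    """
    n, height, width, _channels = input_shape
    out_h = (height + 2 * padding - kernel_size) // stride + 1
    out_w = (width + 2 * padding - kernel_size) // stride + 1
    return (n, out_h, out_w, num_filters)


shape = feature_map_shape((1, 32, 32, 1), kernel_size=5, num_filters=6)
print(shape)  # (1, 28, 28, 6)
```

The number of inputs passes through unchanged, the spatial dimensions shrink according to kernel size, stride and padding, and the channel dimension is replaced by the number of filters.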


Convolutional layers convolve the input and pass its result to the next layer. This is similar to the response of a neuron in the visual cortex to a specific stimulus.[14] Each convolutional neuron processes data only for its receptive field. Although fully connected feedforward neural networks can be used to learn features and classify data, this architecture is generally impractical for larger inputs such as high-resolution images. It would require a very high number of neurons, even in a shallow architecture, due to the large input size of images, where each pixel is a relevant input feature. For instance, a fully connected layer for a (small) image of size 100 x 100 has 10,000 weights for each neuron in the second layer. Instead, convolution reduces the number of free parameters, allowing the network to be deeper.[15] For example, regardless of image size, using a 5 x 5 tiling region, each with the same shared weights, requires only 25 learnable parameters. Using regularized weights over fewer parameters avoids the vanishing gradients and exploding gradients problems seen during backpropagation in traditional neural networks.[16][17] Furthermore, convolutional neural networks are ideal for data with a grid-like topology (such as images) as spatial relations between separate features are taken into account during convolution and/or pooling.
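The parameter counts quoted above can be reproduced with a few lines of arithmetic. The function names are illustrative, and bias terms are left out of the counts, as they are in the text.

```python
def dense_weights_per_neuron(height, width, channels=1):
    """Weights needed by ONE fully connected neuron that sees the whole image."""
    return height * width * channels


def conv_weights_per_filter(kernel_h, kernel_w, in_channels=1):
    """Weights in ONE shared convolution filter, independent of image size."""
    return kernel_h * kernel_w * in_channels


small_image = dense_weights_per_neuron(100, 100)
large_image = dense_weights_per_neuron(1000, 1000)
shared_filter = conv_weights_per_filter(5, 5)

print(small_image, large_image, shared_filter)  # 10000 1000000 25
```

Growing the image from 100 x 100 to 1000 x 1000 multiplies the per-neuron weight count of the fully connected layer by 100, while the shared 5 x 5 filter stays at 25 weights.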


Pooling layers

Convolutional networks may include local and/or global pooling layers along with traditional convolutional layers. Pooling layers reduce the dimensions of data by combining the outputs of neuron clusters at one layer into a single neuron in the next layer. Local pooling combines small clusters; tiling sizes such as 2 x 2 are commonly used. Global pooling acts on all the neurons of the feature map.[18][19] There are two common types of pooling in popular use: max and average. Max pooling uses the maximum value of each local cluster of neurons in the feature map,[20][21] while average pooling takes the average value.
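A minimal sketch of local and global pooling on a single feature map is given below. The names `pool2d` and `global_pool` are hypothetical, and the sketch assumes non-overlapping tiles (stride equal to the tile size), which is one common choice among several.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Local pooling over non-overlapping size x size tiles."""
    if mode not in ("max", "average"):
        raise ValueError("mode must be 'max' or 'average'")
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[i + a][j + b]
                      for a in range(size) for b in range(size)]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


def global_pool(feature_map, mode="max"):
    """Global pooling: reduce the whole feature map to a single value."""
    values = [v for row in feature_map for v in row]
    return max(values) if mode == "max" else sum(values) / len(values)


fm = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 1, 7, 8],
]

print(pool2d(fm, 2, "max"))       # [[4, 2], [1, 8]]
print(pool2d(fm, 2, "average"))   # [[2.5, 1.0], [0.5, 6.5]]
print(global_pool(fm, "max"))     # 8
print(global_pool(fm, "average")) # 2.625
```

Each 2 x 2 tile of the 4 x 4 map collapses to one value, halving both spatial dimensions, while global pooling reduces the map to a single number.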


Fully connected layers


Fully connected layers connect every neuron in one layer to every neuron in another layer, as in a traditional multilayer perceptron neural network (MLP). The flattened feature maps go through a fully connected layer to classify the images.
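The flatten-then-classify step might look like the following sketch. All names, weights and biases here are made up for illustration; in a real network the weights would be learned, and a softmax would typically follow the scores.

```python
def flatten(feature_maps):
    """Turn a list of 2-D feature maps into one flat vector."""
    return [v for fm in feature_maps for row in fm for v in row]


def fully_connected(inputs, weights, biases):
    """Each output neuron takes a weighted sum over ALL inputs, plus a bias."""
    return [sum(w * x for w, x in zip(neuron_weights, inputs)) + b
            for neuron_weights, b in zip(weights, biases)]


# One 2x2 feature map (illustrative values)
feature_maps = [[[1.0, 0.0],
                 [0.0, 1.0]]]
x = flatten(feature_maps)

# Two output neurons, i.e. two classes; weights are hand-picked, not learned
weights = [
    [1.0, 0.0, 0.0, 1.0],  # class 0 responds to the main diagonal
    [0.0, 1.0, 1.0, 0.0],  # class 1 responds to the anti-diagonal
]
biases = [0.0, 0.5]

scores = fully_connected(x, weights, biases)
predicted_class = scores.index(max(scores))

print(x)                # [1.0, 0.0, 0.0, 1.0]
print(scores)           # [2.0, 0.5]
print(predicted_class)  # 0
```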


Receptive field

In neural networks, each neuron receives input from some number of locations in the previous layer. In a convolutional layer, each neuron receives input from only a restricted area of the previous layer called the neuron's receptive field. Typically the area is a square (e.g. 5 by 5 neurons). In a fully connected layer, by contrast, the receptive field is the entire previous layer. As convolutional layers are stacked, each neuron in a deeper layer takes input from a larger area of the original input than neurons in earlier layers. This is due to applying the convolution over and over, which takes into account the value of a pixel as well as its surrounding pixels. When using dilated layers, the number of pixels in the receptive field remains constant, but the field is more sparsely populated as its dimensions grow when combining the effect of several layers.
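The growth of the receptive field across stacked layers can be sketched with a standard recurrence: each layer widens the field by (kernel size - 1) x dilation x the product of all earlier strides. The function name and the layer tuples are assumptions made for this example, and the result is the width along one dimension.

```python
def receptive_field(layers):
    """Receptive field width (in input pixels) of one neuron after `layers`.

    `layers` is a list of (kernel_size, stride, dilation) tuples,
    ordered from the input towards the output.
    """
    field = 1  # a single input pixel sees only itself
    jump = 1   # distance in input pixels between adjacent neurons
    for kernel_size, stride, dilation in layers:
        field += (kernel_size - 1) * dilation * jump
        jump *= stride
    return field


# Three ordinary 3-wide convolutions, stride 1
plain = receptive_field([(3, 1, 1), (3, 1, 1), (3, 1, 1)])

# Three 3-wide convolutions with dilation 1, 2, 4
dilated = receptive_field([(3, 1, 1), (3, 1, 2), (3, 1, 4)])

print(plain, dilated)  # 7 15
```

Both stacks read the same number of values per neuron in each layer (three per dimension), but the dilated stack spreads them over a wider span of the input.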


Weights

Each neuron in a neural network computes an output value by applying a specific function to the input values received from the receptive field in the previous layer. The function that is applied to the input values is determined by a vector of weights and a bias (typically real numbers). Learning consists of iteratively adjusting these biases and weights.


The vector of weights and the bias are called filters and represent particular features of the input (e.g., a particular shape). A distinguishing feature of CNNs is that many neurons can share the same filter. This reduces the memory footprint because a single bias and a single vector of weights are used across all receptive fields that share that filter, as opposed to each receptive field having its own bias and vector weighting.[22]


History


CNNs are often compared to the way the brain achieves vision processing in living organisms.[23][citation needed]


Receptive fields in the visual cortex

Work by Hubel and Wiesel in the 1950s and 1960s showed that cat visual cortices contain neurons that individually respond to small regions of the visual field. Provided the eyes are not moving, the region of visual space within which visual stimuli affect the firing of a single neuron is known as its receptive field.[24] Neighboring cells have similar and overlapping receptive fields.[citation needed] Receptive field size and location vary systematically across the cortex to form a complete map of visual space.[citation needed] The cortex in each hemisphere represents the contralateral visual field.[citation needed]


Their 1968 paper identified two basic visual cell types in the brain:[10]


  • simple cells, whose output is maximized by straight edges having particular orientations within their receptive field
  • complex cells, which have larger receptive fields, whose output is insensitive to the exact position of the edges in the field.

Hubel and Wiesel also proposed a cascading model of these two types of cells for use in pattern recognition tasks.[25][24]


Neocognitron, origin of the CNN architecture


The "neocognitron"[9] was introduced by Kunihiko Fukushima in 1980.[11][21][26] It was inspired by the above-mentioned work of Hubel and Wiesel. The neocognitron introduced the two basic types of layers in CNNs: convolutional layers, and downsampling layers. A convolutional layer contains units whose receptive fields cover a patch of the previous layer. The weight vector (the set of adaptive parameters) of such a unit is often called a filter. Units can share filters. Downsampling layers contain units whose receptive fields cover patches of previous convolutional layers. Such a unit typically computes the average of the activations of the units in its patch. This downsampling helps to correctly classify objects in visual scenes even when the objects are shifted.


In a variant of the neocognitron called the cresceptron, instead of using Fukushima's spatial averaging, J. Weng et al. introduced a method called max-pooling where a downsampling unit computes the maximum of the activations of the units in its patch.[27] Max-pooling is often used in modern CNNs.[28]


Several supervised and unsupervised learning algorithms have been proposed over the decades to train the weights of a neocognitron.[9] Today, however, the CNN architecture is usually trained through backpropagation.


The neocognitron was the first CNN to require units located at multiple network positions to have shared weights.


Convolutional neural networks were presented at the Neural Information Processing Workshop in 1987. They automatically analyzed time-varying signals by replacing learned multiplication with convolution in time, and were demonstrated for speech recognition.[29]


Time delay neural networks

The time delay neural network (TDNN) was introduced in 1987 by Alex Waibel et al. and was one of the first convolutional networks, as it achieved shift invariance.[30] It did so by utilizing weight sharing in combination with backpropagation training.[31] Thus, while also using a pyramidal structure as in the neocognitron, it performed a global optimization of the weights instead of a local one.[30]


TDNNs are convolutional networks that share weights along the temporal dimension.[32] They allow speech signals to be processed time-invariantly. In 1990 Hampshire and Waibel introduced a variant which performs a two dimensional convolution.[33] Since these TDNNs operated on spectrograms, the resulting phoneme recognition system was invariant to both shifts in time and in frequency. This inspired translation invariance in image processing with CNNs.[31] The tiling of neuron outputs can cover timed stages.[34]


TDNNs now achieve the best performance in far distance speech recognition.[35]


Max pooling

In 1990 Yamaguchi et al. introduced the concept of max pooling, which is a fixed filtering operation that calculates and propagates the maximum value of a given region. They did so by combining TDNNs with max pooling in order to realize a speaker independent isolated word recognition system.[20] In their system they used several TDNNs per word, one for each syllable. The results of each TDNN over the input signal were combined using max pooling and the outputs of the pooling layers were then passed on to networks performing the actual word classification.


Image recognition with CNNs trained by gradient descent

A system to recognize hand-written ZIP Code numbers[36] involved convolutions in which the kernel coefficients had been laboriously hand designed.[37]


Yann LeCun et al. (1989)[37] used back-propagation to learn the convolution kernel coefficients directly from images of hand-written numbers. Learning was thus fully automatic, performed better than manual coefficient design, and was suited to a broader range of image recognition problems and image types.


This approach became a foundation of modern computer vision.


LeNet-5

LeNet-5, a pioneering 7-level convolutional network by LeCun et al. in 1998[38] that classifies digits, was applied by several banks to recognize hand-written numbers on checks digitized in 32x32 pixel images. Processing higher-resolution images requires larger convolutional networks with more layers, so this technique is constrained by the availability of computing resources.



Shift-invariant neural network


Similarly, a shift-invariant neural network was proposed by W. Zhang et al. for image character recognition in 1988.[2][3] The architecture and training algorithm were modified in 1991[39] and applied to medical image processing[40] and automatic detection of breast cancer in mammograms.[41]


A different convolution-based design was proposed in 1988[42] for application to decomposition of one-dimensional electromyography convolved signals via de-convolution. This design was modified in 1989 to other de-convolution-based designs.[43][44]


Neural abstraction pyramid

[Figure: Neural abstraction pyramid]

The feed-forward architecture of convolutional neural networks was extended in the neural abstraction pyramid[45] by lateral and feedback connections. The resulting recurrent convolutional network allows for the flexible incorporation of contextual information to iteratively resolve local ambiguities. In contrast to previous models, image-like outputs at the highest resolution were generated, e.g., for semantic segmentation, image reconstruction, and object localization tasks.


GPU implementations

Although CNNs were invented in the 1980s, their breakthrough in the 2000s required fast implementations on graphics processing units (GPUs).


In 2004, it was shown by K. S. Oh and K. Jung that standard neural networks can be greatly accelerated on GPUs. Their implementation was 20 times faster than an equivalent implementation on CPU.[46][28] In 2005, another paper also emphasised the value of GPGPU for machine learning.[47]


The first GPU-implementation of a CNN was described in 2006 by K. Chellapilla et al. Their implementation was 4 times faster than an equivalent implementation on CPU.[48] Subsequent work also used GPUs, initially for other types of neural networks (different from CNNs), especially unsupervised neural networks.[49][50][51][52]


In 2010, Dan Ciresan et al. at IDSIA showed that even deep standard neural networks with many layers can be quickly trained on GPU by supervised learning through the old method known as backpropagation. Their network outperformed previous machine learning methods on the MNIST handwritten digits benchmark.[53] In 2011, they extended this GPU approach to CNNs, achieving an acceleration factor of 60.[18] In the same year, they used such CNNs on GPU to win an image recognition contest in which they achieved superhuman performance for the first time.[54] Between May 15, 2011 and September 30, 2012, their CNNs won no fewer than four image competitions.[55][28] In 2012, they also significantly improved on the best performance in the literature for multiple image databases, including the MNIST database, the NORB database, the HWDB1.0 dataset (Chinese characters) and the CIFAR10 dataset (60,000 labeled 32x32 RGB images).[21]


Subsequently, a similar GPU-based CNN by Alex Krizhevsky et al. won the ImageNet Large Scale Visual Recognition Challenge 2012.[56] A very deep CNN with over 100 layers by Microsoft won the ImageNet 2015 contest.[57]


Intel Xeon Phi implementations

Compared to the training of CNNs using GPUs, not much attention was given to the Intel Xeon Phi coprocessor.[58] A notable development is a parallelization method for training convolutional neural networks on the Intel Xeon Phi, named Controlled Hogwild with Arbitrary Order of Synchronization (CHAOS).[59] CHAOS exploits both the thread- and SIMD-level parallelism that is available on the Intel Xeon Phi.


Distinguishing features

In the past, traditional multilayer perceptron (MLP) models were used for image recognition. However, the full connectivity between nodes caused the curse of dimensionality and was computationally intractable with higher-resolution images. A 1000×1000-pixel image with RGB color channels has 3 million weights per fully-connected neuron, which is too many to process efficiently at scale.

[Figure: CNN layers arranged in 3 dimensions]

For example, in CIFAR-10, images are only of size 32×32×3 (32 wide, 32 high, 3 color channels), so a single fully connected neuron in the first hidden layer of a regular neural network would have 32*32*3 = 3,072 weights. A 200×200 image, however, would lead to neurons that have 200*200*3 = 120,000 weights.
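The weight counts quoted above follow from multiplying width, height, and channel count. A minimal sketch of that arithmetic is given here; the helper name `fc_weights_per_neuron` is made up for illustration and does not come from any library.

```python
def fc_weights_per_neuron(width: int, height: int, channels: int) -> int:
    """Weights needed by ONE fully connected neuron that sees the whole image.

    Every input value gets its own weight, so the count is simply the
    number of input values (bias not included).
    """
    return width * height * channels


# Image sizes taken from the text: CIFAR-10, a 200x200 image, a 1000x1000 image.
for w, h in [(32, 32), (200, 200), (1000, 1000)]:
    n = fc_weights_per_neuron(w, h, 3)
    print(f"{w}x{h}x3 image: {n:,} weights per fully connected neuron")
```

The count grows with the number of pixels, and a hidden layer multiplies it again by the number of neurons in that layer.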


Also, such network architecture does not take into account the spatial structure of data, treating input pixels which are far apart in the same way as pixels that are close together. This ignores locality of reference in data with a grid-topology (such as images), both computationally and semantically. Thus, full connectivity of neurons is wasteful for purposes such as image recognition that are dominated by spatially local input patterns.


Convolutional neural networks are variants of multilayer perceptrons, designed to emulate the behavior of a visual cortex. These models mitigate the challenges posed by the MLP architecture by exploiting the strong spatially local correlation present in natural images. As opposed to MLPs, CNNs have the following distinguishing features:

  • 3D volumes of neurons: The layers of a CNN have neurons arranged in 3 dimensions: width, height and depth.[60] Each neuron inside a convolutional layer is connected to only a small region of the layer before it, called a receptive field. Distinct types of layers, both locally and completely connected, are stacked to form a CNN architecture.
  • Local connectivity: following the concept of receptive fields, CNNs exploit spatial locality by enforcing a local connectivity pattern between neurons of adjacent layers. The architecture thus ensures that the learned "filters" produce the strongest response to a spatially local input pattern. Stacking many such layers leads to nonlinear filters that become increasingly global (i.e. responsive to a larger region of pixel space) so that the network first creates representations of small parts of the input, then from them assembles representations of larger areas.
  • Shared weights: In CNNs, each filter is replicated across the entire visual field. These replicated units share the same parameterization (weight vector and bias) and form a feature map. This means that all the neurons in a given feature map respond to the same feature within their specific receptive field. Replicating units in this way makes the resulting activation map equivariant under shifts of the locations of input features in the visual field, i.e. it grants translational equivariance, provided that the layer has a stride of one.[61]
  • Pooling: In a CNN's pooling layers, feature maps are divided into rectangular sub-regions, and the features in each rectangle are independently down-sampled to a single value, commonly by taking their average or maximum value. In addition to reducing the sizes of feature maps, the pooling operation grants a degree of local translational invariance to the features contained therein, allowing the CNN to be more robust to variations in their positions.[4]


Together, these properties allow CNNs to achieve better generalization on vision problems. Weight sharing dramatically reduces the number of free parameters learned, thus lowering the memory requirements for running the network and allowing the training of larger, more powerful networks.
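To make the effect of weight sharing on the number of free parameters concrete, the sketch below counts parameters for one layer with and without sharing. The layer sizes (a 32×32×3 input, 5×5 filters, 64 feature maps, and an output that keeps the 32×32 spatial size) are assumptions chosen only for illustration, and `conv_layer_params` is a hypothetical helper.

```python
def conv_layer_params(k: int, in_channels: int, n_filters: int,
                      out_h: int, out_w: int, shared: bool = True) -> int:
    """Count free parameters of a layer whose neurons each see a k x k x in_channels patch.

    shared=True  : one weight set (plus bias) per filter, reused at every position.
    shared=False : every output neuron has its own weight set (no weight sharing).
    """
    per_neuron = k * k * in_channels + 1  # weights of one patch plus one bias
    if shared:
        return n_filters * per_neuron
    return out_h * out_w * n_filters * per_neuron


# Assumed example layer: 32x32x3 input, 5x5 filters, 64 feature maps, 32x32 output.
with_sharing = conv_layer_params(5, 3, 64, 32, 32, shared=True)
without_sharing = conv_layer_params(5, 3, 64, 32, 32, shared=False)
print(f"shared weights:   {with_sharing:,}")     # 4,864
print(f"unshared weights: {without_sharing:,}")  # 4,980,736
```

Under these assumptions, sharing cuts the parameter count by a factor equal to the number of spatial positions (32 × 32 = 1024).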


Building blocks


A CNN architecture is formed by a stack of distinct layers that transform the input volume into an output volume (e.g. holding the class scores) through a differentiable function. A few distinct types of layers are commonly used. These are further discussed below.


Convolutional layer

The convolutional layer is the core building block of a CNN. The layer's parameters consist of a set of learnable filters (or kernels), which have a small receptive field, but extend through the full depth of the input volume. During the forward pass, each filter is convolved across the width and height of the input volume, computing the dot product between the filter entries and the input, producing a 2-dimensional activation map of that filter. As a result, the network learns filters that activate when it detects some specific type of feature at some spatial position in the input.[62][nb 1]


Stacking the activation maps for all filters along the depth dimension forms the full output volume of the convolution layer. Every entry in the output volume can thus also be interpreted as an output of a neuron that looks at a small region in the input and shares parameters with neurons in the same activation map.
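The forward pass described above can be sketched with plain loops in NumPy. This is an illustrative, unoptimized version, not how any particular framework implements it. The function name, argument layout (height, width, channels) and the use of cross-correlation without flipping the kernel are assumptions of this sketch.

```python
import numpy as np


def conv2d_forward(x, filters, biases, stride=1, pad=0):
    """Naive convolutional layer forward pass.

    x       : input volume, shape (H, W, C)
    filters : shape (F, K, K, C); each filter spans the full input depth C
    biases  : shape (F,)
    Returns an output volume of shape (H_out, W_out, F), i.e. one
    2-dimensional activation map per filter, stacked along depth.
    """
    H, W, C = x.shape
    F, K, _, C_f = filters.shape
    assert C == C_f, "filters must extend through the full input depth"

    # Zero-pad only the spatial dimensions, never the depth.
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    H_out = (H - K + 2 * pad) // stride + 1
    W_out = (W - K + 2 * pad) // stride + 1

    out = np.zeros((H_out, W_out, F))
    for f in range(F):
        for i in range(H_out):
            for j in range(W_out):
                r, c = i * stride, j * stride
                patch = xp[r:r + K, c:c + K, :]  # receptive field of this neuron
                out[i, j, f] = np.sum(patch * filters[f]) + biases[f]
    return out


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 3))           # small stand-in for an RGB image
filters = rng.normal(size=(4, 3, 3, 3))  # four 3x3 filters over 3 channels
out = conv2d_forward(x, filters, np.zeros(4))
print(out.shape)  # (6, 6, 4)
```

Each entry `out[i, j, f]` is the output of one neuron: it looks at a 3×3×3 region of the input and shares its weights with every other entry in activation map `f`.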


Local connectivity

[Figure: Typical CNN architecture]

When dealing with high-dimensional inputs such as images, it is impractical to connect neurons to all neurons in the previous volume because such a network architecture does not take the spatial structure of the data into account. Convolutional networks exploit spatially local correlation by enforcing a sparse local connectivity pattern between neurons of adjacent layers: each neuron is connected to only a small region of the input volume.


The extent of this connectivity is a hyperparameter called the receptive field of the neuron. The connections are local in space (along width and height), but always extend along the entire depth of the input volume. Such an architecture ensures that the learnt filters produce the strongest response to a spatially local input pattern.


Spatial arrangement


Three hyperparameters control the size of the output volume of the convolutional layer: the depth, stride, and padding size:

  • The depth of the output volume controls the number of neurons in a layer that connect to the same region of the input volume. These neurons learn to activate for different features in the input. For example, if the first convolutional layer takes the raw image as input, then different neurons along the depth dimension may activate in the presence of various oriented edges, or blobs of color.
  • Stride controls how far the filter moves between neighboring output positions along the width and height. If the stride is 1, then we move the filters one pixel at a time. This leads to heavily overlapping receptive fields between the columns, and to large output volumes. For any integer [math]\displaystyle{ S > 0, }[/math] a stride S means that the filter is translated S units at a time per output. In practice, [math]\displaystyle{ S \geq 3 }[/math] is rare. A greater stride means smaller overlap of receptive fields and smaller spatial dimensions of the output volume.[63]
  • Sometimes, it is convenient to pad the input with zeros (or other values, such as the average of the region) on the border of the input volume. The size of this padding is a third hyperparameter. Padding provides control of the output volume's spatial size. In particular, it is sometimes desirable to exactly preserve the spatial size of the input volume; this is commonly referred to as "same" padding.


The spatial size of the output volume is a function of the input volume size [math]\displaystyle{ W }[/math], the kernel field size [math]\displaystyle{ K }[/math] of the convolutional layer neurons, the stride [math]\displaystyle{ S }[/math], and the amount of zero padding [math]\displaystyle{ P }[/math] on the border. The number of neurons that "fit" in a given volume is then:

[math]\displaystyle{ \frac{W-K+2P}{S} + 1. }[/math]


If this number is not an integer, then the strides are incorrect and the neurons cannot be tiled to fit across the input volume in a symmetric way. In general, setting zero padding to be [math]\displaystyle{ P = (K-1)/2 }[/math] when the stride is [math]\displaystyle{ S=1 }[/math] ensures that the input volume and output volume will have the same size spatially. However, it is not always completely necessary to use all of the neurons of the previous layer. For example, a neural network designer may decide to use just a portion of padding.
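The output-size formula and the integer check described above can be written as a small helper. This is a sketch; the function name `conv_output_size` and the choice to raise an error on a non-integer result are assumptions made for illustration.

```python
def conv_output_size(W: int, K: int, S: int = 1, P: int = 0) -> int:
    """Spatial output size (W - K + 2P) / S + 1 of a convolutional layer.

    W: input size, K: kernel size, S: stride, P: zero padding per border.
    Raises ValueError when the filter positions do not tile the input evenly.
    """
    span = W - K + 2 * P
    if span < 0:
        raise ValueError("kernel is larger than the padded input")
    if span % S != 0:
        raise ValueError(
            f"(W - K + 2P) = {span} is not divisible by stride {S}; "
            "the neurons cannot be tiled symmetrically"
        )
    return span // S + 1


print(conv_output_size(W=32, K=5, S=1, P=2))  # 32: P = (K-1)/2 preserves the size
print(conv_output_size(W=7, K=3, S=2, P=0))   # 3
```

With `W=10, K=3, S=2, P=0` the helper raises `ValueError`, because (10 − 3) / 2 is not an integer.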


Parameter sharing

A parameter sharing scheme is used in convolutional layers to control the number of free parameters. It relies on the assumption that if a patch feature is useful to compute at some spatial position, then it should also be useful to compute at other positions. Denoting a single 2-dimensional slice of depth as a depth slice, the neurons in each depth slice are constrained to use the same weights and bias.


Since all neurons in a single depth slice share the same parameters, the forward pass in each depth slice of the convolutional layer can be computed as a convolution of the neuron's weights with the input volume.[nb 2] Therefore, it is common to refer to the sets of weights as a filter (or a kernel), which is convolved with the input. The result of this convolution is an activation map, and the set of activation maps for each different filter are stacked together along the depth dimension to produce the output volume. Parameter sharing contributes to the translation invariance of the CNN architecture.[4]
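The effect of sharing one set of weights across all positions can be checked numerically: with a stride of 1, shifting the input shifts the activation map by the same amount. The sketch below assumes a single-channel input, a random 3×3 kernel, and a small feature placed away from the border so that the shift does not wrap around. `correlate2d_valid` is a hypothetical helper that, like most deep learning code, does not flip the kernel.

```python
import numpy as np


def correlate2d_valid(x, kernel):
    """Slide one shared kernel over a 2-D input (stride 1, no padding)."""
    K = kernel.shape[0]
    H_out, W_out = x.shape[0] - K + 1, x.shape[1] - K + 1
    out = np.zeros((H_out, W_out))
    for i in range(H_out):
        for j in range(W_out):
            # The same weights are used at every position (i, j).
            out[i, j] = np.sum(x[i:i + K, j:j + K] * kernel)
    return out


rng = np.random.default_rng(0)
kernel = rng.normal(size=(3, 3))

x = np.zeros((8, 8))
x[2:4, 2:4] = 1.0                        # a small feature in the interior
x_shifted = np.roll(x, shift=1, axis=1)  # same feature, one pixel to the right

out = correlate2d_valid(x, kernel)
out_shifted = correlate2d_valid(x_shifted, kernel)

# The activation map moves by the same one-pixel shift as the input.
print(np.allclose(out_shifted, np.roll(out, shift=1, axis=1)))  # True
```

The check only demonstrates equivariance of the activation map; the invariance to small shifts discussed elsewhere in the article additionally relies on pooling.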


Sometimes, the parameter sharing assumption may not make sense. This is especially the case when the input images to a CNN have some specific centered structure, for which we expect completely different features to be learned at different spatial locations. One practical example is when the inputs are faces that have been centered in the image: we might expect different eye-specific or hair-specific features to be learned in different parts of the image. In that case it is common to relax the parameter sharing scheme, and instead simply call the layer a "locally connected layer".

Pooling layer

Figure: Max pooling with a 2x2 filter and stride = 2

Another important concept of CNNs is pooling, which is a form of non-linear down-sampling. Several non-linear functions can implement pooling, of which max pooling is the most common. It partitions the input image into a set of rectangles and, for each such sub-region, outputs the maximum.

Intuitively, the exact location of a feature is less important than its rough location relative to other features. This is the idea behind the use of pooling in convolutional neural networks. The pooling layer serves to progressively reduce the spatial size of the representation, to reduce the number of parameters, memory footprint and amount of computation in the network, and hence to also control overfitting. This is known as down-sampling. It is common to periodically insert a pooling layer between successive convolutional layers (each one typically followed by an activation function, such as a ReLU layer) in a CNN architecture.[62]:460–461 While pooling layers contribute to local translation invariance, they do not provide global translation invariance in a CNN, unless a form of global pooling is used.[4][61] The pooling layer commonly operates independently on every depth slice of the input and resizes it spatially. A very common form of max pooling is a layer with filters of size 2×2, applied with a stride of 2, which subsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations: [math]\displaystyle{ f_{X,Y}(S)=\max_{a,b=0}^{1} S_{2X+a,2Y+b}. }[/math] In this case, every max operation is over 4 numbers. The depth dimension remains unchanged (this is true for other forms of pooling as well).
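
The 2×2, stride-2 case described above can be sketched in plain Python. This is only an illustrative sketch: the function names `max_pool_2x2` and `max_pool_volume` and the nested-list representation of a depth slice are assumptions made for the example, not part of any particular library.

```python
def max_pool_2x2(feature_map):
    """Apply 2x2 max pooling with stride 2 to one depth slice (a list of rows)."""
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for y in range(0, height - 1, 2):        # top row of each 2x2 window
        row = []
        for x in range(0, width - 1, 2):     # left column of each 2x2 window
            window = (
                feature_map[y][x], feature_map[y][x + 1],
                feature_map[y + 1][x], feature_map[y + 1][x + 1],
            )
            row.append(max(window))          # each max is taken over 4 numbers
        pooled.append(row)
    return pooled


def max_pool_volume(volume):
    """Pool every depth slice independently; the depth dimension is unchanged."""
    return [max_pool_2x2(depth_slice) for depth_slice in volume]


example = [
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
]
pooled_example = max_pool_2x2(example)  # 4x4 input becomes 2x2
```

Of the 16 input activations only 4 survive, which is the 75% reduction mentioned above.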

In addition to max pooling, pooling units can use other functions, such as average pooling or ℓ2-norm pooling. Average pooling was often used historically but has recently fallen out of favor compared to max pooling, which generally performs better in practice.[64]

Because pooling reduces the spatial size of the representation so quickly, there is a recent trend towards using smaller filters[65] or discarding pooling layers altogether.[66]

Figure: RoI pooling to size 2x2. In this example, the region proposal (an input parameter) has size 7x5.

"Region of Interest" pooling (also known as RoI pooling) is a variant of max pooling in which the output size is fixed and the input rectangle is a parameter.[67]
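
A minimal sketch of this idea follows. The way the rectangle is split into bins (integer division of the region's height and width) is an assumption chosen for simplicity; actual implementations may round bin edges differently. The function name `roi_max_pool` and the `(top, left, height, width)` layout of the region are likewise illustrative.

```python
def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool the rectangle roi = (top, left, height, width) to out_h x out_w.

    Assumes the region is at least out_h rows by out_w columns, so no bin is empty.
    """
    top, left, height, width = roi
    pooled = []
    for i in range(out_h):
        # Bin edges along the height; integer division spreads the remainder.
        y0 = top + (i * height) // out_h
        y1 = top + ((i + 1) * height) // out_h
        row = []
        for j in range(out_w):
            x0 = left + (j * width) // out_w
            x1 = left + ((j + 1) * width) // out_w
            row.append(max(feature_map[y][x]
                           for y in range(y0, y1)
                           for x in range(x0, x1)))
        pooled.append(row)
    return pooled


# An 8x8 map whose value encodes its position, and a region 7 wide by 5 high.
feature_map = [[y * 8 + x for x in range(8)] for y in range(8)]
roi_output = roi_max_pool(feature_map, roi=(1, 0, 5, 7))
```

Whatever the size of the input rectangle, the output here is always 2x2, which is what allows a fixed-size layer to follow it.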

Pooling is a downsampling method and an important component of convolutional neural networks for object detection based on the Fast R-CNN[68] architecture.

Channel Max Pooling

A channel max pooling (CMP) layer applies the max pooling (MP) operation along the channel axis, across corresponding positions of consecutive feature maps, in order to eliminate redundant information. CMP gathers the significant features into fewer channels, which is important for fine-grained image classification, where more discriminative features are needed. Another advantage of CMP is that it reduces the number of feature-map channels before they are connected to the first fully connected (FC) layer. As with the MP operation, the input and output feature maps of a CMP layer are denoted [math]\displaystyle{ \mathcal{F} \in \mathbb{R}^{C \times M \times N} }[/math] and [math]\displaystyle{ \mathcal{C} \in \mathbb{R}^{c \times M \times N} }[/math], respectively, where C and c are the numbers of input and output channels, and M and N are the width and height of the feature maps. CMP changes only the number of channels; the width and height of the feature maps are unchanged, unlike in the MP operation.[69]
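
One possible reading of this operation is sketched below. It assumes non-overlapping groups of consecutive channels, with the input channel count divisible by the group size; the text above does not fix these details, so the grouping scheme and the name `channel_max_pool` are assumptions of the example.

```python
def channel_max_pool(feature_maps, group_size):
    """Reduce C channels to C // group_size by an element-wise max over each group.

    feature_maps is a list of C channels, each a 2D list of identical shape.
    """
    num_channels = len(feature_maps)
    if num_channels % group_size != 0:
        raise ValueError("this sketch assumes C is divisible by the group size")
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    pooled = []
    for start in range(0, num_channels, group_size):
        group = feature_maps[start:start + group_size]
        # Max across the channels of the group at each spatial position.
        pooled.append([
            [max(channel[r][c] for channel in group) for c in range(cols)]
            for r in range(rows)
        ])
    return pooled


channels = [
    [[1, 2], [3, 4]],
    [[4, 3], [2, 1]],
    [[0, 0], [0, 0]],
    [[5, -1], [-1, 5]],
]
cmp_output = channel_max_pool(channels, group_size=2)  # 4 channels become 2
```

The spatial shape of each channel is untouched; only the number of channels shrinks.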

ReLU layer

ReLU is the abbreviation of rectified linear unit, which applies the non-saturating activation function [math]\displaystyle{ f(x)=\max(0,x) }[/math].[56] It effectively removes negative values from an activation map by setting them to zero.[70] It introduces nonlinearity into the decision function and the overall network without affecting the receptive fields of the convolution layers.

Other functions can also be used to increase nonlinearity, for example the saturating hyperbolic tangent [math]\displaystyle{ f(x)=\tanh(x) }[/math], [math]\displaystyle{ f(x)=|\tanh(x)| }[/math], and the sigmoid function [math]\displaystyle{ \sigma(x)=(1+e^{-x} )^{-1} }[/math]. ReLU is often preferred to other functions because it trains the neural network several times faster without a significant penalty to generalization accuracy.[71]
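
The activation functions named in this section can be written directly from their formulas. The helper `apply_elementwise` and the nested-list activation map are assumptions made for the illustration.

```python
import math


def relu(x):
    """Rectified linear unit: f(x) = max(0, x)."""
    return max(0.0, x)


def tanh(x):
    """Saturating hyperbolic tangent: f(x) = tanh(x)."""
    return math.tanh(x)


def abs_tanh(x):
    """Absolute hyperbolic tangent: f(x) = |tanh(x)|."""
    return abs(math.tanh(x))


def sigmoid(x):
    """Sigmoid: sigma(x) = (1 + e^(-x))^(-1)."""
    return 1.0 / (1.0 + math.exp(-x))


def apply_elementwise(fn, activation_map):
    """Apply an activation function to every entry of a 2D activation map."""
    return [[fn(value) for value in row] for row in activation_map]


rectified = apply_elementwise(relu, [[-1.0, 2.0], [0.5, -3.0]])
```

Because each function acts on one value at a time, the shape of the activation map is unchanged.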

Fully connected layer

After several convolutional and max pooling layers, the final classification is done via fully connected layers. Neurons in a fully connected layer have connections to all activations in the previous layer, as seen in regular (non-convolutional) artificial neural networks. Their activations can thus be computed as an affine transformation, with matrix multiplication followed by a bias offset (vector addition of a learned or fixed bias term).[citation needed]
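
The affine transformation can be sketched as a matrix-vector product plus a bias vector. The name `fully_connected` and the convention that `weights[i]` holds the incoming weights of output neuron `i` are assumptions of the example.

```python
def fully_connected(inputs, weights, biases):
    """Compute W x + b, where weights[i] holds the weights of output neuron i."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


flattened = [1.0, 2.0, 3.0]            # activations of the previous layer
weights = [[1.0, 0.0, -1.0],           # one row per output neuron
           [0.5, 0.5, 0.5]]
biases = [0.5, -1.0]
fc_output = fully_connected(flattened, weights, biases)
```

Every output neuron reads every input activation, in contrast to the local connectivity of a convolutional layer.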

Loss layer

The "loss layer", or "loss function", specifies how training penalizes the deviation between the predicted output of the network and the true data labels (during supervised learning). Various loss functions can be used, depending on the specific task.


The Softmax loss function is used for predicting a single class out of K mutually exclusive classes.[nb 3] Sigmoid cross-entropy loss is used for predicting K independent probability values in [math]\displaystyle{ [0,1] }[/math]. Euclidean loss is used for regressing to real-valued labels in [math]\displaystyle{ (-\infty,\infty) }[/math].
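
The three losses can be sketched as follows. These are common textbook forms rather than definitions taken from this article: the function names, the factor of 1/2 in the Euclidean loss, and the use of raw scores as inputs are all assumptions. The sigmoid version is not hardened against overflow for very large negative scores.

```python
import math


def softmax_loss(scores, true_class):
    """Cross-entropy of softmax(scores) against a single correct class index."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    log_sum = math.log(sum(math.exp(s - shift) for s in scores))
    return -(scores[true_class] - shift - log_sum)


def sigmoid_cross_entropy_loss(scores, targets):
    """Sum of K independent binary cross-entropies; targets lie in [0, 1]."""
    total = 0.0
    for s, t in zip(scores, targets):
        p = 1.0 / (1.0 + math.exp(-s))  # predicted probability for this output
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total


def euclidean_loss(predictions, labels):
    """Half the squared Euclidean distance (the 1/2 factor is a convention)."""
    return 0.5 * sum((p - y) ** 2 for p, y in zip(predictions, labels))
```

With two equal scores the softmax assigns probability 1/2 to each class, so `softmax_loss([0.0, 0.0], 0)` equals ln 2.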

Hyperparameters

Hyperparameters are various settings that are used to control the learning process. CNNs use more hyperparameters than a standard multilayer perceptron (MLP).


Kernel size

The kernel size is the number of pixels processed together. It is typically expressed as the kernel's dimensions, e.g., 2x2 or 3x3.

Padding

Padding is the addition of (typically) 0-valued pixels on the borders of an image. This is done so that the border pixels are not underrepresented in the output: without padding they would participate in far fewer receptive fields than interior pixels (a corner pixel in only one). The total padding applied along a dimension is typically one less than the corresponding kernel dimension. For example, a convolutional layer using 3x3 kernels would receive a 2-pixel pad, that is, 1 pixel on each side of the image.[72]

Stride

The stride is the number of pixels that the analysis window moves on each iteration. A stride of 2 means that each kernel is offset by 2 pixels from its predecessor.
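
Kernel size, padding and stride together determine how many window positions fit along one axis. The sketch below uses the usual relation, floor((W − K + 2P) / S) + 1, with P counted per side; the function and parameter names are illustrative assumptions.

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Number of window positions along one axis; padding is counted per side."""
    span = input_size + 2 * padding - kernel_size
    if span < 0:
        raise ValueError("kernel does not fit inside the padded input")
    return span // stride + 1


same_size = conv_output_size(32, kernel_size=3, padding=1, stride=1)  # size preserved
no_padding = conv_output_size(32, kernel_size=3)                      # border is lost
halved = conv_output_size(32, kernel_size=2, stride=2)                # 2x2, stride 2
```

A 3x3 kernel with 1 pixel of padding on each side keeps the output the same size as the input, which matches the padding example given earlier.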

Number of filters

Since feature map size decreases with depth, layers near the input layer tend to have fewer filters while higher layers can have more. To equalize computation at each layer, the product of feature values [math]\displaystyle{ v_a }[/math] with pixel position is kept roughly constant across layers. Preserving more information about the input would require keeping the total number of activations (number of feature maps times number of pixel positions) non-decreasing from one layer to the next.

The number of feature maps directly controls the capacity and depends on the number of available examples and task complexity.

Filter size

Common filter sizes found in the literature vary greatly, and are usually chosen based on the data set.

The challenge is to find the right level of granularity so as to create abstractions at the proper scale, given a particular data set, and without overfitting.

Pooling type and size

Max pooling is typically used, often with a 2x2 dimension. This implies that the input is drastically downsampled, reducing processing cost.

Large input volumes may warrant 4×4 pooling in the lower layers.[73] Greater pooling reduces the dimension of the signal, and may result in unacceptable information loss. Often, non-overlapping pooling windows perform best.[64]

Dilation

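Dilation involves skipping pixels within a kernel, so that the kernel covers a larger area while processing the same number of pixels. This enlarges the receptive field without extra processing or memory, potentially without significant signal loss. Under the common convention in which a dilation of d places the kernel taps d pixels apart, a dilation of 2 on a 3x3 kernel expands it to cover 5x5 pixels while still processing 9 (evenly spaced) pixels, and a dilation of 4 expands it to 9x9. When 3x3 layers with dilations 1, 2 and 4 are stacked, the receptive field grows to 3x3, 7x7 and 15x15, respectively.[74]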
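
Conventions for counting dilation differ. The sketch below assumes the one used by common deep learning libraries, where a dilation rate d places the kernel taps d pixels apart (so rate 1 is an ordinary kernel). Under that assumption a single 3x3 kernel spans 5x5 at rate 2, while 7x7 and 15x15 appear as the receptive fields of stacked stride-1 layers with rates 1, 2 and 1, 2, 4. The function names are illustrative.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered along one axis by a kernel whose taps are `dilation` pixels apart."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def stacked_receptive_field(kernel_size, dilations):
    """Receptive field along one axis of stride-1 layers applied in sequence."""
    field = 1
    for d in dilations:
        # Each layer widens the field by (effective kernel size - 1).
        field += effective_kernel_size(kernel_size, d) - 1
    return field


single_rate_2 = effective_kernel_size(3, 2)            # one 3x3 kernel, rate 2
two_layers = stacked_receptive_field(3, [1, 2])        # rates 1 then 2
three_layers = stacked_receptive_field(3, [1, 2, 4])   # rates 1, 2 then 4
```

In every case each kernel still reads only 9 pixels; dilation changes where they are sampled, not how many there are.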

Translation equivariance and aliasing

It is commonly assumed that CNNs are invariant to shifts of the input. Convolution or pooling layers within a CNN that do not have a stride greater than one are indeed equivariant to translations of the input.[61] However, layers with a stride greater than one ignore the Nyquist-Shannon sampling theorem and might lead to aliasing of the input signal.[61] While, in principle, CNNs are capable of implementing anti-aliasing filters, it has been observed that this does not happen in practice,[75] yielding models that are not equivariant to translations. Furthermore, if a CNN makes use of fully connected layers, translation equivariance does not imply translation invariance, as the fully connected layers are not invariant to shifts of the input.[76][4] One solution for complete translation invariance is avoiding any down-sampling throughout the network and applying global average pooling at the last layer.[61] Additionally, several other partial solutions have been proposed, such as anti-aliasing before downsampling operations,[77] spatial transformer networks,[78] data augmentation, subsampling combined with pooling,[4] and capsule neural networks.[79]

Evaluation

The accuracy of the final model is typically estimated on a portion of the dataset set aside at the start, often called a test set. Alternatively, methods such as k-fold cross-validation are applied. Other strategies include using conformal prediction.[80][81]

Regularization methods

Regularization is a process of introducing additional information to solve an ill-posed problem or to prevent overfitting. CNNs use various types of regularization.


Empirical

Dropout

Because a fully connected layer occupies most of the parameters, it is prone to overfitting. One method to reduce overfitting is dropout.[82][83] At each training stage, individual nodes are either "dropped out" of the net (ignored) with probability [math]\displaystyle{ 1-p }[/math] or kept with probability [math]\displaystyle{ p }[/math], so that a reduced network is left; incoming and outgoing edges to a dropped-out node are also removed. Only the reduced network is trained on the data in that stage. The removed nodes are then reinserted into the network with their original weights.

In the training stages, [math]\displaystyle{ p }[/math] is usually 0.5; for input nodes, it is typically much higher because information is directly lost when input nodes are ignored.

At testing time after training has finished, we would ideally like to find a sample average of all possible [math]\displaystyle{ 2^n }[/math] dropped-out networks; unfortunately this is unfeasible for large values of [math]\displaystyle{ n }[/math]. However, we can find an approximation by using the full network with each node's output weighted by a factor of [math]\displaystyle{ p }[/math], so the expected value of the output of any node is the same as in the training stages. This is the biggest contribution of the dropout method: although it effectively generates [math]\displaystyle{ 2^n }[/math] neural nets, and as such allows for model combination, at test time only a single network needs to be tested.
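
The training and test behaviour described above can be sketched as follows. The function names, the list-based activations, and the fixed random seed are assumptions of the example; the scaling by p at test time follows the description in the text.

```python
import random


def dropout_train(activations, keep_prob, rng):
    """Training pass: keep each node with probability keep_prob, else output 0."""
    return [a if rng.random() < keep_prob else 0.0 for a in activations]


def dropout_test(activations, keep_prob):
    """Test pass: use every node, scaled by keep_prob to match the training mean."""
    return [a * keep_prob for a in activations]


rng = random.Random(0)                 # seeded only to make the sketch repeatable
activations = [1.0, 2.0, 3.0, 4.0]
trials = 20000

# Average many randomly thinned networks to estimate the expected training output.
mean_train = [0.0] * len(activations)
for _ in range(trials):
    sample = dropout_train(activations, 0.5, rng)
    mean_train = [m + s / trials for m, s in zip(mean_train, sample)]

test_output = dropout_test(activations, 0.5)
```

The averaged training outputs in `mean_train` come out close to `test_output`, which is why a single scaled network can stand in for the average over all thinned networks.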

By avoiding training all nodes on all training data, dropout decreases overfitting. The method also significantly improves training speed. This makes the model combination practical, even for deep neural networks. The technique seems to reduce node interactions, leading the nodes to learn more robust features that better generalize to new data.

DropConnect

DropConnect is the generalization of dropout in which each connection, rather than each output unit, can be dropped with probability [math]\displaystyle{ 1-p }[/math]. Each unit thus receives input from a random subset of units in the previous layer.[84]

DropConnect is similar to dropout as it introduces dynamic sparsity within the model, but differs in that the sparsity is on the weights, rather than the output vectors of a layer. In other words, the fully connected layer with DropConnect becomes a sparsely connected layer in which the connections are chosen at random during the training stage.
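
The difference from dropout can be made concrete with a sketch in which the random mask is applied to individual weights rather than to node outputs. The name `dropconnect_forward` and the choice to return the masked weight matrix alongside the outputs are assumptions of the example; only the training-time pass is shown.

```python
import random


def dropconnect_forward(inputs, weights, biases, keep_prob, rng):
    """Training pass of a fully connected layer with each weight kept with keep_prob."""
    # Sample a fresh sparse weight matrix: dropped connections become 0.
    masked = [
        [w if rng.random() < keep_prob else 0.0 for w in row]
        for row in weights
    ]
    outputs = [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(masked, biases)
    ]
    return outputs, masked


rng = random.Random(0)                 # seeded only to make the sketch repeatable
inputs = [1.0, 2.0]
weights = [[1.0, 1.0], [2.0, -1.0]]
biases = [0.5, 0.0]
outputs, masked_weights = dropconnect_forward(inputs, weights, biases, 0.5, rng)
```

Each output unit here sees its own random subset of inputs, whereas dropout would remove a whole unit and all of its connections at once.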

Stochastic pooling

A major drawback to Dropout is that it does not have the same benefits for convolutional layers, where the neurons are not fully connected.

In stochastic pooling,[85] the conventional deterministic pooling operations are replaced with a stochastic procedure, where the activation within each pooling region is picked randomly according to a multinomial distribution, given by the activities within the pooling region. This approach is free of hyperparameters and can be combined with other regularization approaches, such as dropout and data augmentation.
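
A sketch of the sampling step for a single pooling region follows. It assumes the activations are non-negative (for example, after a ReLU) and that each one is chosen with probability proportional to its value; the handling of an all-zero region and the name `stochastic_pool_region` are assumptions of the example.

```python
import random


def stochastic_pool_region(activations, rng):
    """Pick one activation from a region with probability proportional to its value.

    Assumes all activations are non-negative.
    """
    total = sum(activations)
    if total == 0:
        return 0.0  # all-zero region: nothing to sample from
    # random.Random.choices draws according to the given (unnormalized) weights.
    return rng.choices(activations, weights=activations, k=1)[0]


rng = random.Random(0)                 # seeded only to make the sketch repeatable
region = [1.0, 3.0]
draws = [stochastic_pool_region(region, rng) for _ in range(10000)]
fraction_large = draws.count(3.0) / len(draws)
```

In this example the larger activation is selected about three times out of four, so strong activations dominate without the smaller ones being ignored entirely, as they would be under max pooling.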

An alternate view of stochastic pooling is that it is equivalent to standard max pooling but with many copies of an input image, each having small local deformations. This is similar to explicit elastic deformations of the input images,[86] which delivers excellent performance on the MNIST data set.[86] Using stochastic pooling in a multilayer model gives an exponential number of deformations since the selections in higher layers are independent of those below.

Artificial data

Because the degree of model overfitting is determined by both its power and the amount of training it receives, providing a convolutional network with more training examples can reduce overfitting. Because these networks are usually trained with all available data, one approach is to either generate new data from scratch (if possible) or perturb existing data to create new ones. For example, input images can be cropped, rotated, or rescaled to create new examples with the same labels as the original training set.[87]
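
The perturbations mentioned above can be sketched on a small image stored as nested lists. The specific choices here (a 90-degree rotation, a crop that drops the last row and column, nearest-neighbour rescaling to double size) and all function names are assumptions made for the illustration. Whether a given transformation really preserves the label depends on the task; a rotated "6", for instance, may look like a "9".

```python
def rotate_90(image):
    """Rotate a 2D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def crop(image, top, left, height, width):
    """Cut out a height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


def rescale_nearest(image, new_height, new_width):
    """Resize with nearest-neighbour sampling."""
    old_h, old_w = len(image), len(image[0])
    return [
        [image[y * old_h // new_height][x * old_w // new_width]
         for x in range(new_width)]
        for y in range(new_height)
    ]


def augment(image, label):
    """Return the original example plus three perturbed copies with the same label."""
    h, w = len(image), len(image[0])
    return [
        (image, label),
        (rotate_90(image), label),
        (crop(image, 0, 0, h - 1, w - 1), label),
        (rescale_nearest(image, 2 * h, 2 * w), label),
    ]


augmented = augment([[1, 2, 3], [4, 5, 6], [7, 8, 9]], label="example")
```

One labelled image yields four training examples here, all sharing the original label.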


Explicit

Early stopping

One of the simplest methods to prevent overfitting of a network is to simply stop the training before overfitting has had a chance to occur. It comes with the disadvantage that the learning process is halted.



Number of parameters

Another simple way to prevent overfitting is to limit the number of parameters, typically by limiting the number of hidden units in each layer or limiting network depth. For convolutional networks, the filter size also affects the number of parameters. Limiting the number of parameters restricts the predictive power of the network directly, reducing the complexity of the function that it can perform on the data, and thus limits the amount of overfitting. This is equivalent to a "zero norm".
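
To make the effect of filter size and layer width concrete, the following sketch counts trainable parameters for a standard 2-D convolutional layer and a fully connected layer. The function names are illustrative only, and the formulas assume ordinary (ungrouped) convolutions with one bias per output channel or unit.

```python
def conv2d_params(in_channels, out_channels, kernel_h, kernel_w, bias=True):
    """Trainable parameters of a standard 2-D convolutional layer."""
    weights = in_channels * out_channels * kernel_h * kernel_w
    return weights + (out_channels if bias else 0)

def dense_params(in_units, out_units, bias=True):
    """Trainable parameters of a fully connected layer."""
    return in_units * out_units + (out_units if bias else 0)

# Growing the filter from 3x3 to 5x5 nearly triples the layer's parameters.
small = conv2d_params(3, 64, 3, 3)   # 3*64*9  + 64 = 1792
large = conv2d_params(3, 64, 5, 5)   # 3*64*25 + 64 = 4864
```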


Weight decay

A simple form of added regularizer is weight decay, which adds an additional error term, proportional to the sum of the absolute values of the weights (L1 norm) or to the squared magnitude (squared L2 norm) of the weight vector, to the error at each node. The level of acceptable model complexity can be reduced by increasing the proportionality constant (the 'alpha' hyperparameter), thus increasing the penalty for large weight vectors.


L2 regularization is the most common form of regularization. It can be implemented by penalizing the squared magnitude of all parameters directly in the objective. The L2 regularization has the intuitive interpretation of heavily penalizing peaky weight vectors and preferring diffuse weight vectors. Due to multiplicative interactions between weights and inputs this has the useful property of encouraging the network to use all of its inputs a little rather than some of its inputs a lot.


L1 regularization is also common. It makes the weight vectors sparse during optimization. In other words, neurons with L1 regularization end up using only a sparse subset of their most important inputs and become nearly invariant to the noisy inputs. L1 and L2 regularization can be combined; this is called elastic net regularization.
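
The penalties discussed in this section can be written out directly. The sketch below is a plain-Python illustration; the function names and the two coefficients alpha_l1 and alpha_l2 are assumptions made for the example, not a reference to any particular library's interface.

```python
def l1_penalty(weights):
    """Sum of absolute values of the weights."""
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    """Sum of squared weights (squared L2 norm)."""
    return sum(w * w for w in weights)

def elastic_net_penalty(weights, alpha_l1, alpha_l2):
    """Weighted combination of the L1 and L2 penalties."""
    return alpha_l1 * l1_penalty(weights) + alpha_l2 * l2_penalty(weights)

# Two weight vectors with the same L1 norm:
peaky = [1.0, 0.0, 0.0, 0.0]        # relies on one input a lot
diffuse = [0.25, 0.25, 0.25, 0.25]  # uses every input a little
```

Here l1_penalty is 1.0 for both vectors, while l2_penalty is 1.0 for the peaky vector and 0.25 for the diffuse one. The L2 term therefore favours spreading weight across inputs, whereas the L1 term is indifferent between the two and, during optimization, tends to drive small weights to exactly zero.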


Max norm constraints

Another form of regularization is to enforce an absolute upper bound on the magnitude of the weight vector for every neuron and use projected gradient descent to enforce the constraint. In practice, this corresponds to performing the parameter update as normal, and then enforcing the constraint by clamping the weight vector [math]\displaystyle{ \vec{w} }[/math] of every neuron to satisfy [math]\displaystyle{ \|\vec{w}\|_{2}\lt c }[/math]. Typical values of [math]\displaystyle{ c }[/math] are on the order of 3–4. Some papers report improvements[88] when using this form of regularization.
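
A minimal sketch of the clamping step, assuming one weight vector per neuron stored as a Python list; clamp_max_norm is a hypothetical helper name. Note that this projection leaves the norm at most equal to c, a slight relaxation of the strict inequality written above.

```python
import math

def clamp_max_norm(w, c):
    """Project weight vector w onto the L2 ball of radius c.

    Vectors already inside the ball are returned unchanged; longer
    vectors are rescaled so that their norm equals c.
    """
    norm = math.sqrt(sum(x * x for x in w))
    if norm <= c:
        return list(w)
    scale = c / norm
    return [x * scale for x in w]
```

In a training loop this would be applied to every neuron's weight vector immediately after the ordinary gradient update. For example, clamp_max_norm([3.0, 4.0], 3.0) rescales a vector of norm 5 to approximately [1.8, 2.4], keeping its direction.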


Hierarchical coordinate frames

Pooling loses the precise spatial relationships between high-level parts (such as nose and mouth in a face image). These relationships are needed for identity recognition. Overlapping the pools, so that each feature occurs in multiple pools, helps retain the information. Translation alone cannot extrapolate the understanding of geometric relationships to a radically new viewpoint, such as a different orientation or scale. On the other hand, people are very good at extrapolating; after seeing a new shape once, they can recognize it from a different viewpoint.[89]


An earlier common way to deal with this problem is to train the network on transformed data in different orientations, scales, lighting, etc. so that the network can cope with these variations. This is computationally intensive for large data-sets. The alternative is to use a hierarchy of coordinate frames and use a group of neurons to represent a conjunction of the shape of the feature and its pose relative to the retina. The pose relative to the retina is the relationship between the coordinate frame of the retina and the intrinsic features' coordinate frame.[90]


Thus, one way to represent something is to embed the coordinate frame within it. This allows large features to be recognized by using the consistency of the poses of their parts (e.g. nose and mouth poses make a consistent prediction of the pose of the whole face). This approach ensures that the higher-level entity (e.g. face) is present when the lower-level entities (e.g. nose and mouth) agree on their predictions of its pose. The vectors of neuronal activity that represent pose ("pose vectors") allow spatial transformations to be modeled as linear operations, which makes it easier for the network to learn the hierarchy of visual entities and generalize across viewpoints. This is similar to the way the human visual system imposes coordinate frames in order to represent shapes.[91]
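
The idea of parts agreeing on the pose of the whole can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not the mechanism of any specific network: poses are 3×3 homogeneous matrices for 2-D rotation and translation, each part has a fixed part-to-whole transform, and the helper names (pose, matmul, votes_agree) are hypothetical.

```python
import math

def pose(theta, tx, ty):
    """Homogeneous 2-D pose: rotation by theta, then translation (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]]

def matmul(a, b):
    """Product of two 3x3 matrices; composing poses is a linear operation."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def votes_agree(votes, tol=1e-6):
    """True when every predicted whole-object pose matches the first one."""
    first = votes[0]
    return all(abs(v[i][j] - first[i][j]) <= tol
               for v in votes[1:] for i in range(3) for j in range(3))

# Fixed geometry of a face: where each part sits in the face frame, and
# the inverse transforms that map a part pose back to a face pose.
nose_in_face, face_from_nose = pose(0.0, 0.0, -1.0), pose(0.0, 0.0, 1.0)
mouth_in_face, face_from_mouth = pose(0.0, 0.0, -2.0), pose(0.0, 0.0, 2.0)

# A real face seen at some viewpoint: both parts are placed consistently.
face = pose(0.3, 5.0, 2.0)
nose = matmul(face, nose_in_face)
mouth = matmul(face, mouth_in_face)
consistent = votes_agree([matmul(nose, face_from_nose),
                          matmul(mouth, face_from_mouth)])

# A mouth at an unrelated orientation no longer supports the same face.
stray_mouth = matmul(pose(1.2, 5.0, 2.0), mouth_in_face)
inconsistent = votes_agree([matmul(nose, face_from_nose),
                            matmul(stray_mouth, face_from_mouth)])
```

In this toy setting, consistent evaluates to True and inconsistent to False. Because a change of viewpoint multiplies every pose by the same matrix, the agreement test gives the same answer from any viewpoint.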


Applications


Image recognition

CNNs are often used in image recognition systems. In 2012 an error rate of 0.23% on the MNIST database was reported.[21] Another paper on using CNNs for image classification reported that the learning process was "surprisingly fast"; in the same paper, the best published results as of 2011 were achieved on the MNIST database and the NORB database.[18] Subsequently, a similar CNN called AlexNet[92] won the ImageNet Large Scale Visual Recognition Challenge 2012.


When applied to facial recognition, CNNs achieved a large decrease in error rate.[93] Another paper reported a 97.6% recognition rate on "5,600 still images of more than 10 subjects".[12] CNNs were used to assess video quality in an objective way after manual training; the resulting system had a very low root mean square error.[34]


The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a benchmark in object classification and detection, with millions of images and hundreds of object classes. In ILSVRC 2014,[94] almost every highly ranked team used a CNN as its basic framework. The winner, GoogLeNet[95] (the foundation of DeepDream), increased the mean average precision of object detection to 0.439329 and reduced classification error to 0.06656, the best result at that time. Its network used more than 30 layers. The performance of convolutional neural networks on the ImageNet tests was close to that of humans.[96] The best algorithms still struggle with objects that are small or thin, such as a small ant on the stem of a flower or a person holding a quill in their hand. They also have trouble with images that have been distorted with filters, an increasingly common phenomenon with modern digital cameras. By contrast, those kinds of images rarely trouble humans. Humans, however, tend to have trouble with other issues. For example, they are not good at classifying objects into fine-grained categories such as the particular breed of dog or species of bird, whereas convolutional neural networks handle such distinctions.[citation needed]


In 2015 a many-layered CNN demonstrated the ability to spot faces from a wide range of angles, including upside down, even when partially occluded, with competitive performance. The network was trained on a database of 200,000 images that included faces at various angles and orientations and a further 20 million images without faces. Training used batches of 128 images over 50,000 iterations.[97]


Video analysis

Compared to image data domains, there is relatively little work on applying CNNs to video classification. Video is more complex than images since it has another (temporal) dimension. However, some extensions of CNNs into the video domain have been explored. One approach is to treat space and time as equivalent dimensions of the input and perform convolutions in both time and space.[98][99] Another way is to fuse the features of two convolutional neural networks, one for the spatial and one for the temporal stream.[100][101][102] Long short-term memory (LSTM) recurrent units are typically incorporated after the CNN to account for inter-frame or inter-clip dependencies.[103][104] Unsupervised learning schemes for training spatio-temporal features have been introduced, based on Convolutional Gated Restricted Boltzmann Machines[105] and Independent Subspace Analysis.[106]


Natural language processing

CNNs have also been explored for natural language processing. CNN models are effective for various NLP problems and have achieved excellent results in semantic parsing,[107] search query retrieval,[108] sentence modeling,[109] classification,[110] prediction[111] and other traditional NLP tasks.[112] Compared to traditional language processing methods such as recurrent neural networks, CNNs can represent different contextual realities of language that do not rely on a series-sequence assumption, while RNNs are better suited when classical time series modeling is required.[113][114][115][116]


Anomaly Detection

A CNN with 1-D convolutions was used on time series in the frequency domain (spectral residual) by an unsupervised model to detect anomalies in the time domain.[117]


Drug discovery

CNNs have been used in drug discovery. Predicting the interaction between molecules and biological proteins can identify potential treatments. In 2015, Atomwise introduced AtomNet, the first deep learning neural network for structure-based drug design.[118] The system trains directly on 3-dimensional representations of chemical interactions. Similar to how image recognition networks learn to compose smaller, spatially proximate features into larger, complex structures,[119] AtomNet discovers chemical features, such as aromaticity, sp3 carbons, and hydrogen bonding. Subsequently, AtomNet was used to predict novel candidate biomolecules for multiple disease targets, most notably treatments for the Ebola virus[120] and multiple sclerosis.[121]


Health risk assessment and biomarkers of aging discovery


CNNs can be naturally tailored to analyze a sufficiently large collection of time series data representing one-week-long human physical activity streams augmented by rich clinical data (including the death register, as provided by, e.g., the NHANES study). A simple CNN was combined with a Cox-Gompertz proportional hazards model and used to produce a proof-of-concept example of digital biomarkers of aging in the form of an all-cause-mortality predictor.[122]


Checkers game

CNNs have been used in the game of checkers. From 1999 to 2001, Fogel and Chellapilla published papers showing how a convolutional neural network could learn to play checkers using co-evolution. The learning process did not use prior human professional games, but rather focused on a minimal set of information contained in the checkerboard: the location and type of pieces, and the difference in number of pieces between the two sides. Ultimately, the program (Blondie24) was tested on 165 games against players and ranked in the highest 0.4%.[123][124] It also earned a win against the program Chinook at its "expert" level of play.[125]


Go

CNNs have been used in computer Go. In December 2014, Clark and Storkey published a paper showing that a CNN trained by supervised learning from a database of human professional games could outperform GNU Go and win some games against Monte Carlo tree search Fuego 1.1 in a fraction of the time it took Fuego to play.[126] Later it was announced that a large 12-layer convolutional neural network had correctly predicted the professional move in 55% of positions, equalling the accuracy of a 6 dan human player. When the trained convolutional network was used directly to play games of Go, without any search, it beat the traditional search program GNU Go in 97% of games, and matched the performance of the Monte Carlo tree search program Fuego simulating ten thousand playouts (about a million positions) per move.[127]


AlphaGo, the first program to beat the best human player at the time, used a pair of CNNs to drive MCTS: a "policy network" for choosing moves to try and a "value network" for evaluating positions.[128]


Time series forecasting

Recurrent neural networks are generally considered the best neural network architectures for time series forecasting (and sequence modeling in general), but recent studies show that convolutional networks can perform comparably or even better.[129][8] Dilated convolutions[130] might enable one-dimensional convolutional neural networks to effectively learn time series dependences.[131] Convolutions can be implemented more efficiently than RNN-based solutions, and they do not suffer from vanishing (or exploding) gradients.[132] Convolutional networks can provide an improved forecasting performance when there are multiple similar time series to learn from.[133] CNNs can also be applied to further tasks in time series analysis (e.g., time series classification[134] or quantile forecasting[135]).
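
A dilated one-dimensional convolution can be sketched in plain Python as follows. The causal form used here (each output depends only on current and past samples, with missing history treated as zero) is an assumption chosen because it suits forecasting; the function names are illustrative.

```python
def dilated_causal_conv1d(series, kernel, dilation=1):
    """Compute y[t] = sum_k kernel[k] * series[t - k * dilation].

    Samples before the start of the series are treated as zero, so the
    output has the same length as the input and never looks ahead.
    """
    out = []
    for t in range(len(series)):
        total = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                total += w * series[idx]
        out.append(total)
    return out

def receptive_field(kernel_size, dilations):
    """Number of input steps seen by one output of a stack of layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)
```

With kernel [1, 1] and dilation 2, the series [1, 2, 3, 4, 5] maps to [1.0, 2.0, 4.0, 6.0, 8.0], since each output adds the sample two steps back. Stacking four layers of kernel size 2 with dilations 1, 2, 4 and 8 gives receptive_field(2, [1, 2, 4, 8]) == 16, so the span of history covered grows exponentially with depth while the parameter count grows only linearly.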


Cultural Heritage and 3D-datasets

As archaeological findings such as clay tablets with cuneiform writing are increasingly acquired using 3D scanners, the first benchmark datasets are becoming available, such as HeiCuBeDa,[136] which provides almost 2,000 normalized 2D and 3D datasets prepared with the GigaMesh Software Framework.[137] Curvature-based measures are therefore used in conjunction with Geometric Neural Networks (GNNs), e.g. for period classification of these clay tablets, which are among the oldest documents of human history.[138][139]


Fine-tuning

For many applications, little training data is available. Convolutional neural networks usually require a large amount of training data in order to avoid overfitting. A common technique is to train the network on a larger data set from a related domain. Once the network parameters have converged, an additional training step is performed using the in-domain data to fine-tune the network weights; this is known as transfer learning. Furthermore, this technique allows convolutional network architectures to be applied successfully to problems with tiny training sets.[140]


Human interpretable explanations

End-to-end training and prediction are common practice in computer vision. However, human interpretable explanations are required for critical systems such as self-driving cars.[141] With recent advances in visual salience, spatial attention, and temporal attention, the most critical spatial regions/temporal instants can be visualized to justify the CNN predictions.[142][143]


Related architectures


Deep Q-networks

A deep Q-network (DQN) is a type of deep learning model that combines a deep neural network with Q-learning, a form of reinforcement learning. Unlike earlier reinforcement learning agents, DQNs that utilize CNNs can learn directly from high-dimensional sensory inputs.[144]
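
The Q-learning component can be illustrated with the standard one-step temporal-difference target. In the sketch below, `next_q_values` stands for the action values that a DQN's convolutional network would output for the next state; the function names and the discount factor are illustrative assumptions, not taken from the cited work.

```python
def td_target(reward, next_q_values, gamma=0.99, done=False):
    """One-step Q-learning target: r + gamma * max over a' of Q(s', a')."""
    if done:
        # No future reward after a terminal state.
        return reward
    return reward + gamma * max(next_q_values)


def td_loss(q_value, target):
    """Squared error between the predicted Q-value and its target."""
    return (target - q_value) ** 2


# Hypothetical numbers: the network predicted Q(s, a) = 2.0 for the action
# taken, the reward was 1.0, and the next state has action values 0.5 and 2.0.
target = td_target(1.0, [0.5, 2.0], gamma=0.9)
loss = td_loss(2.0, target)
print(target, loss)
```

In a full DQN the loss would be minimized with respect to the network weights by gradient descent; that training loop is omitted here.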


Preliminary results were presented in 2014, with an accompanying paper in February 2015.[145] The research described an application to Atari 2600 gaming. Other deep reinforcement learning models preceded it.[146]


Deep belief networks

Convolutional deep belief networks (CDBN) have a structure very similar to that of convolutional neural networks and are trained similarly to deep belief networks. They therefore exploit the 2D structure of images, as CNNs do, and make use of pre-training, as deep belief networks do. They provide a generic structure that can be used in many image and signal processing tasks. Benchmark results on standard image datasets like CIFAR[147] have been obtained using CDBNs.[148]



Notable libraries

  • Caffe: A library for convolutional neural networks. Created by the Berkeley Vision and Learning Center (BVLC). It supports both CPU and GPU. Developed in C++, and has Python and MATLAB wrappers.
  • Deeplearning4j: Deep learning in Java and Scala on multi-GPU-enabled Spark. A general-purpose deep learning library for the JVM production stack running on a C++ scientific computing engine. Allows the creation of custom layers. Integrates with Hadoop and Kafka.
  • Dlib: A toolkit for making real world machine learning and data analysis applications in C++.
  • Microsoft Cognitive Toolkit: A deep learning toolkit written by Microsoft with several unique features enhancing scalability over multiple nodes. It supports full-fledged interfaces for training in C++ and Python and with additional support for model inference in C# and Java.
  • TensorFlow: Apache 2.0-licensed Theano-like library with support for CPU, GPU, Google's proprietary tensor processing unit (TPU),[149] and mobile devices.
  • Theano: The reference deep-learning library for Python with an API largely compatible with the popular NumPy library. Allows users to write symbolic mathematical expressions, then automatically generates their derivatives, saving users from having to code gradients or backpropagation. These symbolic expressions are automatically compiled to CUDA code for a fast, on-the-GPU implementation.
  • Torch: A scientific computing framework with wide support for machine learning algorithms, written in C and Lua.

See also

  • Attention (machine learning)
  • Convolution
  • Deep learning
  • Natural-language processing
  • Neocognitron
  • Scale-invariant feature transform
  • Time delay neural network
  • Vision processing unit


Notes

  1. When applied to other types of data than image data, such as sound data, "spatial position" may variously correspond to different points in the time domain, frequency domain, or other mathematical spaces.
  2. hence the name "convolutional layer"
  3. So-called categorical data.

References

  1. Valueva, M.V.; Nagornov, N.N.; Lyakhov, P.A.; Valuev, G.V.; Chervyakov, N.I. (2020). "Application of the residue number system to reduce hardware costs of the convolutional neural network implementation". Mathematics and Computers in Simulation. Elsevier BV. 177: 232–243. doi:10.1016/j.matcom.2020.04.031. ISSN 0378-4754. S2CID 218955622. Convolutional neural networks are a promising tool for solving the problem of pattern recognition.
  2. Zhang, Wei (1988). "Shift-invariant pattern recognition neural network and its optical architecture". Proceedings of Annual Conference of the Japan Society of Applied Physics.
  3. Zhang, Wei (1990). "Parallel distributed processing model with local space-invariant interconnections and its optical architecture". Applied Optics. 29 (32): 4790–7. Bibcode:1990ApOpt..29.4790Z. doi:10.1364/AO.29.004790. PMID 20577468.
  4. Mouton, Coenraad; Myburgh, Johannes C.; Davel, Marelie H. (2020). Gerber, Aurona (ed.). "Stride and Translation Invariance in CNNs". Artificial Intelligence Research. Communications in Computer and Information Science (in English). Cham: Springer International Publishing. 1342: 267–281. arXiv:2103.10097. doi:10.1007/978-3-030-66151-9_17. ISBN 978-3-030-66151-9. S2CID 232269854.
  5. van den Oord, Aaron; Dieleman, Sander; Schrauwen, Benjamin (2013-01-01). Burges, C. J. C.. ed. Deep content-based music recommendation. Curran Associates, Inc.. pp. 2643–2651. https://proceedings.neurips.cc/paper/2013/file/b3ba8f1bee1238a2f37603d90b58898d-Paper.pdf. 
  6. Collobert, Ronan; Weston, Jason (2008-01-01). A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. ICML '08. New York, NY, USA: ACM. pp. 160–167. doi:10.1145/1390156.1390177. ISBN 978-1-60558-205-4. 
  7. Avilov, Oleksii; Rimbert, Sebastien; Popov, Anton; Bougrain, Laurent (July 2020). "Deep Learning Techniques to Improve Intraoperative Awareness Detection from Electroencephalographic Signals". 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC). Montreal, QC, Canada: IEEE. 2020: 142–145. doi:10.1109/EMBC44109.2020.9176228. ISBN 978-1-7281-1990-8. PMID 33017950. S2CID 221386616.
  8. Tsantekidis, Avraam; Passalis, Nikolaos; Tefas, Anastasios; Kanniainen, Juho; Gabbouj, Moncef; Iosifidis, Alexandros (July 2017). "Forecasting Stock Prices from the Limit Order Book Using Convolutional Neural Networks". 2017 IEEE 19th Conference on Business Informatics (CBI). Thessaloniki, Greece: IEEE: 7–12. doi:10.1109/CBI.2017.23. ISBN 978-1-5386-3035-8. S2CID 4950757.
  9. Fukushima, K. (2007). "Neocognitron". Scholarpedia. 2 (1): 1717. Bibcode:2007SchpJ...2.1717F. doi:10.4249/scholarpedia.1717.
  10. Hubel, D. H.; Wiesel, T. N. (1968-03-01). "Receptive fields and functional architecture of monkey striate cortex". The Journal of Physiology. 195 (1): 215–243. doi:10.1113/jphysiol.1968.sp008455. ISSN 0022-3751. PMC 1557912. PMID 4966457.
  11. Fukushima, Kunihiko (1980). "Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position" (PDF). Biological Cybernetics. 36 (4): 193–202. doi:10.1007/BF00344251. PMID 7370364. S2CID 206775608. Retrieved 16 November 2013.
  12. Matsugu, Masakazu; Katsuhiko Mori; Yusuke Mitari; Yuji Kaneda (2003). "Subject independent facial expression recognition with robust face detection using a convolutional neural network" (PDF). Neural Networks. 16 (5): 555–559. doi:10.1016/S0893-6080(03)00115-1. PMID 12850007. Retrieved 17 November 2013.
  13. Ian Goodfellow and Yoshua Bengio and Aaron Courville (2016). Deep Learning. MIT Press. p. 326. https://www.deeplearningbook.org/. 
  14. "Convolutional Neural Networks (LeNet) – DeepLearning 0.1 documentation". DeepLearning 0.1. LISA Lab. Retrieved 31 August 2013.
  15. Habibi, Aghdam, Hamed (2017-05-30). Guide to convolutional neural networks : a practical application to traffic-sign detection and classification. Heravi, Elnaz Jahani. Cham, Switzerland. ISBN 9783319575490. OCLC 987790957. 
  16. Venkatesan, Ragav; Li, Baoxin (2017-10-23) (in en). Convolutional Neural Networks in Visual Computing: A Concise Guide. CRC Press. ISBN 978-1-351-65032-8. https://books.google.com/books?id=bAM7DwAAQBAJ&q=vanishing+gradient. 
  17. Balas, Valentina E.; Kumar, Raghvendra; Srivastava, Rajshree (2019-11-19) (in en). Recent Trends and Advances in Artificial Intelligence and Internet of Things. Springer Nature. ISBN 978-3-030-32644-9. https://books.google.com/books?id=XRS_DwAAQBAJ&q=exploding+gradient. 
  18. Ciresan, Dan; Ueli Meier; Jonathan Masci; Luca M. Gambardella; Jurgen Schmidhuber (2011). "Flexible, High Performance Convolutional Neural Networks for Image Classification" (PDF). Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence-Volume Volume Two. 2: 1237–1242. Retrieved 17 November 2013.
  19. Krizhevsky, Alex. "ImageNet Classification with Deep Convolutional Neural Networks" (PDF). Retrieved 17 November 2013.
  20. Yamaguchi, Kouichi; Sakamoto, Kenji; Akabane, Toshio; Fujimoto, Yoshiji (November 1990). A Neural Network for Speaker-Independent Isolated Word Recognition. First International Conference on Spoken Language Processing (ICSLP 90). Kobe, Japan.
  21. Ciresan, Dan; Meier, Ueli; Schmidhuber, Jürgen (June 2012). Multi-column deep neural networks for image classification. New York, NY: Institute of Electrical and Electronics Engineers (IEEE). pp. 3642–3649. arXiv:1202.2745. doi:10.1109/CVPR.2012.6248110. ISBN 978-1-4673-1226-4. OCLC 812295155.
  22. LeCun, Yann. "LeNet-5, convolutional neural networks". Retrieved 16 November 2013.
  23. Mahapattanakul, Puttatida (November 11, 2019). "From Human Vision to Computer Vision — Convolutional Neural Network(Part3/4)". Medium.
  24. Hubel, DH; Wiesel, TN (October 1959). "Receptive fields of single neurones in the cat's striate cortex". J. Physiol. 148 (3): 574–91. doi:10.1113/jphysiol.1959.sp006308. PMC 1363130. PMID 14403679.
  25. David H. Hubel and Torsten N. Wiesel (2005). Brain and visual perception: the story of a 25-year collaboration. Oxford University Press US. p. 106. ISBN 978-0-19-517618-6. https://books.google.com/books?id=8YrxWojxUA4C&pg=PA106. 
  26. LeCun, Yann; Bengio, Yoshua; Hinton, Geoffrey (2015). "Deep learning". Nature. 521 (7553): 436–444. Bibcode:2015Natur.521..436L. doi:10.1038/nature14539. PMID 26017442. S2CID 3074096.
  27. Weng, J; Ahuja, N; Huang, TS (1993). "Learning recognition and segmentation of 3-D objects from 2-D images". Proc. 4th International Conf. Computer Vision: 121–128. doi:10.1109/ICCV.1993.378228. ISBN 0-8186-3870-2. S2CID 8619176.
  28. Schmidhuber, Jürgen (2015). "Deep Learning". Scholarpedia. 10 (11): 32832. doi:10.4249/scholarpedia.32832.
  29. Homma, Toshiteru; Les Atlas; Robert Marks II (1988). "An Artificial Neural Network for Spatio-Temporal Bipolar Patterns: Application to Phoneme Classification" (PDF). Advances in Neural Information Processing Systems. 1: 31–40.
  30. Waibel, Alex (December 1987). Phoneme Recognition Using Time-Delay Neural Networks. Meeting of the Institute of Electrical, Information and Communication Engineers (IEICE). Tokyo, Japan.
  31. Alexander Waibel et al., Phoneme Recognition Using Time-Delay Neural Networks, IEEE Transactions on Acoustics, Speech, and Signal Processing, Volume 37, No. 3, pp. 328–339, March 1989.
  32. LeCun, Yann; Bengio, Yoshua (1995). "Convolutional networks for images, speech, and time series". In Arbib, Michael A. (ed.). The handbook of brain theory and neural networks (Second ed.). The MIT press. pp. 276–278.
  33. John B. Hampshire and Alexander Waibel, Connectionist Architectures for Multi-Speaker Phoneme Recognition, Advances in Neural Information Processing Systems, 1990, Morgan Kaufmann.
  34. Le Callet, Patrick; Christian Viard-Gaudin; Dominique Barba (2006). "A Convolutional Neural Network Approach for Objective Video Quality Assessment" (PDF). IEEE Transactions on Neural Networks. 17 (5): 1316–1327. doi:10.1109/TNN.2006.879766. PMID 17001990. S2CID 221185563. Retrieved 17 November 2013.
  35. Ko, Tom; Peddinti, Vijayaditya; Povey, Daniel; Seltzer, Michael L.; Khudanpur, Sanjeev (March 2018). A Study on Data Augmentation of Reverberant Speech for Robust Speech Recognition (PDF). The 42nd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017). New Orleans, LA, USA.
  36. Denker, J S, Gardner, W R, Graf, H. P, Henderson, D, Howard, R E, Hubbard, W, Jackel, L D, Baird, H S, and Guyon (1989) Neural network recognizer for hand-written zip code digits, AT&T Bell Laboratories
  37. Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel, Backpropagation Applied to Handwritten Zip Code Recognition; AT&T Bell Laboratories
  38. LeCun, Yann; Léon Bottou; Yoshua Bengio; Patrick Haffner (1998). "Gradient-based learning applied to document recognition" (PDF). Proceedings of the IEEE. 86 (11): 2278–2324. CiteSeerX 10.1.1.32.9552. doi:10.1109/5.726791. S2CID 14542261. Retrieved October 7, 2016.
  39. Zhang, Wei (1991). "Error Back Propagation with Minimum-Entropy Weights: A Technique for Better Generalization of 2-D Shift-Invariant NNs". Proceedings of the International Joint Conference on Neural Networks.
  40. Zhang, Wei (1991). "Image processing of human corneal endothelium based on a learning network". Applied Optics. 30 (29): 4211–7. Bibcode:1991ApOpt..30.4211Z. doi:10.1364/AO.30.004211. PMID 20706526.
  41. Zhang, Wei (1994). "Computerized detection of clustered microcalcifications in digital mammograms using a shift-invariant artificial neural network". Medical Physics. 21 (4): 517–24. Bibcode:1994MedPh..21..517Z. doi:10.1118/1.597177. PMID 8058017.
  42. Daniel Graupe, Ruey Wen Liu, George S Moschytz."Applications of neural networks to medical signal processing". In Proc. 27th IEEE Decision and Control Conf., pp. 343–347, 1988.
  43. Daniel Graupe, Boris Vern, G. Gruener, Aaron Field, and Qiu Huang. "Decomposition of surface EMG signals into single fiber action potentials by means of neural network". Proc. IEEE International Symp. on Circuits and Systems, pp. 1008–1011, 1989.
  44. Qiu Huang, Daniel Graupe, Yi Fang Huang, Ruey Wen Liu."Identification of firing patterns of neuronal signals." In Proc. 28th IEEE Decision and Control Conf., pp. 266–271, 1989. https://ieeexplore.ieee.org/document/70115
  45. Behnke, Sven (2003). Hierarchical Neural Networks for Image Interpretation. Lecture Notes in Computer Science. 2766. Springer. doi:10.1007/b11963. ISBN 978-3-540-40722-5. https://www.ais.uni-bonn.de/books/LNCS2766.pdf. 
  46. Oh, KS; Jung, K (2004). "GPU implementation of neural networks". Pattern Recognition. 37 (6): 1311–1314. Bibcode:2004PatRe..37.1311O. doi:10.1016/j.patcog.2004.01.013.
  47. Dave Steinkraus; Patrice Simard; Ian Buck (2005). "Using GPUs for Machine Learning Algorithms". 12th International Conference on Document Analysis and Recognition (ICDAR 2005). pp. 1115–1119. doi:10.1109/ICDAR.2005.251.
  48. Kumar Chellapilla; Sid Puri; Patrice Simard (2006). "High Performance Convolutional Neural Networks for Document Processing". In Lorette, Guy. Tenth International Workshop on Frontiers in Handwriting Recognition. Suvisoft. https://hal.inria.fr/inria-00112631/document. 
  49. Hinton, GE; Osindero, S; Teh, YW (Jul 2006). "A fast learning algorithm for deep belief nets". Neural Computation. 18 (7): 1527–54. CiteSeerX 10.1.1.76.1541. doi:10.1162/neco.2006.18.7.1527. PMID 16764513. S2CID 2309950.
  50. Bengio, Yoshua; Lamblin, Pascal; Popovici, Dan; Larochelle, Hugo (2007). "Greedy Layer-Wise Training of Deep Networks" (PDF). Advances in Neural Information Processing Systems: 153–160.
  51. Ranzato, MarcAurelio; Poultney, Christopher; Chopra, Sumit; LeCun, Yann (2007). "Efficient Learning of Sparse Representations with an Energy-Based Model" (PDF). Advances in Neural Information Processing Systems.
  52. Raina, R; Madhavan, A; Ng, Andrew (2009). "Large-scale deep unsupervised learning using graphics processors" (PDF). ICML: 873–880.
  53. Ciresan, Dan; Meier, Ueli; Gambardella, Luca; Schmidhuber, Jürgen (2010). "Deep big simple neural nets for handwritten digit recognition". Neural Computation. 22 (12): 3207–3220. arXiv:1003.0358. doi:10.1162/NECO_a_00052. PMID 20858131. S2CID 1918673.
  54. "IJCNN 2011 Competition result table". OFFICIAL IJCNN2011 COMPETITION (in English). 2010. Retrieved 2019-01-14.
  55. Schmidhuber, Jürgen (17 March 2017). "History of computer vision contests won by deep CNNs on GPU" (in English). Retrieved 14 January 2019.
  56. Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E. (2017-05-24). "ImageNet classification with deep convolutional neural networks" (PDF). Communications of the ACM. 60 (6): 84–90. doi:10.1145/3065386. ISSN 0001-0782. S2CID 195908774.
  57. He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2016). "Deep Residual Learning for Image Recognition" (PDF). 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 770–778. arXiv:1512.03385. doi:10.1109/CVPR.2016.90. ISBN 978-1-4673-8851-1. S2CID 206594692.
  58. Viebke, Andre; Pllana, Sabri (2015). "The Potential of the Intel (R) Xeon Phi for Supervised Deep Learning". 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems. IEEE Xplore. IEEE 2015. pp. 758–765. doi:10.1109/HPCC-CSS-ICESS.2015.45. ISBN 978-1-4799-8937-9. S2CID 15411954.
  59. Viebke, Andre; Memeti, Suejb; Pllana, Sabri; Abraham, Ajith (2019). "CHAOS: a parallelization scheme for training convolutional neural networks on Intel Xeon Phi". The Journal of Supercomputing. 75 (1): 197–227. arXiv:1702.07908. doi:10.1007/s11227-017-1994-x. S2CID 14135321.
  60. Hinton, Geoffrey (2012). "ImageNet Classification with Deep Convolutional Neural Networks". NIPS'12: Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1. 1: 1097–1105 – via ACM.
  61. Azulay, Aharon; Weiss, Yair (2019). "Why do deep convolutional networks generalize so poorly to small image transformations?". Journal of Machine Learning Research. 20 (184): 1–25. ISSN 1533-7928.
  62. Géron, Aurélien (2019). Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow. Sebastopol, CA: O'Reilly Media. ISBN 978-1-492-03264-9. p. 448.
  63. "CS231n Convolutional Neural Networks for Visual Recognition". cs231n.github.io. Retrieved 2017-04-25.
  64. Scherer, Dominik; Müller, Andreas C.; Behnke, Sven (2010). "Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition" (PDF). Artificial Neural Networks (ICANN), 20th International Conference on. Thessaloniki, Greece: Springer. pp. 92–101.
  65. Graham, Benjamin (2014-12-18). "Fractional Max-Pooling". arXiv:1412.6071 [cs.CV].
  66. Springenberg, Jost Tobias; Dosovitskiy, Alexey; Brox, Thomas; Riedmiller, Martin (2014-12-21). "Striving for Simplicity: The All Convolutional Net". arXiv:1412.6806 [cs.LG].
  67. Grel, Tomasz (2017-02-28). "Region of interest pooling explained". deepsense.io (in English). Retrieved 5 April 2017.
  68. Girshick, Ross (2015-09-27). "Fast R-CNN". arXiv:1504.08083 [cs.CV].
  69. Ma, Zhanyu; Chang, Dongliang; Xie, Jiyang; Ding, Yifeng; Wen, Shaoguo; Li, Xiaoxu; Si, Zhongwei; Guo, Jun (2019). "Fine-Grained Vehicle Classification With Channel Max Pooling Modified CNNs". IEEE Transactions on Vehicular Technology. Institute of Electrical and Electronics Engineers (IEEE). 68 (4): 3224–3233. doi:10.1109/tvt.2019.2899972. ISSN 0018-9545.
  70. Romanuke, Vadim (2017). "Appropriate number and allocation of ReLUs in convolutional neural networks". Research Bulletin of NTUU "Kyiv Polytechnic Institute". 1: 69–78. doi:10.20535/1810-0546.2017.1.88156.
  71. Krizhevsky, A.; Sutskever, I.; Hinton, G. E. (2012). "Imagenet classification with deep convolutional neural networks" (PDF). Advances in Neural Information Processing Systems. 1: 1097–1105.
  72. "6.3. Padding and Stride — Dive into Deep Learning 0.17.0 documentation". d2l.ai. Retrieved 2021-08-12.
  73. Deshpande, Adit. "The 9 Deep Learning Papers You Need To Know About (Understanding CNNs Part 3)". adeshpande3.github.io. Retrieved 2018-12-04.
  74. Seo, Jae Duk (2018-03-12). "Understanding 2D Dilated Convolution Operation with Examples in Numpy and Tensorflow with…". Medium (in English). Retrieved 2021-08-12.
  75. Ribeiro, Antonio; Schon, Thomas (2021). "How Convolutional Neural Networks Deal with Aliasing". IEEE International Conference on Acoustics, Speech and Signal Processing: 2755–2759. arXiv:2102.07757. doi:10.1109/ICASSP39728.2021.9414627. ISBN 978-1-7281-7605-5. S2CID 231925012.
  76. Myburgh, Johannes C.; Mouton, Coenraad; Davel, Marelie H. (2020). Gerber, Aurona (ed.). "Tracking Translation Invariance in CNNs". Artificial Intelligence Research. Communications in Computer and Information Science (in English). Cham: Springer International Publishing. 1342: 282–295. arXiv:2104.05997. doi:10.1007/978-3-030-66151-9_18. ISBN 978-3-030-66151-9. S2CID 233219976.
  77. Zhang, Richard (2019-04-25). Making Convolutional Networks Shift-Invariant Again. OCLC 1106340711. https://www.worldcat.org/oclc/1106340711.
  78. Jaderberg, Max; Simonyan, Karen; Zisserman, Andrew; Kavukcuoglu, Koray (2015). "Spatial Transformer Networks" (PDF). Advances in Neural Information Processing Systems. 28 – via NIPS.
  79. Sabour, Sara; Frosst, Nicholas; Hinton, Geoffrey E. (2017-10-26). Dynamic Routing Between Capsules. OCLC 1106278545. https://worldcat.org/oclc/1106278545.
  80. Matiz, Sergio; Barner, Kenneth E. (2019-06-01). "Inductive conformal predictor for convolutional neural networks: Applications to active learning for image classification". Pattern Recognition (in English). 90: 172–182. Bibcode:2019PatRe..90..172M. doi:10.1016/j.patcog.2019.01.035. ISSN 0031-3203. S2CID 127253432.
  81. Wieslander, Håkan; Harrison, Philip J.; Skogberg, Gabriel; Jackson, Sonya; Fridén, Markus; Karlsson, Johan; Spjuth, Ola; Wählby, Carolina (February 2021). "Deep Learning With Conformal Prediction for Hierarchical Analysis of Large-Scale Whole-Slide Tissue Images". IEEE Journal of Biomedical and Health Informatics. 25 (2): 371–380. doi:10.1109/JBHI.2020.2996300. ISSN 2168-2208. PMID 32750907. S2CID 219885788.
  82. Srivastava, Nitish; Hinton, Geoffrey; Krizhevsky, Alex; Sutskever, Ilya; Salakhutdinov, Ruslan (2014). "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" (PDF). Journal of Machine Learning Research. 15 (1): 1929–1958.
  83. Carlos E. Perez. "A Pattern Language for Deep Learning".
  84. "Regularization of Neural Networks using DropConnect | ICML 2013 | JMLR W&CP". jmlr.org: 1058–1066. 2013-02-13. Retrieved 2015-12-17.
  85. Zeiler, Matthew D.; Fergus, Rob (2013-01-15). "Stochastic Pooling for Regularization of Deep Convolutional Neural Networks". arXiv:1301.3557 [cs.LG].
  86. Platt, John; Steinkraus, Dave; Simard, Patrice Y. (August 2003). "Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis – Microsoft Research". Microsoft Research. Retrieved 2015-12-17.
  87. Hinton, Geoffrey E.; Srivastava, Nitish; Krizhevsky, Alex; Sutskever, Ilya; Salakhutdinov, Ruslan R. (2012). "Improving neural networks by preventing co-adaptation of feature detectors". arXiv:1207.0580 [cs.NE].
  88. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". jmlr.org. Retrieved 2015-12-17.
  89. Hinton, Geoffrey (1979). "Some demonstrations of the effects of structural descriptions in mental imagery". Cognitive Science. 3 (3): 231–250. doi:10.1016/s0364-0213(79)80008-7.
  90. Rock, Irvin. "The frame of reference." The legacy of Solomon Asch: Essays in cognition and social psychology (1990): 243–268.
  91. G. Hinton, Coursera lectures on Neural Networks, 2012, URL: https://www.coursera.org/learn/neural-networks. Archived at the Internet Archive, 2016-12-31.
  92. Dave Gershgorn (18 June 2018). "The inside story of how AI got good enough to dominate Silicon Valley". Quartz. Retrieved 5 October 2018.
  93. Lawrence, Steve; C. Lee Giles; Ah Chung Tsoi; Andrew D. Back (1997). "Face Recognition: A Convolutional Neural Network Approach". IEEE Transactions on Neural Networks. 8 (1): 98–113. CiteSeerX 10.1.1.92.5813. doi:10.1109/72.554195. PMID 18255614.
  94. "ImageNet Large Scale Visual Recognition Competition 2014 (ILSVRC2014)". Retrieved 30 January 2016.
  95. Szegedy, Christian; Liu, Wei; Jia, Yangqing; Sermanet, Pierre; Reed, Scott E.; Anguelov, Dragomir; Erhan, Dumitru; Vanhoucke, Vincent; Rabinovich, Andrew (2015). "Going deeper with convolutions". IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7–12, 2015. IEEE Computer Society. pp. 1–9. arXiv:1409.4842. doi:10.1109/CVPR.2015.7298594.
  96. Russakovsky, Olga; Deng, Jia; Su, Hao; Krause, Jonathan; Satheesh, Sanjeev; Ma, Sean; Huang, Zhiheng; Karpathy, Andrej; Khosla, Aditya; Bernstein, Michael; Berg, Alexander C.; Fei-Fei, Li (2014). "Image Net Large Scale Visual Recognition Challenge". arXiv:1409.0575 [cs.CV].
  97. "The Face Detection Algorithm Set To Revolutionize Image Search". Technology Review. February 16, 2015. Retrieved 27 October 2017.
  98. Baccouche, Moez; Mamalet, Franck; Wolf, Christian; Garcia, Christophe; Baskurt, Atilla (2011-11-16). "Sequential Deep Learning for Human Action Recognition". In Salah, Albert Ali. Human Behavior Understanding. Lecture Notes in Computer Science. 7065. Springer Berlin Heidelberg. pp. 29–39. doi:10.1007/978-3-642-25446-8_4. ISBN 978-3-642-25445-1.
  99. Ji, Shuiwang; Xu, Wei; Yang, Ming; Yu, Kai (2013-01-01). "3D Convolutional Neural Networks for Human Action Recognition". IEEE Transactions on Pattern Analysis and Machine Intelligence. 35 (1): 221–231. CiteSeerX 10.1.1.169.4046. doi:10.1109/TPAMI.2012.59. ISSN 0162-8828. PMID 22392705. S2CID 1923924.
  100. Huang, Jie; Zhou, Wengang; Zhang, Qilin; Li, Houqiang; Li, Weiping (2018). "Video-based Sign Language Recognition without Temporal Segmentation". arXiv:1801.10111 [cs.CV].
  101. Karpathy, Andrej, et al. "Large-scale video classification with convolutional neural networks." IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2014.
  102. Simonyan, Karen; Zisserman, Andrew (2014). "Two-Stream Convolutional Networks for Action Recognition in Videos". arXiv:1406.2199 [cs.CV]. (2014).
  103. Wang, Le; Duan, Xuhuan; Zhang, Qilin; Niu, Zhenxing; Hua, Gang; Zheng, Nanning (2018-05-22). "Segment-Tube: Spatio-Temporal Action Localization in Untrimmed Videos with Per-Frame Segmentation" (PDF). Sensors. 18 (5): 1657. Bibcode:2018Senso..18.1657W. doi:10.3390/s18051657. ISSN 1424-8220. PMC 5982167. PMID 29789447.
  104. Duan, Xuhuan; Wang, Le; Zhai, Changbo; Zheng, Nanning; Zhang, Qilin; Niu, Zhenxing; Hua, Gang (2018). "Joint Spatio-Temporal Action Localization in Untrimmed Videos with Per-Frame Segmentation". 2018 25th IEEE International Conference on Image Processing (ICIP). 25th IEEE International Conference on Image Processing (ICIP). pp. 918–922. doi:10.1109/icip.2018.8451692. ISBN 978-1-4799-7061-2.
  105. Taylor, Graham W.; Fergus, Rob; LeCun, Yann; Bregler, Christoph (2010-01-01). Convolutional Learning of Spatio-temporal Features. Proceedings of the 11th European Conference on Computer Vision: Part VI. ECCV'10. Berlin, Heidelberg: Springer-Verlag. pp. 140–153. ISBN 978-3-642-15566-6.
  106. Le, Q. V.; Zou, W. Y.; Yeung, S. Y.; Ng, A. Y. (2011-01-01). Learning Hierarchical Invariant Spatio-temporal Features for Action Recognition with Independent Subspace Analysis. CVPR '11. Washington, DC, USA: IEEE Computer Society. pp. 3361–3368. doi:10.1109/CVPR.2011.5995496. ISBN 978-1-4577-0394-2. 
  107. Grefenstette, Edward; Blunsom, Phil; de Freitas, Nando; Hermann, Karl Moritz (2014-04-29). "A Deep Architecture for Semantic Parsing". arXiv:1404.7296 [cs.CL].
  108. Mesnil, Gregoire; Deng, Li; Gao, Jianfeng; He, Xiaodong; Shen, Yelong (April 2014). "Learning Semantic Representations Using Convolutional Neural Networks for Web Search – Microsoft Research". Microsoft Research. Retrieved 2015-12-17.
  109. Kalchbrenner, Nal; Grefenstette, Edward; Blunsom, Phil (2014-04-08). "A Convolutional Neural Network for Modelling Sentences". arXiv:1404.2188 [cs.CL].
  110. Kim, Yoon (2014-08-25). "Convolutional Neural Networks for Sentence Classification". arXiv:1408.5882 [cs.CL].
  111. Collobert, Ronan; Weston, Jason (2008). "A unified architecture for natural language processing: Deep neural networks with multitask learning". Proceedings of the 25th International Conference on Machine Learning. ACM.
  112. Collobert, Ronan; Weston, Jason; Bottou, Leon; Karlen, Michael; Kavukcuoglu, Koray; Kuksa, Pavel (2011-03-02). "Natural Language Processing (almost) from Scratch". arXiv:1103.0398 [cs.LG].
  113. Yin, W; Kann, K; Yu, M; Schütze, H (2017-03-02). "Comparative study of CNN and RNN for natural language processing". arXiv:1702.01923 [cs.LG].
  114. Bai, S.; Kolter, J.S.; Koltun, V. (2018). "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling". arXiv:1803.01271 [cs.LG].
  115. Gruber, N. (2021). "Detecting dynamics of action in text with a recurrent neural network". Neural Computing and Applications. 33 (12): 15709–15718. doi:10.1007/S00521-021-06190-5. S2CID 236307579.
  116. Haotian, J.; Zhong, Li; Qianxiao, Li (2021). "Approximation Theory of Convolutional Architectures for Time Series Modelling". International Conference on Machine Learning. arXiv:2107.09355.
  117. Ren, Hansheng; Xu, Bixiong; Wang, Yujing; Yi, Chao; Huang, Congrui; Kou, Xiaoyu; Xing, Tony; Yang, Mao; Tong, Jie; Zhang, Qi (2019). "Time-Series Anomaly Detection Service at Microsoft". Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. arXiv:1906.03821. doi:10.1145/3292500.3330680. S2CID 182952311.
  118. Wallach, Izhar; Dzamba, Michael; Heifets, Abraham (2015-10-09). "AtomNet: A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery". arXiv:1510.02855 [cs.LG].
  119. Yosinski, Jason; Clune, Jeff; Nguyen, Anh; Fuchs, Thomas; Lipson, Hod (2015-06-22). "Understanding Neural Networks Through Deep Visualization". arXiv:1506.06579 [cs.CV].
  120. "Toronto startup has a faster way to discover effective medicines". The Globe and Mail. Retrieved 2015-11-09.
  121. "Startup Harnesses Supercomputers to Seek Cures". KQED Future of You. 2015-05-27. Retrieved 2015-11-09.
  122. Tim Pyrkov; Konstantin Slipensky; Mikhail Barg; Alexey Kondrashin; Boris Zhurov; Alexander Zenin; Mikhail Pyatnitskiy; Leonid Menshikov; Sergei Markov; Peter O. Fedichev (2018). "Extracting biological age from biomedical data via deep learning: too much of a good thing?". Scientific Reports. 8 (1): 5210. Bibcode:2018NatSR...8.5210P. doi:10.1038/s41598-018-23534-9. PMC 5980076. PMID 29581467.
  123. Chellapilla, K; Fogel, DB (1999). "Evolving neural networks to play checkers without relying on expert knowledge". IEEE Trans Neural Netw. 10 (6): 1382–91. doi:10.1109/72.809083. PMID 18252639.
  124. Chellapilla, K.; Fogel, D.B. (2001). "Evolving an expert checkers playing program without using human expertise". IEEE Transactions on Evolutionary Computation. 5 (4): 422–428. doi:10.1109/4235.942536.
  125. Fogel, David (2001). Blondie24: Playing at the Edge of AI. San Francisco, CA: Morgan Kaufmann. ISBN 978-1558607835. 
  126. Clark, Christopher; Storkey, Amos (2014). "Teaching Deep Convolutional Neural Networks to Play Go". arXiv:1412.3409 [cs.AI].
  127. Maddison, Chris J.; Huang, Aja; Sutskever, Ilya; Silver, David (2014). "Move Evaluation in Go Using Deep Convolutional Neural Networks". arXiv:1412.6564 [cs.LG].
  128. "AlphaGo – Google DeepMind". Archived from the original on 30 January 2016. Retrieved 30 January 2016.
  129. Bai, Shaojie; Kolter, J. Zico; Koltun, Vladlen (2018-04-19). "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling". arXiv:1803.01271 [cs.LG].
  130. Yu, Fisher; Koltun, Vladlen (2016-04-30). "Multi-Scale Context Aggregation by Dilated Convolutions". arXiv:1511.07122 [cs.CV].
  131. Borovykh, Anastasia; Bohte, Sander; Oosterlee, Cornelis W. (2018-09-17). "Conditional Time Series Forecasting with Convolutional Neural Networks". arXiv:1703.04691 [stat.ML].
  132. Mittelman, Roni (2015-08-03). "Time-series modeling with undecimated fully convolutional neural networks". arXiv:1508.00317 [stat.ML].
  133. Chen, Yitian; Kang, Yanfei; Chen, Yixiong; Wang, Zizhuo (2019-06-11). "Probabilistic Forecasting with Temporal Convolutional Neural Network". arXiv:1906.04397 [stat.ML].
  134. Zhao, Bendong; Lu, Huanzhang; Chen, Shangfeng; Liu, Junliang; Wu, Dongya (2017-02-01). "Convolutional neural networks for time series classi". Journal of Systems Engineering and Electronics. 28 (1): 162–169. doi:10.21629/JSEE.2017.01.18.
  135. Petneházi, Gábor (2019-08-21). "QCNN: Quantile Convolutional Neural Network". arXiv:1908.07978 [cs.LG].
  136. Hubert Mara (2019-06-07), HeiCuBeDa Hilprecht – Heidelberg Cuneiform Benchmark Dataset for the Hilprecht Collection (in German), heiDATA – institutional repository for research data of Heidelberg University, doi:10.11588/data/IE8CCN
  137. Hubert Mara and Bartosz Bogacz (2019), "Breaking the Code on Broken Tablets: The Learning Challenge for Annotated Cuneiform Script in Normalized 2D and 3D Datasets", Proceedings of the 15th International Conference on Document Analysis and Recognition (ICDAR) (in German), Sydney, Australia, pp. 148–153, doi:10.1109/ICDAR.2019.00032, ISBN 978-1-7281-3014-9, S2CID 211026941
  138. Bogacz, Bartosz; Mara, Hubert (2020), "Period Classification of 3D Cuneiform Tablets with Geometric Neural Networks", Proceedings of the 17th International Conference on Frontiers of Handwriting Recognition (ICFHR), Dortmund, Germany
  139. Presentation of the ICFHR paper on Period Classification of 3D Cuneiform Tablets with Geometric Neural Networks on YouTube
  140. Durjoy Sen Maitra; Ujjwal Bhattacharya; S.K. Parui (2015). "CNN based common approach to handwritten character recognition of multiple scripts". 13th International Conference on Document Analysis and Recognition (ICDAR), 23–26 August 2015. pp. 1021–1025.
  141. "NIPS 2017". Interpretable ML Symposium. 2017-10-20. Retrieved 2018-09-12.
  142. Zang, Jinliang; Wang, Le; Liu, Ziyi; Zhang, Qilin; Hua, Gang; Zheng, Nanning (2018). "Attention-Based Temporal Weighted Convolutional Neural Network for Action Recognition". IFIP Advances in Information and Communication Technology. Cham: Springer International Publishing. pp. 97–108. arXiv:1803.07179. doi:10.1007/978-3-319-92007-8_9. ISBN 978-3-319-92006-1. ISSN 1868-4238. 
  143. Wang, Le; Zang, Jinliang; Zhang, Qilin; Niu, Zhenxing; Hua, Gang; Zheng, Nanning (2018-06-21). "Action Recognition by an Attention-Aware Temporal Weighted Convolutional Neural Network" (PDF). Sensors. 18 (7): 1979. Bibcode:2018Senso..18.1979W. doi:10.3390/s18071979. ISSN 1424-8220. PMC 6069475. PMID 29933555.
  144. Ong, Hao Yi; Chavez, Kevin; Hong, Augustus (2015-08-18). "Distributed Deep Q-Learning". arXiv:1508.04186v2 [cs.LG].
  145. Mnih, Volodymyr; et al. (2015). "Human-level control through deep reinforcement learning". Nature. 518 (7540): 529–533. Bibcode:2015Natur.518..529M. doi:10.1038/nature14236. PMID 25719670. S2CID 205242740.
  146. Sun, R.; Sessions, C. (June 2000). "Self-segmentation of sequences: automatic formation of hierarchies of sequential behaviors". IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics. 30 (3): 403–418. CiteSeerX 10.1.1.11.226. doi:10.1109/3477.846230. ISSN 1083-4419. PMID 18252373.
  147. "Convolutional Deep Belief Networks on CIFAR-10" (PDF).
  148. Lee, Honglak; Grosse, Roger; Ranganath, Rajesh; Ng, Andrew Y. (1 January 2009). Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations. ACM. pp. 609–616. doi:10.1145/1553374.1553453. ISBN 9781605585161. 
  149. Cade Metz (May 18, 2016). "Google Built Its Very Own Chips to Power Its AI Bots". Wired.

External links

  • CS231n: Convolutional Neural Networks for Visual Recognition — Andrej Karpathy's Stanford computer science course on CNNs in computer vision
  • An Intuitive Explanation of Convolutional Neural Networks — A beginner level introduction to what Convolutional Neural Networks are and how they work
  • Convolutional Neural Networks for Image Classification — Literature Survey

Category:Artificial neural networks Category:Computer vision Category:Computational neuroscience Category:Machine learning

This page was moved from wikipedia:en:Convolutional neural network. Its edit history can be viewed at 卷积神经网络/edithistory