Deep Networks : Overview

Overview

In the previous sections, you constructed a 3-layer neural network comprising an input, hidden and output layer. While fairly effective for MNIST, this 3-layer model is a fairly shallow network; by this, we mean that the features (hidden layer activations a⁽²⁾) are computed using only "one layer" of computation (the hidden layer).

In this section, we begin to discuss deep neural networks, meaning ones in which we have multiple hidden layers; this will allow us to compute much more complex features of the input. Because each hidden layer computes a non-linear transformation of the previous layer, a deep network can have significantly greater representational power (i.e., can learn significantly more complex functions) than a shallow one.

Note that when training a deep network, it is important to use a non-linear activation function in each hidden layer. This is because multiple layers of linear functions would itself compute only a linear function of the input (i.e., composing multiple linear functions together results in just another linear function), and thus be no more expressive than using just a single layer of hidden units.

Advantages of deep networks

The primary advantage is that it can compactly represent a significantly larger set of fuctions than shallow networks. Formally, one can show that there are functions which a k-layer network can represent compactly (with a number of hidden units that is polynomial in the number of inputs), that a (k − 1)-layer network cannot represent unless it has an exponentially large number of hidden units.

By using a deep network, in the case of images, one can also start to learn part-whole decompositions. For example, the first layer might learn to group together pixels in an image in order to detect edges (as seen in the earlier exercises). The second layer might then group together edges to detect longer contours, or perhaps detect simple "parts of objects." An even deeper layer might then group together these contours or detect even more complex features.

Finally, cortical computations (in the brain) also have multiple layers of processing. For example, visual images are processed in multiple stages by the brain, by cortical area "V1", followed by cortical area "V2" (a different part of the brain), and so on.

Difficulty of training deep architectures

The main learning algorithm that researchers were using was to randomly initialize the weights of a deep network, and then train it using a labeled training set using a supervised learning objective, for example by applying gradient descent to try to drive down the training error. However, this usually did not work well. There were several reasons for this.

Availability of data

Local optima

Diffusion of gradients

when using backpropagation to compute the derivatives, the gradients that are propagated backwards (from the output layer to the earlier layers of the network) rapidly diminish in magnitude as the depth of the network increases. As a result, the derivative of the overall cost with respect to the weights in the earlier layers is very small.（深度神经网络的前几层） Thus, when using gradient descent, the weights of the earlier layers change slowly, and the earlier layers fail to learn much. This problem is often called the "diffusion of gradients."

A closely related problem to the diffusion of gradients is that if the last few layers in a neural network have a large enough number of neurons, it may be possible for them to model the labeled data alone without the help of the earlier layers. Hence, training the entire network at once with all the layers randomly initialized ends up giving similar performance to training a shallow network (the last few layers) on corrupted input (the result of the processing done by the earlier layers).

Greedy layer-wise training

the main idea is to train the layers of the network one at a time, so that we first train a network with 1 hidden layer, and only after that is done, train a network with 2 hidden layers, and so on. At each step, we take the old network with k − 1 hidden layers, and add an additional k-th hidden layer (that takes as input the previous hidden layer k − 1 that we had just trained). Training can either be supervised (say, with classification error as the objective function on each step), but more frequently it is unsupervised (as in an autoencoder; details to provided later). The weights from training the layers individually are then used to initialize the weights in the final/overall deep network, and only then is the entire architecture "fine-tuned" (i.e., trained together to optimize the labeled training set error).

Availability of data

While labeled data can be expensive to obtain, unlabeled data is cheap and plentiful. The promise of self-taught learning is that by exploiting the massive amount of unlabeled data, we can learn much better models. By using unlabeled data to learn a good initial value for the weights in all the layers (except for the final classification layer that maps to the outputs/predictions), our algorithm is able to learn and discover patterns from massively more amounts of data than purely supervised approaches. This often results in much better classifiers being learned.

Better local optima

After having trained the network on the unlabeled data, the weights are now starting at a better location in parameter space than if they had been randomly initialized. We can then further fine-tune the weights starting from this location. Empirically, it turns out that gradient descent from this location is much more likely to lead to a good local minimum, because the unlabeled data has already provided a significant amount of "prior" information about what patterns there are in the input data.

时间： 2024-10-12 12:55:25

Deep Networks : Overview的相关文章

【DeepLearning】Exercise: Implement deep networks for digit classification

Exercise: Implement deep networks for digit classification 习题链接:Exercise: Implement deep networks for digit classification stackedAEPredict.m function [pred] = stackedAEPredict(theta, inputSize, hiddenSize, numClasses, netconfig, data) % stackedAEPre

Initialization of deep networks

Initialization of deep networks 24 Feb 2015Gustav Larsson As we all know, the solution to a non-convex optimization algorithm (like stochastic gradient descent) depends on the initial values of the parameters. This post is about choosing initializati

Note_Automatic Water-Body Segmentation From High-Resolution Satellite Images via Deep Networks

基本信息二区文章,水域分割 Automatic Water-Body Segmentation From High-Resolution Satellite Images via Deep Networks 笔记出发点水域分割是遥感的基本任务. 传统的方法依赖光谱,只能处理分辨率低图像.而分辨率强的图片,包含更多细节. 不同数据传感器获得数据,方法的鲁棒性得到考验. 主要的创新点提出一个新的分割网络RRF DeconvNet 网络(restricted receptive field d

Deep Learning Overview

[Ref: http://en.wikipedia.org/wiki/Deep_learning] Definition: a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using model architectures, with complex structures or otherwise, composed

【转帖】UFLDL Tutorial（the main ideas of Unsupervised Feature Learning and Deep Learning）

UFLDL Tutorial From Ufldl Jump to: navigation, search Description: This tutorial will teach you the main ideas of Unsupervised Feature Learning and Deep Learning. By working through it, you will also get to implement several feature learning/deep le

【转载】A Brief Overview of Deep Learning

A Brief Overview of Deep Learning (This is a guest post by Ilya Sutskever on the intuition behind deep learning as well as some very useful practical advice. Many thanks to Ilya for such a heroic effort!) Deep Learning is really popular these days. B

A Brief Overview of Deep Learning

Training Deep Neural Networks

http://handong1587.github.io/deep_learning/2015/10/09/training-dnn.html //转载于 Training Deep Neural Networks Published: 09 Oct 2015 Category: deep_learning Tutorials Popular Training Approaches of DNNs?—?A Quick Overview https://medium.com/@asjad/po

Deep Neural Networks的Tricks

Here we will introduce these extensive implementation details, i.e., tricks or tips, for building and training your own deep networks. 主要以下面八个部分展开介绍: mainly in eight aspects: 1) data augmentation; 2) pre-processing on images; 3) initializations of Ne