Initialization of deep networks

24 Feb 2015 · Gustav Larsson

As we all know, the solution found by a non-convex optimization algorithm (such as stochastic gradient descent) depends on the initial values of the parameters. This post is about choosing those initial values for deep networks and how the choice affects convergence. We will also discuss the related topic of vanishing gradients.

First, let's go back to the time of sigmoidal activation functions and initialization of parameters using IID Gaussian or uniform distributions with fairly arbitrarily set variances. Building deep networks was difficult because of exploding or vanishing activations and gradients. Let's take activations first: if all your parameters are too small, the variance of your activations will drop in each layer. This is a problem if your activation function is sigmoidal, since it is approximately linear close to 0. That is, you gradually lose your non-linearity, which means there is no benefit to having multiple layers. If, on the other hand, your activations grow larger and larger, they will saturate and become meaningless, with gradients approaching 0.
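
To make this concrete, here is a throwaway NumPy sketch (layer widths, batch size and weight scales are made up purely for illustration) that pushes a random batch through a stack of sigmoid layers and prints how the activation spread and the mean sigmoid derivative evolve:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run(weight_std, n_layers=8, width=256, batch=1000):
    """Push a zero-mean random batch through a stack of sigmoid layers
    whose weights are IID Gaussian with the given standard deviation."""
    x = rng.standard_normal((batch, width))
    for layer in range(n_layers):
        W = rng.normal(0.0, weight_std, size=(width, width))
        z = x @ W                       # pre-activations
        a = sigmoid(z)
        grad = (a * (1 - a)).mean()     # mean sigmoid derivative
        x = a - a.mean()                # re-center, matching the zero-mean assumption below
        print(f"w_std={weight_std:<5} layer {layer + 1}  "
              f"activation std {x.std():.2e}  mean sigmoid' {grad:.3f}")

run(0.01)  # too small: the activation spread collapses, the network turns effectively linear
run(1.0)   # too large: units saturate and the sigmoid derivative heads toward 0
```

With the small weights the signal shrinks by a constant factor per layer; with the large weights the units spend their time in the flat tails of the sigmoid.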

Let us consider one layer and forget about the bias. Note that the following analysis and conclusion is taken from Glorot and Bengio[1]. Consider a weight matrix \(W \in \mathbb{R}^{m \times n}\), where each element was drawn from an IID Gaussian with variance Var(W). Note that we are abusing notation a bit by letting W denote both a matrix and a univariate random variable. We also assume there is no correlation between our input and our weights, and that both are zero-mean. If we consider one filter (row) in W, say w (a random vector), then the variance of the output signal over the input signal is:

$$
\frac{\operatorname{Var}(w^\top x)}{\operatorname{Var}(X)}
= \frac{\sum_{i=1}^{n} \operatorname{Var}(w_i x_i)}{\operatorname{Var}(X)}
= \frac{n \operatorname{Var}(W) \operatorname{Var}(X)}{\operatorname{Var}(X)}
= n \operatorname{Var}(W)
$$
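
A quick Monte Carlo sanity check of this identity (a disposable NumPy snippet; the values of n, Var(W) and Var(X) are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

n, var_w, var_x, trials = 256, 0.02, 1.7, 20_000

# Fresh IID zero-mean weights and inputs for every trial, independent of each other.
w = rng.normal(0.0, np.sqrt(var_w), size=(trials, n))
x = rng.normal(0.0, np.sqrt(var_x), size=(trials, n))
out = (w * x).sum(axis=1)              # w^T x for every trial

print("empirical  Var(w^T x) / Var(X):", out.var() / var_x)
print("predicted  n Var(W)           :", n * var_w)
```

Both numbers should come out around 5.1 here.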

As we build a deep network, we want the variance of the signal going forward in the network to remain the same, thus it would be advantageous if n Var(W) = 1. The same argument can be made for the gradients, the signal going backward in the network, and the conclusion is that we would also like m Var(W) = 1. Unless n = m, it is impossible to satisfy both of these conditions. In practice, it works well if both are approximately satisfied. One thing that has never been clear to me is why it is only necessary to satisfy these conditions when picking the initialization values of W. It would seem that we have no guarantee that the conditions will remain true as the network is trained.
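
The compromise Glorot and Bengio propose is to split the difference and use Var(W) = 2/(n + m). A minimal NumPy rendering of that rule (my own sketch, not code from the paper; the uniform variant uses the equivalent limit sqrt(6/(n + m))):

```python
import numpy as np

rng = np.random.default_rng(2)

def xavier_normal(fan_in, fan_out):
    """Gaussian Xavier/Glorot initialization: Var(W) = 2 / (fan_in + fan_out)."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_out, fan_in))

def xavier_uniform(fan_in, fan_out):
    """Uniform variant with the same variance: U(-limit, limit), limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

W = xavier_normal(fan_in=1024, fan_out=512)
print(W.var(), 2.0 / (1024 + 512))   # the two numbers should roughly agree
```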

Nevertheless, this Xavier initialization (after Glorot's first name) is a neat trick that works well in practice. However, along came rectified linear units (ReLU), a non-linearity that is scale-invariant around 0 and does not saturate at large input values. This seemingly solved both of the problems the sigmoid function had; or were they just alleviated? I am unsure of how widely used Xavier initialization is, but if it is not widely used, perhaps that is because ReLU seemingly eliminated this problem.

However, take the most competitive network of late, VGG[2]. They do not use this kind of initialization, and they report that it was tricky to get their networks to converge. They say that they first trained their most shallow architecture and then used that to help initialize the second one, and so forth. They presented 6 networks, so it seems like an awfully complicated training process to get to the deepest one.

A recent paper by He et al.[3] presents a pretty straightforward generalization of ReLU and Leaky ReLU. What is more interesting is their emphasis on the benefits of Xavier initialization even for ReLU. They redid the derivations for ReLUs and discovered that the conditions were the same up to a factor of 2. The difficulty Simonyan and Zisserman had training VGG is apparently avoidable, simply by using Xavier initialization (or better yet, the ReLU-adjusted version). Using this technique, He et al. reportedly trained a whopping 30-layer deep network to convergence in one go.
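
For ReLU the forward condition becomes n Var(W) = 2, the extra factor 2 compensating for ReLU zeroing out half of its inputs. Here is my own minimal NumPy sketch of that rule (not the authors' code; sizes are arbitrary), together with a comparison against the unadjusted condition n Var(W) = 1:

```python
import numpy as np

rng = np.random.default_rng(3)

def he_normal(fan_in, fan_out):
    """ReLU-adjusted initialization from He et al.: Var(W) = 2 / fan_in."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

def relu(z):
    return np.maximum(z, 0.0)

# With the ReLU-adjusted rule the signal keeps roughly the same energy per layer.
x = rng.standard_normal((1000, 512))
for _ in range(20):
    x = relu(x @ he_normal(512, 512).T)
print("activation variance, ReLU-adjusted init:", x.var())

# With the unadjusted forward condition (Var(W) = 1 / fan_in) the signal
# roughly halves in energy at every ReLU layer and all but dies out.
x = rng.standard_normal((1000, 512))
for _ in range(20):
    W = rng.normal(0.0, np.sqrt(1.0 / 512), size=(512, 512))
    x = relu(x @ W.T)
print("activation variance, unadjusted init   :", x.var())
```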

Another recent paper tackling the signal scaling problem is by Ioffe and Szegedy[4]. They call the change in scale internal covariate shift and claim it forces learning rates to be unnecessarily small. They suggest that if all layers have the same scale and remain so throughout training, a much higher learning rate becomes practically viable. You cannot just standardize the signals, since you would lose expressive power (the bias disappears, and in the case of sigmoids we would be constrained to the linear regime). They solve this by re-introducing two parameters per activation, a scale and a bias, applied after the standardization. The training reportedly becomes about 6 times faster and they present state-of-the-art results on ImageNet. However, I'm not certain this is the solution that will stick.
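
The core of the idea fits in a few lines. Here is my own sketch of just the forward pass for one layer (gamma is the re-introduced scale and beta the bias), not the paper's reference implementation, and it ignores the running statistics needed at test time:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Standardize each feature over the mini-batch, then let the learnable
    gamma (scale) and beta (bias) restore the lost expressive power."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(4)
x = rng.normal(3.0, 5.0, size=(64, 10))                   # a badly scaled mini-batch
y = batch_norm_forward(x, gamma=np.ones(10), beta=np.zeros(10))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))    # roughly 0 and 1 per feature
```

Since gamma and beta are learned along with the weights, the network can undo the standardization wherever that turns out to be useful.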

I reckon we will see a lot more work on this frontier in the next few years. Especially since it also relates to the Recurrent Neural Network (RNN), wildly popular right now, which connects output signals back as inputs. The way you train such a network is that you unroll the time axis, treating the result as an extremely deep feedforward network. This greatly exacerbates the vanishing gradient problem. A popular solution, called Long Short-Term Memory (LSTM), is to introduce memory cells, which are a type of teleport that allows a signal to jump ahead many time steps. This means that the gradient is retained for all those time steps and can be propagated back to a much earlier time without vanishing.
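
As a toy illustration of why the unrolled chain is so fragile, here is a scalar "RNN" sketch (the recurrent weight and the constant input are picked arbitrarily) that simply multiplies together the per-step gradient factors of backpropagation through time:

```python
import numpy as np

# Backpropagating through T unrolled steps multiplies together T factors of
# w * tanh'(pre-activation); if each factor is below 1, the gradient vanishes.
w = 0.9                        # recurrent weight (arbitrary)
h, grad = 0.0, 1.0
for t in range(1, 51):
    h = np.tanh(w * h + 0.5)   # constant input of 0.5, just for illustration
    grad *= w * (1.0 - h**2)   # d h_t / d h_{t-1}
    if t % 10 == 0:
        print(f"after {t} steps, gradient factor ~ {grad:.1e}")
```

Fifty time steps is nothing unusual for an RNN, yet the factor is already vanishingly small, which is exactly the gap that LSTM's memory cells are meant to bridge.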

This area is far from solved, and until it is I think I will be sticking to Xavier initialization. If you are using Caffe, the one take-away of this post is to use the following on all your layers:

weight_filler {
    type: "xavier"   # Caffe's xavier filler: uniform weights scaled by the layer's fan-in
}

References

  1. X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
  2. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  3. K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” arXiv preprint arXiv:1502.01852, 2015.
  4. S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
