Sparse Autoencoder (Part 1)

Neural Networks

We will use the following diagram to denote a single neuron:

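This "neuron" is a computational unit that takes as input x₁,x₂,x₃ (and a +1 intercept term), and outputs $h_{W,b}(x) = f(W^T x) = f\left(\sum_{i=1}^{3} W_i x_i + b\right)$, where $f : \mathbb{R} \to \mathbb{R}$ is called the activation function. In these notes, we will choose $f(\cdot)$ to be the sigmoid function:

$$f(z) = \frac{1}{1 + \exp(-z)}$$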

Thus, our single neuron corresponds exactly to the input-output mapping defined by logistic regression.

Although these notes will use the sigmoid function, it is worth noting that another common choice for f is the hyperbolic tangent, or tanh, function:

$$f(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$

Here are plots of the sigmoid and tanh functions:

Finally, one identity that'll be useful later: if f(z) = 1 / (1 + exp(−z)) is the sigmoid function, then its derivative is given by f'(z) = f(z)(1 − f(z)).
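As a quick sanity check of the derivative identity, the following minimal sketch compares f(z)(1 − f(z)) against a central finite difference. The helper names (`sigmoid`, `sigmoid_prime`, `numerical_derivative`) and the sample points are illustrative choices, not part of the original notes.

```python
import math

def sigmoid(z):
    """Sigmoid activation f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative computed through the identity f'(z) = f(z) * (1 - f(z))."""
    fz = sigmoid(z)
    return fz * (1.0 - fz)

def numerical_derivative(f, z, eps=1e-5):
    """Central finite difference, used only as an independent check."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

# Arbitrary sample points chosen for illustration.
for z in (-2.0, 0.0, 0.5, 3.0):
    analytic = sigmoid_prime(z)
    numeric = numerical_derivative(sigmoid, z)
    print(f"z={z:+.1f}  analytic={analytic:.6f}  numeric={numeric:.6f}")
```

The two columns should agree to many decimal places, which is what makes the identity convenient: the derivative comes almost for free once f(z) is known.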

Either the sigmoid function or the tanh function can be used to perform the nonlinear mapping.

Neural Network model

A neural network is put together by hooking together many of our simple "neurons," so that the output of a neuron can be the input of another. For example, here is a small neural network:

In this figure, we have used circles to also denote the inputs to the network. The circles labeled "+1" are called bias units, and correspond to the intercept term. The leftmost layer of the network is called the input layer, and the rightmost layer the output layer (which, in this example, has only one node). The middle layer of nodes is called the hidden layer, because its values are not observed in the training set. We also say that our example neural network has 3 input units (not counting the bias unit), 3 hidden units, and 1 output unit.

Our neural network has parameters (W,b) = (W⁽¹⁾,b⁽¹⁾,W⁽²⁾,b⁽²⁾), where we write $W^{(l)}_{ij}$ to denote the parameter (or weight) associated with the connection between unit j in layer l, and unit i in layer l + 1. (Note the order of the indices.) Also, $b^{(l)}_i$ is the bias associated with unit i in layer l + 1.

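We will write $a^{(l)}_i$ to denote the activation (meaning output value) of unit i in layer l. For l = 1, we also use $a^{(1)}_i = x_i$ to denote the i-th input. Given a fixed setting of the parameters W,b, our neural network defines a hypothesis $h_{W,b}(x)$ that outputs a real number. Specifically, the computation that this neural network represents is given by:

$$\begin{aligned}
a_1^{(2)} &= f(W_{11}^{(1)} x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 + b_1^{(1)}) \\
a_2^{(2)} &= f(W_{21}^{(1)} x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3 + b_2^{(1)}) \\
a_3^{(2)} &= f(W_{31}^{(1)} x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3 + b_3^{(1)}) \\
h_{W,b}(x) &= a_1^{(3)} = f(W_{11}^{(2)} a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)} + b_1^{(2)})
\end{aligned}$$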

Each layer is a linear combination followed by a nonlinear mapping.

In the sequel, we also let $z^{(l)}_i$ denote the total weighted sum of inputs to unit i in layer l, including the bias term (e.g., $z^{(2)}_i = \sum_{j=1}^{n} W^{(1)}_{ij} x_j + b^{(1)}_i$), so that $a^{(l)}_i = f(z^{(l)}_i)$.

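Note that this easily lends itself to a more compact notation. Specifically, if we extend the activation function $f(\cdot)$ to apply to vectors in an element-wise fashion (i.e., f([z₁,z₂,z₃]) = [f(z₁),f(z₂),f(z₃)]), then we can write the equations above more compactly as:

$$\begin{aligned}
z^{(2)} &= W^{(1)} x + b^{(1)} \\
a^{(2)} &= f(z^{(2)}) \\
z^{(3)} &= W^{(2)} a^{(2)} + b^{(2)} \\
h_{W,b}(x) &= a^{(3)} = f(z^{(3)})
\end{aligned}$$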

We call this step forward propagation.
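Forward propagation can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the sigmoid activation and the 3-3-1 example network; the function name `forward` and every numeric parameter value below are made up for the example. The loop applies the same two steps per layer, a weighted sum plus bias followed by the activation, so it also covers networks with more layers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights, biases):
    """Return (zs, activations) for every layer of the network.

    weights[k][i][j] connects unit j of one layer to unit i of the next
    layer; biases[k][i] is the bias of unit i in that next layer.
    """
    activations = [list(x)]          # the first layer's activation is the input
    zs = []
    for W, b in zip(weights, biases):
        a_prev = activations[-1]
        # Weighted sum of inputs plus bias for each unit of the next layer.
        z = [sum(w_ij * a_j for w_ij, a_j in zip(row, a_prev)) + b_i
             for row, b_i in zip(W, b)]
        zs.append(z)
        # Element-wise activation.
        activations.append([sigmoid(z_i) for z_i in z])
    return zs, activations

# Hypothetical parameters for a network with 3 inputs, 3 hidden units, 1 output.
W1 = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1], [-0.3, 0.2, 0.1]]
b1 = [0.0, 0.1, -0.1]
W2 = [[0.5, -0.5, 0.25]]
b2 = [0.05]
x = [1.0, 2.0, 3.0]

zs, activations = forward(x, [W1, W2], [b1, b2])
h = activations[-1][0]               # the hypothesis output for this input
print("hidden activations:", activations[1])
print("output:", h)
```

Returning the intermediate sums and activations, rather than only the output, is deliberate: backpropagation needs them later.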

Backpropagation Algorithm

For a single training example (x,y), we define the cost function with respect to that single example to be:

$$J(W,b;x,y) = \frac{1}{2} \left\| h_{W,b}(x) - y \right\|^2$$

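This is a (one-half) squared-error cost function. Given a training set of m examples, we then define the overall cost function to be:

$$\begin{aligned}
J(W,b) &= \left[ \frac{1}{m} \sum_{i=1}^{m} J(W,b;x^{(i)},y^{(i)}) \right] + \frac{\lambda}{2} \sum_{l=1}^{n_l-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left( W^{(l)}_{ji} \right)^2 \\
&= \left[ \frac{1}{m} \sum_{i=1}^{m} \left( \frac{1}{2} \left\| h_{W,b}(x^{(i)}) - y^{(i)} \right\|^2 \right) \right] + \frac{\lambda}{2} \sum_{l=1}^{n_l-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left( W^{(l)}_{ji} \right)^2
\end{aligned}$$

Here $n_l$ is the number of layers in the network, $s_l$ is the number of units in layer l (not counting the bias unit), and $\lambda$ is the weight decay parameter.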

J(W,b;x,y) is the squared error cost with respect to a single example; J(W,b) is the overall cost function, which includes the weight decay term.
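To make the two cost functions concrete, here is a small sketch that evaluates the per-example squared error and the overall cost with weight decay. It assumes the network outputs have already been computed; the names `example_cost` and `overall_cost`, and all the numbers, are hypothetical.

```python
def example_cost(h, y):
    """Squared-error cost for one example: 1/2 * ||h - y||^2."""
    return 0.5 * sum((h_i - y_i) ** 2 for h_i, y_i in zip(h, y))

def overall_cost(predictions, targets, weights, lam):
    """Average per-example cost plus the weight decay term.

    predictions[i] is the network output for example i; weights is a list
    of weight matrices (lists of rows). Biases are not penalized.
    """
    m = len(predictions)
    data_term = sum(example_cost(h, y)
                    for h, y in zip(predictions, targets)) / m
    decay_term = 0.5 * lam * sum(w ** 2
                                 for W in weights for row in W for w in row)
    return data_term + decay_term

# Made-up outputs, targets, and a single 1x2 weight matrix.
predictions = [[0.8], [0.3]]
targets = [[1.0], [0.0]]
weights = [[[1.0, -2.0]]]
total = overall_cost(predictions, targets, weights, lam=0.1)
print(total)
```

With these numbers the data term is (0.02 + 0.045) / 2 = 0.0325 and the decay term is 0.05 × (1 + 4) = 0.25, so the total is about 0.2825, most of it coming from the penalty on the weights.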

Our goal is to minimize J(W,b) as a function of W and b. To train our neural network, we will initialize each parameter $W^{(l)}_{ij}$ and each $b^{(l)}_i$ to a small random value near zero (say according to a Normal(0,ε²) distribution for some small ε, say 0.01), and then apply an optimization algorithm such as batch gradient descent. Finally, note that it is important to initialize the parameters randomly, rather than to all 0's. If all the parameters start off at identical values, then all the hidden layer units will end up learning the same function of the input (more formally, $W^{(1)}_{ij}$ will be the same for all values of i, so that $a^{(2)}_1 = a^{(2)}_2 = a^{(2)}_3 = \ldots$ for any input x). The random initialization serves the purpose of symmetry breaking.
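The symmetry problem is easy to see numerically. The sketch below, with hypothetical names and values, computes the hidden-layer activations once with identical parameters and once with parameters drawn from Normal(0, ε²) with ε = 0.01. It only illustrates the forward pass; the fact that identical units also receive identical gradient updates is stated in the text and not demonstrated here.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_activations(x, W, b):
    """Activations of one hidden layer: sigmoid of weighted sum plus bias."""
    return [sigmoid(sum(w * x_j for w, x_j in zip(row, x)) + b_i)
            for row, b_i in zip(W, b)]

x = [0.5, -1.0, 2.0]                      # arbitrary input

# Identical parameters: every hidden unit computes the same value.
W_same = [[0.3] * 3 for _ in range(3)]
b_same = [0.3] * 3
a_same = hidden_activations(x, W_same, b_same)

# Small random parameters (seed fixed only for reproducibility).
rng = random.Random(0)
eps = 0.01
W_rand = [[rng.gauss(0.0, eps) for _ in range(3)] for _ in range(3)]
b_rand = [rng.gauss(0.0, eps) for _ in range(3)]
a_rand = hidden_activations(x, W_rand, b_rand)

print("identical init:", a_same)
print("random init:   ", a_rand)
```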

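One iteration of gradient descent updates the parameters W,b as follows:

$$\begin{aligned}
W^{(l)}_{ij} &= W^{(l)}_{ij} - \alpha \frac{\partial}{\partial W^{(l)}_{ij}} J(W,b) \\
b^{(l)}_i &= b^{(l)}_i - \alpha \frac{\partial}{\partial b^{(l)}_i} J(W,b)
\end{aligned}$$

where $\alpha$ is the learning rate. The key step is computing the partial derivatives above. Backpropagation gives an efficient way to compute $\frac{\partial}{\partial W^{(l)}_{ij}} J(W,b;x,y)$ and $\frac{\partial}{\partial b^{(l)}_i} J(W,b;x,y)$ for a single example (x,y). Once these are known, the derivatives of the overall cost function are:

$$\begin{aligned}
\frac{\partial}{\partial W^{(l)}_{ij}} J(W,b) &= \left[ \frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial W^{(l)}_{ij}} J(W,b;x^{(i)},y^{(i)}) \right] + \lambda W^{(l)}_{ij} \\
\frac{\partial}{\partial b^{(l)}_i} J(W,b) &= \frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial b^{(l)}_i} J(W,b;x^{(i)},y^{(i)})
\end{aligned}$$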

The two lines above differ slightly because weight decay is applied to W but not b.

The intuition behind the backpropagation algorithm is as follows. Given a training example (x,y), we will first run a "forward pass" to compute all the activations throughout the network, including the output value of the hypothesis $h_{W,b}(x)$. Then, for each node i in layer l, we would like to compute an "error term" $\delta^{(l)}_i$ that measures how much that node was "responsible" for any errors in our output.

For an output node, we can directly measure the difference between the network's activation and the true target value, and use that to define $\delta^{(n_l)}_i$ (where layer $n_l$ is the output layer). For hidden units, we will compute $\delta^{(l)}_i$ based on a weighted average of the error terms of the nodes that use $a^{(l)}_i$ as an input. In detail, here is the backpropagation algorithm:

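1. Perform a feedforward pass, computing the activations for layers L₂, L₃, and so on up to the output layer $L_{n_l}$.

2. For each output unit i in layer $n_l$ (the output layer), set

$$\delta^{(n_l)}_i = \frac{\partial}{\partial z^{(n_l)}_i} \frac{1}{2} \left\| y - h_{W,b}(x) \right\|^2 = -(y_i - a^{(n_l)}_i) \cdot f'(z^{(n_l)}_i)$$

3. For $l = n_l - 1, n_l - 2, n_l - 3, \ldots, 2$: for each node i in layer l, set

$$\delta^{(l)}_i = \left( \sum_{j=1}^{s_{l+1}} W^{(l)}_{ji} \delta^{(l+1)}_j \right) f'(z^{(l)}_i)$$

4. Compute the desired partial derivatives, which are given as:

$$\begin{aligned}
\frac{\partial}{\partial W^{(l)}_{ij}} J(W,b;x,y) &= a^{(l)}_j \delta^{(l+1)}_i \\
\frac{\partial}{\partial b^{(l)}_i} J(W,b;x,y) &= \delta^{(l+1)}_i
\end{aligned}$$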

We will use "$\bullet$" to denote the element-wise product operator (denoted ".*" in Matlab or Octave, and also called the Hadamard product), so that if $a = b \bullet c$, then $a_i = b_i c_i$. Similar to how we extended the definition of $f(\cdot)$ to apply element-wise to vectors, we also do the same for $f'(\cdot)$ (so that $f'([z_1,z_2,z_3]) = [f'(z_1), f'(z_2), f'(z_3)]$).

The algorithm can then be written:

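1. Perform a feedforward pass, computing the activations for layers $L_2$, $L_3$, up to the output layer $L_{n_l}$, using the equations defining the forward propagation steps.

2. For the output layer (layer $n_l$), set

$$\delta^{(n_l)} = -(y - a^{(n_l)}) \bullet f'(z^{(n_l)})$$

3. For $l = n_l - 1, n_l - 2, n_l - 3, \ldots, 2$, set

$$\delta^{(l)} = \left( (W^{(l)})^T \delta^{(l+1)} \right) \bullet f'(z^{(l)})$$

4. Compute the desired partial derivatives:

$$\begin{aligned}
\nabla_{W^{(l)}} J(W,b;x,y) &= \delta^{(l+1)} (a^{(l)})^T \\
\nabla_{b^{(l)}} J(W,b;x,y) &= \delta^{(l+1)}
\end{aligned}$$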

Implementation note: In steps 2 and 3 above, we need to compute $f'(z^{(l)}_i)$ for each value of $i$. Assuming $f(z)$ is the sigmoid activation function, we would already have $a^{(l)}_i$ stored away from the forward pass through the network. Thus, using the expression that we worked out earlier for $f'(z)$, we can compute this as $f'(z^{(l)}_i) = a^{(l)}_i (1 - a^{(l)}_i)$.
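The four steps, together with the implementation note, can be sketched for a single example as follows. This is an illustrative plain-Python version, assuming sigmoid activations and the one-half squared-error cost; the names `forward`, `backprop`, and `cost` and all parameter values are hypothetical. The last lines compare one analytic partial derivative against a finite-difference estimate as a sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights, biases):
    """Return the activations of every layer, starting with the input."""
    activations = [list(x)]
    for W, b in zip(weights, biases):
        a_prev = activations[-1]
        activations.append([
            sigmoid(sum(w * a for w, a in zip(row, a_prev)) + b_i)
            for row, b_i in zip(W, b)])
    return activations

def backprop(x, y, weights, biases):
    """Gradients of the single-example cost with respect to W and b."""
    activations = forward(x, weights, biases)            # step 1
    a_out = activations[-1]
    # Step 2: output error term, using f'(z) = a * (1 - a) for the sigmoid.
    delta = [-(y_i - a_i) * a_i * (1.0 - a_i) for y_i, a_i in zip(y, a_out)]
    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)
    for k in range(len(weights) - 1, -1, -1):
        a_prev = activations[k]
        # Step 4: outer product of the next layer's delta with this
        # layer's activations; the bias gradient is the delta itself.
        grads_W[k] = [[d_i * a_j for a_j in a_prev] for d_i in delta]
        grads_b[k] = list(delta)
        if k > 0:
            # Step 3: propagate the error term back through W transpose.
            W = weights[k]
            delta = [sum(W[i][j] * delta[i] for i in range(len(delta)))
                     * a_prev[j] * (1.0 - a_prev[j])
                     for j in range(len(a_prev))]
    return grads_W, grads_b

def cost(x, y, weights, biases):
    h = forward(x, weights, biases)[-1]
    return 0.5 * sum((h_i - y_i) ** 2 for h_i, y_i in zip(h, y))

# Hypothetical 3-3-1 network and one training example.
weights = [[[0.1, -0.2, 0.3], [0.0, 0.4, -0.1], [-0.3, 0.2, 0.1]],
           [[0.5, -0.5, 0.25]]]
biases = [[0.0, 0.1, -0.1], [0.05]]
x, y = [1.0, 2.0, 3.0], [1.0]

grads_W, grads_b = backprop(x, y, weights, biases)

# Finite-difference estimate for one first-layer weight.
eps = 1e-5
weights[0][0][1] += eps
c_plus = cost(x, y, weights, biases)
weights[0][0][1] -= 2 * eps
c_minus = cost(x, y, weights, biases)
weights[0][0][1] += eps                                   # restore
numeric = (c_plus - c_minus) / (2 * eps)
print("analytic:", grads_W[0][0][1], "numeric:", numeric)
```

Note that the code never evaluates the derivative of the activation directly: it reuses the stored activations, exactly as the implementation note suggests.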

Finally, we are ready to describe the full gradient descent algorithm. In the pseudo-code below, $\Delta W^{(l)}$ is a matrix (of the same dimension as $W^{(l)}$), and $\Delta b^{(l)}$ is a vector (of the same dimension as $b^{(l)}$). Note that in this notation, "$\Delta W^{(l)}$" is a matrix, and in particular it isn't "$\Delta$ times $W^{(l)}$." We implement one iteration of batch gradient descent as follows:

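1. Set $\Delta W^{(l)} := 0$, $\Delta b^{(l)} := 0$ (matrix/vector of zeros) for all $l$.

2. For $i = 1$ to $m$,

   a. Use backpropagation to compute $\nabla_{W^{(l)}} J(W,b;x,y)$ and $\nabla_{b^{(l)}} J(W,b;x,y)$.

   b. Set $\Delta W^{(l)} := \Delta W^{(l)} + \nabla_{W^{(l)}} J(W,b;x,y)$.

   c. Set $\Delta b^{(l)} := \Delta b^{(l)} + \nabla_{b^{(l)}} J(W,b;x,y)$.

3. Update the parameters:

$$\begin{aligned}
W^{(l)} &= W^{(l)} - \alpha \left[ \left( \frac{1}{m} \Delta W^{(l)} \right) + \lambda W^{(l)} \right] \\
b^{(l)} &= b^{(l)} - \alpha \left[ \frac{1}{m} \Delta b^{(l)} \right]
\end{aligned}$$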
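One iteration of batch gradient descent might look like the sketch below. It is a self-contained plain-Python illustration, not a reference implementation: the function names, the tiny 2-2-1 network, the OR-style toy data set, and the values of the learning rate and weight decay parameter are all assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights, biases):
    activations = [list(x)]
    for W, b in zip(weights, biases):
        a_prev = activations[-1]
        activations.append([
            sigmoid(sum(w * a for w, a in zip(row, a_prev)) + b_i)
            for row, b_i in zip(W, b)])
    return activations

def backprop(x, y, weights, biases):
    """Single-example gradients (sigmoid units, squared-error cost)."""
    activations = forward(x, weights, biases)
    a_out = activations[-1]
    delta = [-(y_i - a_i) * a_i * (1.0 - a_i) for y_i, a_i in zip(y, a_out)]
    grads_W, grads_b = [None] * len(weights), [None] * len(weights)
    for k in range(len(weights) - 1, -1, -1):
        a_prev = activations[k]
        grads_W[k] = [[d_i * a_j for a_j in a_prev] for d_i in delta]
        grads_b[k] = list(delta)
        if k > 0:
            W = weights[k]
            delta = [sum(W[i][j] * delta[i] for i in range(len(delta)))
                     * a_prev[j] * (1.0 - a_prev[j])
                     for j in range(len(a_prev))]
    return grads_W, grads_b

def overall_cost(data, weights, biases, lam):
    m = len(data)
    err = sum(0.5 * sum((h - t) ** 2
                        for h, t in zip(forward(x, weights, biases)[-1], y))
              for x, y in data) / m
    decay = 0.5 * lam * sum(w * w for W in weights for row in W for w in row)
    return err + decay

def gradient_descent_step(data, weights, biases, alpha, lam):
    """One batch iteration; updates weights and biases in place."""
    m = len(data)
    # Step 1: zero accumulators with the same shapes as W and b.
    dW = [[[0.0] * len(row) for row in W] for W in weights]
    db = [[0.0] * len(b) for b in biases]
    # Step 2: accumulate the per-example gradients.
    for x, y in data:
        gW, gb = backprop(x, y, weights, biases)
        for l in range(len(weights)):
            for i in range(len(weights[l])):
                db[l][i] += gb[l][i]
                for j in range(len(weights[l][i])):
                    dW[l][i][j] += gW[l][i][j]
    # Step 3: update; weight decay applies to W but not to b.
    for l in range(len(weights)):
        for i in range(len(weights[l])):
            biases[l][i] -= alpha * (db[l][i] / m)
            for j in range(len(weights[l][i])):
                weights[l][i][j] -= alpha * (dW[l][i][j] / m
                                             + lam * weights[l][i][j])

# Toy data (logical OR) and hypothetical small initial parameters.
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]
weights = [[[0.1, -0.1], [0.05, 0.2]], [[0.1, -0.2]]]
biases = [[0.0, 0.0], [0.0]]
alpha, lam = 0.5, 1e-4

cost_before = overall_cost(data, weights, biases, lam)
for _ in range(200):
    gradient_descent_step(data, weights, biases, alpha, lam)
cost_after = overall_cost(data, weights, biases, lam)
print("cost before:", cost_before, "cost after:", cost_after)
```

Repeating the step many times, as the loop at the end does, is how the network is trained: each iteration should move the parameters toward a lower value of the overall cost.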

Time: 2024-10-13 08:09:26
