Machine Learning - Neural Networks Learning: Cost Function and Backpropagation

This series of articles is my study notes for " Machine Learning ", by Prof. Andrew Ng, Stanford University. This article covers the week 5 material, Neural Networks Learning, and discusses the cost function and the backpropagation algorithm.

Cost Function and Backpropagation

Neural networks are one of the most powerful learning algorithms that we have today. In this and the next few sections, we're going to start talking about a learning algorithm for fitting the parameters of a neural network given a training set. As with the discussion of most of our learning algorithms, we're going to begin by talking about the cost function for fitting the parameters of the network.

1. Cost function

I'm going to focus on the application of neural networks to classification problems. So suppose we have a network like the one shown in the picture, and suppose we have a training set of m training examples {(x(i), y(i))}.

L = total no. of layers in network, L = 4.

sl = no. of units (not counting the bias unit) in layer l; here s1 = 3, s2 = 5, s4 = sL = 4.

Binary classification

The first case is binary classification, where the labels y are either 0 or 1. In this case we will have one output unit: the neural network shown above has 4 output units, but for binary classification we would have only one output unit that computes h(x), and the output of the neural network h(x) is a real number.

y = 0 or 1

Multi-class classification (K classes)

In this case the network has K output units, one per class, so K = number of output units/classes (with K ≥ 3; for two classes we would just use the single output unit of the binary setting above).
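For example, with K = 4 classes the labels are represented as one-hot vectors and the hypothesis outputs a vector in R^K:

\[ y^{(i)} \in \left\{ \begin{bmatrix}1\\0\\0\\0\end{bmatrix}, \begin{bmatrix}0\\1\\0\\0\end{bmatrix}, \begin{bmatrix}0\\0\\1\\0\end{bmatrix}, \begin{bmatrix}0\\0\\0\\1\end{bmatrix} \right\}, \qquad h_\Theta(x) \in \mathbb{R}^{K}. \]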

Cost function

Logistic regression

The cost function we use for the neural network is going to be a generalization of the one that we use for logistic regression. For logistic regression we minimized the cost function J(θ), which was −1/m times a sum of the per-example costs, plus an extra regularization term: a sum from j = 1 through n, because we do not regularize the bias term θ0.
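For reference, that regularized logistic regression cost function is

\[ J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log\big(1-h_\theta(x^{(i)})\big) \Big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^{2}. \]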

Neural network

For a neural network, our cost function is going to be a generalization of this. Instead of having essentially just one logistic-regression output unit, we may instead have K of them. So here's our cost function.

Our neural network now outputs vectors in R^K, where K might be equal to 1 if we have a binary classification problem. I'm going to use the notation (h(x))k to denote the kth output. That is, h(x) is a K-dimensional vector, and the subscript k just selects the kth element of the vector that is output by my neural network. My cost function J(Θ) is now going to be the following.
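\[ J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\Big[\, y_k^{(i)}\log\big(h_\Theta(x^{(i)})\big)_k + (1-y_k^{(i)})\log\big(1-(h_\Theta(x^{(i)}))_k\big) \Big] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\big(\Theta_{ji}^{(l)}\big)^{2} \]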

It is −1/m times a sum of a term similar to what we had for logistic regression, except that we also have a sum from k = 1 through K. This summation is basically a sum over my K output units. So if I have four output units, that is, if the final layer of my neural network has four output units, then this is a sum from k = 1 through 4 of basically the logistic regression cost function, summed over each of my four output units in turn.

And finally, the second term here is the regularization term, similar to what we had for logistic regression. This summation term looks really complicated, but all it is doing is summing over the terms Θji(l) for all values of j, i and l, except that we do not sum over the terms corresponding to the bias values, just as for logistic regression.
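As a concrete illustration, here is a minimal NumPy sketch of this cost function. It is only a sketch under a few assumptions that are mine, not the course's: the weights are stored as a list of matrices Thetas (one per layer transition, with the first column multiplying the bias unit), and the labels are given as an m × K one-hot matrix Y.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_cost(Thetas, X, Y, lam):
    """Regularized neural-network cost J(Theta).

    Thetas : list of weight matrices; Theta^(l) has shape s_{l+1} x (s_l + 1)
    X      : m x s_1 matrix of training examples (one example per row)
    Y      : m x K matrix of one-hot labels
    lam    : regularization parameter lambda
    """
    m = X.shape[0]

    # Forward propagation, vectorized over the whole training set.
    A = X
    for Theta in Thetas:
        A = np.hstack([np.ones((m, 1)), A])   # prepend the bias unit
        A = sigmoid(A @ Theta.T)              # activations of the next layer
    H = A                                     # m x K matrix of outputs h_Theta(x)

    # Unregularized part: logistic cost summed over all K outputs and m examples.
    J = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m

    # Regularization: squared weights summed over every layer, bias columns excluded.
    J += lam / (2.0 * m) * sum(np.sum(Theta[:, 1:] ** 2) for Theta in Thetas)
    return J
```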

2. Backpropagation algorithm

In the previous section, we talked about the cost function for the neural network. In this section, let's start to talk about an algorithm for trying to minimize that cost function: the backpropagation algorithm.

Gradient computation

Here's the cost function that we wrote down in the previous section. What we'd like to do is find parameters Θ that minimize J(Θ). In order to use either gradient descent or one of the advanced optimization algorithms, we need code that can compute the following.

Need code to compute: J(Θ) and the partial derivatives ∂J(Θ)/∂Θij(l).

What we need to do, therefore, is write code that takes as input the parameters Θ and computes J(Θ) and these partial derivative terms. Remember that the parameters of the neural network are these Θij(l) terms, which are real numbers, and so these are the partial derivative terms we need to compute. In order to compute the cost function J(Θ), we just use the formula above, so what I want to do for most of this section is focus on how we can compute these partial derivative terms.

Given one training example (x, y)

Let's start by talking about the case where we have only one training example; our entire training set comprises only this one training example, which is a pair (x, y).

And let's step through the sequence of calculations we would do with this one training example. The first thing we do is apply forward propagation in order to compute what the hypothesis actually outputs given the input.

Forward propagation

With the four-layer network this means a(1) = x, z(2) = Θ(1)a(1), a(2) = g(z(2)) (adding the bias unit a0(2)), z(3) = Θ(2)a(2), a(3) = g(z(3)) (adding a0(3)), z(4) = Θ(3)a(3), and a(4) = h(x) = g(z(4)). This is our vectorized implementation of forward propagation, and it allows us to compute the activation values for all of the neurons in our neural network.
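For a single example, a minimal sketch of this vectorized forward pass might look like the following; it assumes the same list-of-matrices weight representation and the sigmoid helper from the cost-function sketch above (the function name forward_propagate is mine, not the course's):

```python
import numpy as np

def forward_propagate(Thetas, x):
    """Forward propagation for one example x (1-D array of length s_1).

    Returns the activations a^(l) and pre-activations z^(l) of every layer;
    the backpropagation sketch later reuses the activations.
    """
    a = np.concatenate([[1.0], x])       # a^(1) with the bias unit prepended
    activations, zs = [a], []
    for l, Theta in enumerate(Thetas):
        z = Theta @ a                    # z^(l+1) = Theta^(l) a^(l)
        a = sigmoid(z)                   # a^(l+1) = g(z^(l+1))
        if l < len(Thetas) - 1:          # every hidden layer also gets a bias unit
            a = np.concatenate([[1.0], a])
        zs.append(z)
        activations.append(a)
    return activations, zs               # activations[-1] is h_Theta(x)
```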

Gradient computation: Back propagation algorithm

Next, in order to compute the derivatives, we're going to use an algorithm called backpropagation. The intuition of the backpropagation algorithm is that for each node we're going to compute a term δj(l) that is going to somehow represent the error of node j in layer l.

Intuition: δj(l) = "error" of node j in layer l.

For each output unit (layer L = 4): δj(4) = aj(4) − yj.

If you think of δ, a and y as vectors, then you can also come up with a vectorized implementation of this, which is just δ(4) = a(4) − y.

Here, each of δ(4), a(4) and y is a vector whose dimension is equal to the number of output units in our network.

What we do next is compute the delta terms for the earlier layers in our network. Here's the formula for computing δ(3): δ(3) = (Θ(3))T δ(4) .* g′(z(3)), where g′(z(3)) = a(3) .* (1 − a(3)) and ".*" is the element-wise multiplication operation that we know from MATLAB. Similarly δ(2) = (Θ(2))T δ(3) .* g′(z(2)); there is no δ(1) term, because the first layer is the input layer and has no error associated with it.

Backpropagation algorithm

Training set: {(x(1), y(1)), ..., (x(m), y(m))}. With m training examples, the algorithm first sets the accumulators Δij(l) := 0 for all l, i, j. Then, for each example i = 1 to m, it sets a(1) = x(i), performs forward propagation to compute a(l) for l = 2, 3, ..., L, uses y(i) to compute δ(L) = a(L) − y(i), computes δ(L−1), ..., δ(2), and accumulates Δ(l) := Δ(l) + δ(l+1)(a(l))T. Finally the partial derivatives are Dij(l) = ∂J(Θ)/∂Θij(l) = (1/m)(Δij(l) + λΘij(l)) for j ≠ 0, and Dij(l) = (1/m)Δij(l) for the bias terms (j = 0). A sketch of the whole procedure follows.
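Here is a minimal NumPy sketch of this procedure. It reuses the forward_propagate helper from the sketch above, assumes the same list-of-matrices weight representation and one-hot label matrix Y, and is only an illustrative sketch, not the course's reference implementation:

```python
import numpy as np

def backpropagation(Thetas, X, Y, lam):
    """Gradients D^(l) = dJ/dTheta^(l), computed by backpropagation.

    Reuses forward_propagate (and sigmoid) from the sketches above.
    X : m x s_1 matrix of inputs, Y : m x K matrix of one-hot labels.
    """
    m = X.shape[0]
    Deltas = [np.zeros_like(Theta) for Theta in Thetas]     # accumulators

    for i in range(m):
        activations, _ = forward_propagate(Thetas, X[i])
        # Output-layer error: delta^(L) = a^(L) - y.
        delta = activations[-1] - Y[i]
        Deltas[-1] += np.outer(delta, activations[-2])
        # Propagate the error from right to left through the hidden layers:
        # delta^(l) = (Theta^(l))^T delta^(l+1) .* a^(l) .* (1 - a^(l)),
        # then drop the entry belonging to the bias unit.
        for l in range(len(Thetas) - 2, -1, -1):
            a = activations[l + 1]
            delta = (Thetas[l + 1].T @ delta) * a * (1 - a)
            delta = delta[1:]
            Deltas[l] += np.outer(delta, activations[l])

    # Average over the m examples and add regularization for non-bias weights.
    Ds = [Delta / m for Delta in Deltas]
    for D, Theta in zip(Ds, Thetas):
        D[:, 1:] += (lam / m) * Theta[:, 1:]
    return Ds
```

The returned list Ds can be unrolled into a single gradient vector and handed, together with the cost from nn_cost above, to gradient descent or to an advanced optimizer.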

3. Backpropagation intuition

Backpropagation is, perhaps unfortunately, a less mathematically clean and less mathematically simple algorithm compared to linear regression or logistic regression. I've actually used backpropagation pretty successfully for many years, and even today I sometimes feel I don't have a very good sense of just what it's doing, or a good intuition about what backpropagation is doing. For those of you doing the programming exercises, they will at least mechanically step you through the different steps of how to implement backprop, so you'll be able to get it to work for yourself. What I want to do in this section is look a little bit more at the mechanical steps of backpropagation, and try to give you a little more intuition about what those mechanical steps are doing, to hopefully convince you that it is at least a reasonable algorithm.

Forward Propagation

In order to better understand backpropagation, let's take another, closer look at what forward propagation is doing. Here's a neural network with two input units (not counting the bias unit), two hidden units in the first hidden layer, two hidden units in the next layer, and finally one output unit. Again, these counts of two, two, two do not include the bias units on top.

In order to illustrate forward propagation, I'm going to draw this network a little bit differently; in particular, I'm going to draw the nodes as very fat ellipses, so that I can write text in them. When performing forward propagation, we might have some particular example, say (x(i), y(i)), and it is this x(i) that we feed into the input layer.

When we forward propagate to the first hidden layer, we compute z1(2) and z2(2), the weighted sums of the inputs from the input units, and then we apply the sigmoid (logistic) activation function to these z values. That gives us the activation values a1(2) and a2(2). Then we forward propagate again to get z1(3); the way we compute this value is z1(3) = Θ10(2) × 1 + Θ11(2) a1(2) + Θ12(2) a2(2). Applying the activation function gives a1(3), and similarly we continue until we get z1(4); applying the activation function one more time gives a1(4), which is the final output value of the neural network.

What is backpropagation doing? 

What backpropagation is doing is a process very similar to forward propagation, except that instead of the computations flowing from the left to the right of the network, the computations flow from the right to the left, using a very similar kind of calculation.

Recall the cost function of the neural network given in Section 1 above.

Focusing on a single example (x(i), y(i)), the case of one output unit (K = 1), and ignoring regularization (λ = 0), the cost associated with this example can be written as follows.
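\[ \text{cost}(i) = -\,y^{(i)}\log h_\Theta(x^{(i)}) \;-\; (1-y^{(i)})\log\big(1-h_\Theta(x^{(i)})\big) \]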

What this cost function does is play a role similar to the squared error. So, rather than looking at this complicated expression, you can, if you want, think of cost(i) as being approximately the squared difference between what the neural network outputs and the actual value.

Think of cost(i) ≈ (h(x(i)) − y(i))², i.e. how well is the network doing on example i?

More formally, what the delta terms actually are is this: they are the partial derivatives of the cost function with respect to the zj(l), the weighted sums of inputs that we compute as the z terms. Concretely, the cost function is a function of the label y and of the value h(x) output by the neural network. If we could go inside the neural network and just change those zj(l) values a little bit, that would affect the values that the neural network is outputting, and that would end up changing the cost function.
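In symbols:

\[ \delta_j^{(l)} = \frac{\partial}{\partial z_j^{(l)}}\,\text{cost}(i) \qquad (\text{for } j \ge 0) \]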

We don't compute delta terms for the bias units

And by the way, so far I've been writing the delta values only for the hidden units, excluding the bias units. Depending on how you define the backpropagation algorithm, or depending on how you implement it, you may end up implementing something that computes delta values for these bias units as well. The bias units always output the value plus one; they are just what they are, and there is no way for us to change that value. So, depending on your implementation of backprop: the way I usually implement it, I do end up computing these delta values, but we just discard them; we don't use them.
