About this Course
This course will teach you the "magic" of getting deep learning to work well. Rather than the deep learning process being a black box, you will understand what drives performance, and be able to more systematically get good results. You will also learn TensorFlow.
After 3 weeks, you will:
- Understand industry best-practices for building deep learning applications.
- Be able to effectively use the common neural network "tricks", including initialization, L2 and dropout regularization, batch normalization, and gradient checking.
- Be able to implement and apply a variety of optimization algorithms, such as mini-batch gradient descent, Momentum, RMSprop and Adam, and check for their convergence.
- Understand new best-practices for the deep learning era of how to set up train/dev/test sets and analyze bias/variance
- Be able to implement a neural network in TensorFlow.
This is the second course of the Deep Learning Specialization.
Practical aspects of Deep Learning
Train / Dev / Test sets - 12m
0:00
Welcome to this course on the practical aspects of deep learning. By now you've learned how to implement a neural network. In this week you'll learn the practical aspects of how to make your neural network work well, ranging from things like hyperparameter tuning, to how to set up your data, to how to make sure your optimization algorithm runs quickly so that your learning algorithm learns in a reasonable time. In this first week we'll first talk about how to set up your machine learning problem, then we'll talk about regularization, and we'll talk about some tricks for making sure your neural network implementation is correct. With that, let's get started. Making good choices in how you set up your training, development, and test sets can make a huge difference in helping you quickly find a good, high-performance neural network. When training a neural network you have to make a lot of decisions, such as: how many layers will your neural network have? How many hidden units do you want each layer to have? What's the learning rate? What are the activation functions you want to use for the different layers? When you're starting on a new application, it's almost impossible to correctly guess the right values for all of these, and for other hyperparameter choices, on your first attempt. So in practice applied machine learning is a highly iterative process, in which you often start with an idea, such as wanting to build a neural network with a certain number of layers, a certain number of hidden units, maybe on certain data sets, and so on. Then you code it up and try it by running your code. You run an experiment and get back a result that tells you how well this particular network, or this particular configuration, works. Based on the outcome, you then refine your ideas, change your choices, and keep iterating in order to find a better and better neural network.
1:50
Today, deep learning has found great success in a lot of areas, ranging from natural language processing, to computer vision, to speech recognition, to a lot of applications on structured data. And structured data includes everything from advertisements to web search, which isn't just Internet search engines; it's also, for example, shopping websites, or really any website that wants to deliver great search results when you enter terms into a search bar; to computer security; to logistics, such as figuring out where to send drivers to pick up and drop off things; and many more. So what I'm seeing is that sometimes a researcher with a lot of experience in NLP might try to do something in computer vision. Or maybe a researcher with a lot of experience in speech recognition might jump in and try to do something on advertising. Or someone from security might want to jump in and do something on logistics. And what I've seen is that intuitions from one domain or from one application area often do not transfer to other application areas. The best choices may depend on the amount of data you have, the number of input features you have, your compute configuration, and whether you're training on GPUs or CPUs (and if so, exactly what configuration of GPUs and CPUs), and many other things. So for a lot of applications, even very experienced deep learning people find it almost impossible to correctly guess the best choice of hyperparameters the very first time. And so today, applied deep learning is a very iterative process where you just have to go around this cycle many times to hopefully find a good choice of network for your application. So one of the things that determines how quickly you can make progress is how efficiently you can go around this cycle. And setting up your data sets well, in terms of your train, development and test sets, can make you much more efficient at that. So if this is your training data, let's draw that as a big box. Then traditionally you might take all the data you have and carve off some portion of it to be your training set, some portion of it to be your hold-out cross validation set,
4:23
and this is sometimes also called the development set. And for brevity I'm just going to call this the dev set, but all of these terms mean roughly the same thing. And then you might carve out some final portion of it to be your test set. And so the workflow is that you keep training algorithms on your training set, and use your dev set, or hold-out cross validation set, to see which of many different models performs best on your dev set. And then after having done this long enough, when you have a final model that you want to evaluate, you take the best model you have found and evaluate it on your test set, in order to get an unbiased estimate of how well your algorithm is doing. So in the previous era of machine learning, it was common practice to take all your data and split it according to maybe a 70/30 split (people often talk about a 70/30 train/test split if you don't have an explicit dev set), or maybe a 60/20/20% split in terms of 60% train, 20% dev and 20% test. And several years ago this was widely considered best practice in machine learning. If you have maybe 100 examples in total, maybe 1,000 examples in total, maybe 10,000 examples, these sorts of ratios were perfectly reasonable rules of thumb. But in the modern big data era, where, for example, you might have a million examples in total, the trend is that your dev and test sets have become a much smaller percentage of the total. Because remember, the goal of the dev set, or development set, is that you're going to test different algorithms on it and see which algorithm works better. So the dev set just needs to be big enough for you to evaluate, say, two different algorithm choices or ten different algorithm choices and quickly decide which one is doing better. And you might not need a whole 20% of your data for that. So, for example, if you have a million training examples you might decide that just having 10,000 examples in your dev set is more than enough to evaluate which of two algorithms does better. And in a similar vein, the main goal of your test set is, given your final classifier, to give you a pretty confident estimate of how well it's doing. And again, if you have a million examples, maybe you might decide that 10,000 examples is more than enough to evaluate a single classifier and give you a good estimate of how well it's doing. So in this example where you have a million examples, if you need just 10,000 for your dev and 10,000 for your test, your ratio will be more like this: 10,000 is 1% of 1 million, so you'll have 98% train, 1% dev, 1% test. And I've also seen applications where, if you have even more than a million examples, you might end up with 99.5% train and 0.25% dev, 0.25% test, or maybe 0.4% dev, 0.1% test. So just to recap, when setting up your machine learning problem, I'll often set it up into train, dev and test sets, and if you have a relatively small dataset, these traditional ratios might be okay. But if you have a much larger data set, it's also fine to set your dev and test sets to be much smaller than 20% or even 10% of your data. We'll give more specific guidelines on the sizes of dev and test sets later in this specialization. One other trend we're seeing in the era of modern deep learning is that more and more people train on mismatched train and test distributions. Let's say you're building an app that lets users upload a lot of pictures and your goal is to find pictures of cats in order to show your users.
Maybe all your users are cat lovers. Maybe your training set comes from cat pictures downloaded off the Internet, but your dev and test sets might comprise cat pictures from users of your app. So maybe your training set has a lot of pictures crawled off the Internet, but the dev and test sets are pictures uploaded by users. It turns out a lot of webpages have very high resolution, very professional, very nicely framed pictures of cats, but maybe your users are uploading blurrier, lower-resolution images just taken with a cell phone camera in a more casual setting. And so these two distributions of data may be different. The rule of thumb I'd encourage you to follow in this case is to make sure that the dev and test sets come from the same distribution.
9:23
We'll say more about this particular guideline as well, but because you will be using the dev set to evaluate a lot of different models and trying really hard to improve performance on the dev set, it's nice if your dev set comes from the same distribution as your test set. But because deep learning algorithms have such a huge hunger for training data, one trend I'm seeing is that you might use all sorts of creative tactics, such as crawling webpages, in order to acquire a much bigger training set than you would otherwise have, even if part of the cost of that is that your training set data might not come from the same distribution as your dev and test sets. But you find that so long as you follow this rule of thumb, progress in your machine learning algorithm will be faster. And I'll give a more detailed explanation for this particular rule of thumb later in the specialization as well. Finally, it might be okay to not have a test set. Remember, the goal of the test set is to give you an unbiased estimate of the performance of your final network, of the network that you selected. But if you don't need that unbiased estimate, then it might be okay to not have a test set. So what you do, if you have only a dev set but not a test set, is you train on the training set and then you try different model architectures, evaluate them on the dev set, and then use that to iterate and try to get to a good model. Because you've fit your data to the dev set, this no longer gives you an unbiased estimate of performance. But if you don't need one, that might be perfectly fine. In the machine learning world, when you have just a train and a dev set but no separate test set, most people will call the training set the training set and they will call the dev set the test set. But what they actually end up doing is using the "test set" as a hold-out cross validation set, which maybe isn't a completely great use of terminology, because they're then overfitting to the test set. So when a team tells you that they have only a train and a test set, I would just be cautious and think, do they really have a train/dev set? Because they're overfitting to the test set. Culturally, it might be difficult to change some of these teams' terminology and get them to call it a train/dev set rather than a train/test set, even though I think calling it a train and development set would be more correct terminology. And this is actually okay practice if you don't need a completely unbiased estimate of the performance of your algorithm. So having set up a train, dev and test set will allow you to iterate more quickly. It will also allow you to more efficiently measure the bias and variance of your algorithm, so you can more efficiently select ways to improve your algorithm. Let's start to talk about that in the next video.
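As a concrete illustration of the split sizes discussed above, here is a minimal sketch of carving a large dataset into 98% train, 1% dev, 1% test after shuffling. This is an assumption-laden illustration, not code from the course: the function name and the NumPy-array layout (`X` of shape `(m, ...)`, labels `y`) are my own choices.

```python
import numpy as np

def split_train_dev_test(X, y, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle the data and carve off small dev and test sets.

    With 1,000,000 examples and the default fractions, this yields roughly
    980,000 train / 10,000 dev / 10,000 test examples.
    """
    m = X.shape[0]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)
    X, y = X[perm], y[perm]

    n_dev = int(m * dev_frac)
    n_test = int(m * test_frac)
    n_train = m - n_dev - n_test

    X_train, y_train = X[:n_train], y[:n_train]
    X_dev, y_dev = X[n_train:n_train + n_dev], y[n_train:n_train + n_dev]
    X_test, y_test = X[n_train + n_dev:], y[n_train + n_dev:]
    return (X_train, y_train), (X_dev, y_dev), (X_test, y_test)
```

With a smaller dataset you would simply pass larger fractions, e.g. `dev_frac=0.2, test_frac=0.2` for the traditional 60/20/20 split.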
Bias / Variance - 8m
0:00
I've noticed that almost all the really good machine learning practitioners tend to have a very sophisticated understanding of bias and variance. Bias and variance is one of those concepts that's easily learned but difficult to master. Even if you think you've seen the basic concepts of bias and variance, there's often more nuance to it than you'd expect. In the deep learning era, another trend is that there's been less discussion of what's called the bias-variance trade-off. You might have heard of this thing called the bias-variance trade-off, but in the deep learning era there's less of a trade-off, so we still talk about bias, we still talk about variance, but we just talk less about the bias-variance trade-off. Let's see what this means. Let's look at a data set that looks like this. If you fit a straight line to the data, maybe you get a logistic regression fit to that. This is not a very good fit to the data, and so this is a case of high bias; we say that this is underfitting the data. On the opposite end, if you fit an incredibly complex classifier, maybe a deep neural network, or a neural network with lots of hidden units, maybe you can fit the data perfectly, but that doesn't look like a great fit either. So that's a classifier of high variance, and this is overfitting the data. And there might be some classifier in between, with a medium level of complexity, that fits the data correctly. That looks like a much more reasonable fit to the data, so we call that just right; it's somewhere in between. So in a 2D example like this, with just two features, x1 and x2, you can plot the data and visualize bias and variance. In high dimensional problems, you can't plot the data and visualize the decision boundary. Instead, there are a couple of different metrics that we'll look at to try to understand bias and variance. So continuing our example of cat picture classification, where that's a positive example and that's a negative example, the two key numbers to look at to understand bias and variance will be the train set error and the dev set (or development set) error. So for the sake of argument, let's say that recognizing cats in pictures is something that people can do nearly perfectly, right? So let's say your training set error is 1% and your dev set error is, for the sake of argument, 11%. So in this example, you're doing very well on the training set, but you're doing relatively poorly on the development set. So this looks like you might have overfit the training set, and somehow you're not generalizing well to this hold-out cross-validation set, the development set. And so if you have an example like this, we would say this has high variance. So by looking at the training set error and the development set error, you would be able to render a diagnosis of your algorithm having high variance. Now, let's say that you measure your training set and your dev set error, and you get a different result. Let's say that your training set error is 15% (I'm writing your training set error in the top row), and your dev set error is 16%. In this case, assuming that humans achieve roughly 0% error, that humans can look at these pictures and just tell whether it's a cat or not, then it looks like the algorithm is not even doing very well on the training set. So if it's not even fitting the training data that well, then this is underfitting the data, and so this algorithm has high bias.
But in contrast, it is actually generalizing at a reasonable level to the dev set, since performance on the dev set is only 1% worse than performance on the training set. So this algorithm has a problem of high bias, because it is not even fitting the training set well. This is similar to the leftmost plot we had on the previous slide. Now, here's another example. Let's say that you have 15% training set error, so that's pretty high bias, but when you evaluate on the dev set it does even worse, maybe 30%. In this case, I would diagnose this algorithm as having high bias, because it's not doing that well on the training set, and high variance. So this has really the worst of both worlds. And one last example, if you have 0.5% training set error and 1% dev set error, then maybe your users are quite happy that you have a cat classifier with only 1% error, and we have low bias and low variance. One subtlety that I'll just briefly mention, and that we'll leave to a later video to discuss in detail, is that this analysis is predicated on the assumption that human level performance gets nearly 0% error or, more generally, that the optimal error, sometimes called Bayes error (so the Bayes or optimal error), is nearly 0%. I don't want to go into detail on this in this particular video, but it turns out that if the optimal error, or the Bayes error, were much higher, say 15%, then if you look at this classifier, 15% is actually perfectly reasonable for the training set, and you wouldn't see it as high bias; it also has pretty low variance. So there are details of how to analyze bias and variance when no classifier can do very well, for example if you have really blurry images, so that even a human, or just no system, could possibly do very well; then maybe Bayes error is much higher, and then there are some details of how this analysis will change. But leaving aside this subtlety for now, the takeaway is that by looking at your training set error you can get a sense of how well you are fitting, at least, the training data, and so that tells you if you have a bias problem. And then looking at how much higher your error goes when you go from the training set to the dev set gives you a sense of how bad the variance problem is: how well you are generalizing from the training set to the dev set gives you a sense of your variance. All this is under the assumption that the Bayes error is quite small and that your training and your dev sets are drawn from the same distribution. If those assumptions are violated, there's a more sophisticated analysis you could do, which we'll talk about in a later video. Now, on the previous slide, you saw what high bias and high variance look like, and I guess you have a sense of what a good classifier looks like. What does high bias and high variance look like? This is kind of the worst of both worlds. So you remember, we said that a classifier like this has high bias, because it underfits the data. So this would be a classifier that is mostly linear and therefore underfits the data (we're drawing this in purple). But if somehow your classifier does some weird things, then it is actually overfitting parts of the data as well. So the classifier that I drew in purple has both high bias and high variance. It has high bias because, by being a mostly linear classifier, it is just not fitting
this quadratic line shape that well. But by having too much flexibility in the middle, it somehow gets this example and this example, and overfits those two examples as well. So this classifier kind of has high bias because it was mostly linear, but you needed maybe a curved or quadratic function. And it has high variance, because it had too much flexibility to fit those two mislabeled, or outlier, examples in the middle as well. In case this seems contrived, well, this example is a little bit contrived in two dimensions, but with very high dimensional inputs you actually do get things with high bias in some regions and high variance in some regions, and so it is possible to get classifiers like this with high dimensional inputs that seem less contrived. So to summarize, you've seen how by looking at your algorithm's error on the training set and your algorithm's error on the dev set you can try to diagnose whether it has problems of high bias or high variance, or maybe both, or maybe neither. And depending on whether your algorithm suffers from bias or variance, it turns out that there are different things you could try. So in the next video, I want to present to you what I call a basic recipe for machine learning, that lets you more systematically try to improve your algorithm, depending on whether it has high bias or high variance issues. So let's go on to the next video.
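To make the diagnosis above concrete, here is a small illustrative sketch that labels a model as high bias and/or high variance from its train and dev errors, relative to an estimate of Bayes error. The helper name and the 2% threshold are my own illustrative choices, not from the lecture.

```python
def diagnose_bias_variance(train_error, dev_error, bayes_error=0.0, tol=0.02):
    """Rough bias/variance diagnosis from train and dev set errors.

    Errors are fractions, e.g. 0.01 for 1%. `tol` is an arbitrary
    threshold for what counts as a "large" gap.
    """
    avoidable_bias = train_error - bayes_error   # how badly we fit the training set
    variance = dev_error - train_error           # how badly we generalize to the dev set

    diagnosis = []
    if avoidable_bias > tol:
        diagnosis.append("high bias")
    if variance > tol:
        diagnosis.append("high variance")
    return diagnosis or ["low bias and low variance"]

# The four cases from the lecture (assuming Bayes error is roughly 0%):
print(diagnose_bias_variance(0.01, 0.11))   # ['high variance']
print(diagnose_bias_variance(0.15, 0.16))   # ['high bias']
print(diagnose_bias_variance(0.15, 0.30))   # ['high bias', 'high variance']
print(diagnose_bias_variance(0.005, 0.01))  # ['low bias and low variance']
```

Note how passing a higher `bayes_error` (say 0.15 for very blurry images) changes the first diagnosis, exactly the subtlety discussed above.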
Basic Recipe for Machine Learning - 6m
0:00
In the previous video, you saw how looking at training error and dev error can help you diagnose whether your algorithm has a bias or a variance problem, or maybe both. It turns out that this information lets you much more systematically go about improving your algorithm's performance, using what I call a basic recipe for machine learning. Let's take a look. When training a neural network, here's a basic recipe I will use. After having trained an initial model, I will first ask: does your algorithm have high bias? And to try to evaluate if there is high bias, you should look at, really, the training set or the training data performance. And so, if it does have high bias, if it does not even fit the training set that well, some things you could try would be to pick a bigger network, such as more hidden layers or more hidden units, or you could train it longer. Maybe run training longer, or try some more advanced optimization algorithms, which we'll talk about later in this course. Or you can also try (and this is kind of a "maybe it will work, maybe it won't") finding a new network architecture: we'll see later that there are a lot of different neural network architectures, and maybe you can find one that's better suited for this problem. I'm putting this in parentheses because it's one of those things that you just have to try; maybe you can make it work, maybe not. Whereas getting a bigger network almost always helps. And training longer doesn't always help, but it certainly never hurts. So when training a learning algorithm, I would try these things until I can at least get rid of the bias problem, as in, go back after I've tried this and keep doing it until I can fit, at least, the training set pretty well. And usually if you have a big enough network, you should usually be able to fit the training data well, so long as it's a problem that is possible for someone to do, alright? If the image is very blurry, it may be impossible to fit it. But if at least a human can do well on the task, if you think Bayes error is not too high, then by training a big enough network you should be able to, hopefully, do well, at least on the training set, to at least fit or overfit the training set. Once you reduce bias to acceptable amounts, then ask: do you have a variance problem? And to evaluate that I would look at dev set performance. Are you able to generalize from a pretty good training set performance to having a pretty good dev set performance? And if you have high variance, well, the best way to solve a high variance problem is to get more data, if you can get it; this can only help. But sometimes you can't get more data. Or you could try regularization, which we'll talk about in the next video, to try to reduce overfitting. And then also, again, sometimes you just have to try it, but if you can find a more appropriate neural network architecture, sometimes that can reduce your variance problem as well, as well as reduce your bias problem. But it's harder to be totally systematic about how you do that. So I try these things and I kind of keep going back, until hopefully you find something with both low bias and low variance, whereupon you would be done. So a couple of points to notice. First is that, depending on whether you have high bias or high variance, the set of things you should try could be quite different.
So I'll usually use the train/dev set to try to diagnose whether you have a bias or a variance problem, and then use that to select the appropriate subset of things to try. So for example, if you actually have a high bias problem, getting more training data is actually not going to help, or at least it's not the most efficient thing to do. So being clear on how much of a bias problem or a variance problem (or both) you have can help you focus on selecting the most useful things to try. Second, in the earlier era of machine learning, there used to be a lot of discussion on what is called the bias-variance tradeoff. And the reason for that was that, for a lot of the things you could try, you could increase bias and reduce variance, or reduce bias and increase variance. Back in the pre-deep learning era, we didn't have as many tools that just reduce bias or just reduce variance without hurting the other one. But in the modern deep learning, big data era, so long as you can keep training a bigger network, and so long as you can keep getting more data (which isn't always the case for either of these), then getting a bigger network almost always just reduces your bias without necessarily hurting your variance, so long as you regularize appropriately. And getting more data pretty much always reduces your variance and doesn't hurt your bias much. So what's really happened is that, with these two steps, the ability to train a bigger network or get more data, we now have tools to drive down bias and just drive down bias, or drive down variance and just drive down variance, without really hurting the other thing that much. And I think this has been one of the big reasons that deep learning has been so useful for supervised learning: there's much less of this tradeoff where you have to carefully balance bias and variance, and you just have more options for reducing bias or reducing variance without necessarily increasing the other one. And, in fact, so long as you have a well regularized network (we'll talk about regularization starting from the next video), training a bigger network almost never hurts. And the main cost of training a neural network that's too big is just computational time, so long as you're regularizing. So I hope this gives you a sense of the basic structure of how to organize your machine learning problem to diagnose bias and variance, and then try to select the right operation for you to make progress on your problem. One of the things I mentioned several times in this video is regularization, which is a very useful technique for reducing variance. There is a little bit of a bias-variance tradeoff when you use regularization: it might increase the bias a little bit, although often not too much if you have a big enough network. But let's dive into more details in the next video so you can better understand how to apply regularization to your neural network.
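The recipe described above can be summarized as a loop. The following is only an illustrative sketch under my own assumptions: the callables (`train_fn`, `eval_fn`, and so on) are placeholders you would supply, and the thresholds are arbitrary; none of these names come from the course.

```python
def basic_recipe(train_fn, eval_fn, make_bigger, add_regularization, get_more_data,
                 model, data, bias_tol=0.02, var_tol=0.02, max_rounds=10):
    """Sketch of the basic recipe for machine learning described above."""
    for _ in range(max_rounds):
        model = train_fn(model, data["train"])
        train_err = eval_fn(model, data["train"])
        dev_err = eval_fn(model, data["dev"])

        if train_err > bias_tol:               # high bias: not fitting the training set
            model = make_bigger(model)         # bigger network, or train longer
        elif dev_err - train_err > var_tol:    # high variance: not generalizing
            data = get_more_data(data)         # more data if you can get it...
            model = add_regularization(model)  # ...and/or regularization (next video)
        else:
            break                              # low bias and low variance: done
    return model
```

The point of the sketch is simply that the two branches are treated separately, which is the "much less of a tradeoff" observation made above.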
Regularization - 9m
0:00
If you suspect your neural network is overfitting your data, that is, you have a high variance problem, one of the first things you should try is probably regularization. The other way to address high variance is to get more training data; that's also quite reliable. But you can't always get more training data, or it could be expensive to get more data. But adding regularization will often help to prevent overfitting, or to reduce the errors in your network. So let's see how regularization works. Let's develop these ideas using logistic regression. Recall that for logistic regression, you try to minimize the cost function J, which is defined as this cost function: a sum over your training examples of the losses of the individual predictions on the different examples, where you recall that w and b, in logistic regression, are the parameters. So w is an nx-dimensional parameter vector, and b is a real number. And so to add regularization to logistic regression, what you do is add to it this term: lambda/2m times the norm of w squared, where lambda is called the regularization parameter (I'll say more about that in a second). So here, the norm of w squared is just equal to the sum from j equals 1 to nx of wj squared, or this can also be written w transpose w; it's just the squared Euclidean norm of the parameter vector w. And this is called L2 regularization.
1:33
Because here you're using the Euclidean norm, also called the L2 norm, of the parameter vector w. Now, why do you regularize just the parameter w? Why don't we add something here about b as well? In practice, you could do this, but I usually just omit it. Because if you look at your parameters, w is usually a pretty high dimensional parameter vector, especially with a high variance problem; maybe w just has a lot of parameters, so you aren't fitting all the parameters well, whereas b is just a single number. So almost all the parameters are in w rather than b. And if you add this last term, in practice it won't make much of a difference, because b is just one parameter among a very large number of parameters. In practice, I usually just don't bother to include it, but you can if you want. So L2 regularization is the most common type of regularization. You might have also heard some people talk about L1 regularization. That's when, instead of this L2 norm, you add a term that is lambda/m times the sum of the absolute values of the components of w. This is also called the L1 norm of the parameter vector w, denoted with a little subscript 1. And I guess whether you put m or 2m in the denominator is just a scaling constant. If you use L1 regularization, then w will end up being sparse, and what that means is that the w vector will have a lot of zeros in it. And some people say that this can help with compressing the model, because if some of the parameters are zero, you need less memory to store the model. Although I find that, in practice, L1 regularization, to make your model sparse, helps only a little bit, so I don't think it's used that much, at least not for the purpose of compressing your model. And when people train neural networks, L2 regularization is just used much, much more often. Sorry, just fixing up some of the notation here. So one last detail: lambda here is called the regularization parameter.
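Written out, the regularized costs described verbally above are (this is just a restatement of the lecture's formulas, not new material):

```latex
% L2-regularized logistic regression cost
J(w, b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big)
          + \frac{\lambda}{2m}\,\lVert w\rVert_2^2,
\qquad
\lVert w\rVert_2^2 = \sum_{j=1}^{n_x} w_j^2 = w^{T}w

% L1 regularization instead adds (the m vs. 2m factor is just a scaling constant)
\frac{\lambda}{m}\,\lVert w\rVert_1 = \frac{\lambda}{m}\sum_{j=1}^{n_x} \lvert w_j\rvert
```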
3:45
And usually, you set this using your development set, or using hold-out cross validation, where you try a variety of values and see what does best, in terms of trading off between doing well on your training set versus also keeping the L2 norm of your parameters small, which helps prevent overfitting. So lambda is another hyperparameter that you might have to tune. And by the way, for the programming exercises, lambda is a reserved keyword in the Python programming language. So in the programming exercise, we'll have lambd,
4:19
without the a, so as not to clash with the reserved keyword in Python. So we use lambd to represent the lambda regularization parameter.
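For example, a minimal sketch of the L2 cost term in Python might look like the following (the function wrapper and argument names are my own, chosen for illustration):

```python
import numpy as np

def l2_cost_term(w, lambd, m):
    """L2 regularization term added to the logistic regression cost.

    lambd is the regularization parameter (spelled "lambd" because
    "lambda" is a reserved keyword in Python); m is the number of
    training examples.
    """
    return (lambd / (2 * m)) * np.sum(np.square(w))
```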
4:29
So this is how you implement L2 regularization for logistic regression. How about a neural network? In a neural network, you have a cost function that's a function of all of your parameters, w[1], b[1] through w[L], b[L], where capital L is the number of layers in your neural network. And so the cost function is this: the sum of the losses, summed over your m training examples. And to add regularization, you add lambda over 2m times the sum, over all of your parameter matrices w[l], of their squared norm, where this squared norm of a matrix is defined as the sum over i, sum over j, of each of the elements of that matrix, squared. And if you want the indices of this summation, this is the sum from i=1 through n[l-1], the sum from j=1 through n[l], because w is an n[l-1] by n[l] dimensional matrix, where these are the numbers of units in layer [l-1] and layer [l]. So this matrix norm, it turns out, is called the Frobenius norm of the matrix, denoted with an F in the subscript. For arcane linear algebra technical reasons, this is not called the L2 norm of a matrix; instead, it's called the Frobenius norm. I know it sounds like it would be more natural to just call it the L2 norm of the matrix, but for really arcane reasons that you don't need to know, by convention this is called the Frobenius norm. It just means the sum of squares of the elements of a matrix. So how do you implement gradient descent with this? Previously, we would compute dw using backprop, where backprop would give us the partial derivative of J with respect to w, or really w[l] for any given layer l. And then you update w[l] as w[l] minus the learning rate times dw[l]. So this is before we added this extra regularization term to the objective. Now that we've added this regularization term to the objective, what you do is take dw and add to it lambda/m times w[l], and then you just compute this update, same as before. And it turns out that with this new definition of dw[l], this new dw[l] is still a correct definition of the derivative of your cost function with respect to your parameters, now that you've added the extra regularization term at the end.
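Here is a minimal sketch of how those two pieces might look in NumPy, assuming the weight matrices and gradients live in dictionaries keyed "W1", "dW1", and so on; that dictionary layout is an assumption modeled loosely on the course's programming exercises, not the exact notebook code.

```python
import numpy as np

def l2_regularization_cost(parameters, lambd, m, L):
    """lambda/(2m) times the sum of squared Frobenius norms of all weight matrices."""
    frob = sum(np.sum(np.square(parameters["W" + str(l)])) for l in range(1, L + 1))
    return (lambd / (2 * m)) * frob

def update_parameters_with_l2(parameters, grads, lambd, m, learning_rate, L):
    """Gradient descent step where each dW[l] gets the extra (lambda/m) * W[l] term."""
    for l in range(1, L + 1):
        dW = grads["dW" + str(l)] + (lambd / m) * parameters["W" + str(l)]
        parameters["W" + str(l)] -= learning_rate * dW
        parameters["b" + str(l)] -= learning_rate * grads["db" + str(l)]
    return parameters
```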
7:29
And it's for this reason that L2 regularization is sometimes also called weight decay. So if I take this definition of dw[l] and just plug it in here, then you see that the update is w[l] = w[l] minus the learning rate alpha times the thing from backprop,
7:54
plus lambda over m times w[l], with the minus sign distributing over that whole bracket. And so this is equal to w[l] minus alpha lambda / m times w[l], minus alpha times the thing you got from backprop. And so this term shows that whatever the matrix w[l] is, you're going to make it a little bit smaller, right? This is actually as if you're taking the matrix w and multiplying it by 1 minus alpha lambda/m. You're really taking the matrix w and subtracting alpha lambda/m times itself; that is, you're multiplying the matrix w by this number, which is going to be a little bit less than 1. So this is why L2 norm regularization is also called weight decay: it's just like ordinary gradient descent, where you update w by subtracting alpha times the original gradient you got from backprop, but now you're also multiplying w by this thing, which is a little bit less than 1. So the alternative name for L2 regularization is weight decay. I'm not really going to use that name, but the intuition for why it's called weight decay is that this first term here is equal to this: you're just multiplying the weight matrix by a number slightly less than 1. So that's how you implement L2 regularization in a neural network.
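Putting the algebra above in one place, the update with L2 regularization can be written as (restating the lecture's derivation):

```latex
dW^{[l]} = (\text{from backprop}) + \frac{\lambda}{m} W^{[l]}

W^{[l]} := W^{[l]} - \alpha\, dW^{[l]}
         = W^{[l]} - \alpha\Big[(\text{from backprop}) + \frac{\lambda}{m} W^{[l]}\Big]
         = \Big(1 - \frac{\alpha\lambda}{m}\Big) W^{[l]} - \alpha\,(\text{from backprop})
```

The factor \(1 - \alpha\lambda/m\), slightly less than 1, is the "decay" applied to the weights on every step.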
9:29
Now, one question that people have asked me is: hey, Andrew, why does regularization prevent overfitting? Let's look at the next video and gain some intuition for how regularization prevents overfitting.
Why regularization reduces overfitting? - 7m
0:00
Why does regularization help with overfitting? Why does it help with reducing variance problems? Let's go through a couple of examples to gain some intuition about how it works. So, recall our pictures of high bias and high variance from the earlier video; they looked something like this. Now, let's say you're fitting a large and deep neural network. I know I haven't drawn this one too large or too deep, but think of some neural network that is currently overfitting. So you have some cost function J of w, b equals the sum of the losses. What we did for regularization was add this extra term that penalizes the weight matrices for being too large; that was the Frobenius norm. So why is it that shrinking the L2 norm, or the Frobenius norm, of the parameters might cause less overfitting? One piece of intuition is that if you crank the regularization parameter lambda to be really, really big, you'll be really incentivized to set the weight matrices w to be reasonably close to zero. So one piece of intuition is that maybe it sets the weights so close to zero for a lot of hidden units that it's basically zeroing out a lot of the impact of these hidden units. And if that's the case, then this much simplified neural network becomes a much smaller neural network; in fact, it is almost like a logistic regression unit, but stacked multiple layers deep. And so that will take you from this overfitting case much closer to the high bias case on the left. But hopefully there'll be an intermediate value of lambda that results in something closer to this "just right" case in the middle. So the intuition is that by cranking up lambda to be really big, you'll set w close to zero, which in practice isn't actually what happens; we can think of it as zeroing out, or at least reducing, the impact of a lot of the hidden units, so you end up with what might feel like a simpler network, which gets closer and closer to as if you were just using logistic regression. The intuition of completely zeroing out a bunch of hidden units isn't quite right. It turns out that what actually happens is that you'll still use all the hidden units, but each of them will just have a much smaller effect. But you do end up with a simpler network, as if you had a smaller network that is therefore less prone to overfitting. A lot of this intuition will make more sense when you implement regularization in the programming exercise and actually see some of these variance reduction results yourself. Here's another attempt at additional intuition for why regularization helps prevent overfitting. For this, I'm going to assume that we're using the tanh activation function, which looks like this; this is g of z equals tanh of z. If that's the case, notice that so long as z is quite small, so if z takes on only a smallish range of values, maybe around here, then you're just using the linear regime of the tanh function. It's only if z is allowed to wander up to larger values or down to smaller values that the activation function starts to become less linear. So the intuition you might take away from this is that if lambda, the regularization parameter, is large, then your parameters will be relatively small, because they are penalized for being large in the cost function. And so if the weights w are small, then because z is equal to w times a (plus b, technically), if w tends to be very small, then z will also be relatively small.
And in particular, if z ends up taking relatively small values, just in this whole range, then g of z will be roughly linear. So it's as if every layer will be roughly linear, as if it is just linear regression. And we saw in Course 1 that if every layer is linear, then your whole network is just a linear network. And so even a very deep network with a linear activation function is, in the end, only able to compute a linear function. So it's not able to fit those very complicated, very non-linear decision boundaries that allow it to really overfit to data sets like we saw in the overfitting, high variance case on the previous slide. So just to summarize, if the regularization parameter becomes very large, the parameters w become very small, so z will be relatively small (kind of ignoring the effects of b for now); so z will be relatively small or, really, I should say, it takes on a small range of values. And so the activation function, if it is tanh, say, will be relatively linear. And so your whole neural network will be computing something not too far from a big linear function, which is therefore a pretty simple function, rather than a very complex, highly non-linear function. And so it is also much less able to overfit. And again, when you implement regularization for yourself in the programming exercise, you'll be able to see some of these effects yourself. Before wrapping up our discussion on regularization, I just want to give you one implementational tip. When implementing regularization, we took our definition of the cost function J and we actually modified it by adding this extra term that penalizes the weights for being too large. And so if you implement gradient descent, one of the steps to debug gradient descent is to plot the cost function J as a function of the number of iterations of gradient descent, and you want to see that the cost function J decreases monotonically after every iteration of gradient descent. And if you're implementing regularization, then please remember that J now has this new definition. If you plot the old definition of J, just this first term, then you might not see a monotonic decrease. So to debug gradient descent, make sure that you're plotting this new definition of J that includes this second term as well. Otherwise you might not see J decrease monotonically on every single iteration. So that's it for L2 regularization, which is actually the regularization technique that I use the most in training deep learning models. In deep learning there is another sometimes-used regularization technique called dropout regularization. Let's take a look at that in the next video.
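The "roughly linear for small z" intuition can also be stated with the Taylor expansion of tanh; this is a standard calculus fact added here for reference, not something from the lecture:

```latex
g(z) = \tanh(z) = z - \frac{z^3}{3} + \frac{2z^5}{15} - \cdots \;\approx\; z
\quad \text{when } |z| \text{ is small}
```

So if a large lambda keeps the weights, and hence z = Wa + b, small, each layer computes something close to a linear function of its input, and a stack of (nearly) linear layers is itself (nearly) linear.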
Dropout Regularization - 9m
0:00
In addition to L2 regularization, another very powerful regularization technique is called "dropout." Let's see how that works. Let's say you train a neural network like the one on the left, and there's overfitting. Here's what you do with dropout. Let me make a copy of the neural network. With dropout, what we're going to do is go through each of the layers of the network and set some probability of eliminating a node in the neural network. Let's say that for each of these layers, we're going to, for each node, toss a coin and have a 0.5 chance of keeping each node and a 0.5 chance of removing each node. So, after the coin tosses, maybe we'll decide to eliminate those nodes; then what you do is actually remove all the outgoing links from that node as well. So you end up with a much smaller, really much diminished network. And then you do back propagation training with one example on this much diminished network. And then on different examples, you would toss a set of coins again and keep a different set of nodes, and then drop out, or eliminate, different nodes. And so for each training example, you would train it using one of these diminished networks. It may seem like a slightly crazy technique; you're just knocking out nodes at random. But this actually works. And you can imagine that, because you're training a much smaller network on each example, that may give you a sense for why you end up able to regularize the network: these much smaller networks are being trained. Let's look at how you implement dropout. There are a few ways of implementing dropout. I'm going to show you the most common one, which is a technique called inverted dropout. For the sake of completeness, let's say we want to illustrate this with layer l=3. So, in the code I'm going to write, there will be a bunch of 3s here; I'm just illustrating how to represent dropout in a single layer. What we are going to do is set a vector d, and d3 is going to be the dropout vector for layer 3; that's what the 3 is for. d3 will be np.random.rand, with the same shape as a3, compared against some number which I'm going to call keep_prob. And so keep_prob is a number; it was 0.5 in the earlier example, and maybe now I'll use 0.8 in this example, and it will be the probability that a given hidden unit will be kept. So if keep_prob = 0.8, then this means that there's a 0.2 chance of eliminating any hidden unit. So, what this does is generate a random matrix, and this works as well if you have vectorized, so d3 will be a matrix. For each example and each hidden unit, there's a 0.8 chance that the corresponding entry of d3 will be one, and a 20% chance it will be zero: these random numbers, being less than 0.8, have an 0.8 chance of being one, or true, and a 0.2 chance of being false, or zero. And then what you are going to do is take your activations from the third layer, let me just call them a3 in this example. So, a3 has the activations you computed. And you set a3 to be equal to the old a3 times d3; this is element-wise multiplication, and you can also write it as a3 *= d3. But what this does is, for every element of d3 that's equal to zero (and there was a 20% chance of each of the elements being zero), the multiply operation ends up zeroing out the corresponding element of a3. If you do this in Python, technically d3 will be a boolean array with values true and false, rather than one and zero.
But the multiply operation works and will interpret the true and false values as one and zero; if you try this yourself in Python, you'll see. Then finally, we're going to take a3 and scale it up by dividing by 0.8, or really dividing by our keep_prob parameter. So, let me explain what this final step is doing. Let's say for the sake of argument that you have 50 units, or 50 neurons, in the third hidden layer. So maybe a3 is 50 by 1 dimensional, or, with vectorization, maybe it's 50 by m dimensional. So, if you have an 80% chance of keeping them and a 20% chance of eliminating them, this means that on average, you end up with 10 units shut off, or 10 units zeroed out. And so now, if you look at the value of z[4], z[4] is going to be equal to w[4] * a[3] + b[4]. And so, in expectation, this will be reduced by 20%, by which I mean that 20% of the elements of a3 will be zeroed out. So, in order to not reduce the expected value of z[4], what you need to do is take this and divide it by 0.8, because this will correct for, or just bump back up by, roughly the 20% that you need, so that the expected value of a3 is not changed. And so this line here is what's called the inverted dropout technique. Its effect is that, no matter what you set keep_prob to, whether it's 0.8 or 0.9 or even 1 (if it's set to 1 then there's no dropout, because it's keeping everything) or 0.5 or whatever, this inverted dropout technique, by dividing by keep_prob, ensures that the expected value of a3 remains the same. And it turns out that at test time, when you're trying to evaluate a neural network (which we'll talk about on the next slide), this inverted dropout technique, this line where you divide by keep_prob, makes test time easier, because you have less of a scaling problem. By far the most common implementation of dropout today, as far as I know, is inverted dropout; I recommend you just implement this. But there were some early iterations of dropout that missed this divide-by-keep_prob line, and so at test time the averaging becomes more complicated. But again, people tend not to use those other versions. So, what you do is use the d vector, and you'll notice that for different training examples, you zero out different hidden units. And in fact, if you make multiple passes through the same training set, then on different passes through the training set, you should randomly zero out different hidden units. So, it's not that for one example you should keep zeroing out the same hidden units; it's that, on iteration one of gradient descent, you might zero out some hidden units, and on the second iteration of gradient descent, where you go through the training set the second time, maybe you'll zero out a different pattern of hidden units. And the vector d, or d3 for the third layer, is used to decide what to zero out, both in forward prop as well as in back prop; we are just showing forward prop here. Now, having trained the algorithm, at test time here's what you would do. At test time, you're given some x on which you want to make a prediction. And using our standard notation, I'm going to use a[0], the activations of the zeroth layer, to denote the test example x. So what we're going to do is not use dropout at test time. In particular: z[1] = w[1] a[0] + b[1], a[1] = g[1](z[1]), z[2] = w[2] a[1] + b[2], a[2] = g[2](z[2]), and so on, until you get to the last layer and you make a prediction y-hat.
But notice that at test time you're not using dropout explicitly and you're not tossing coins at random; you're not flipping coins to decide which hidden units to eliminate. And that's because when you are making predictions at test time, you don't really want your output to be random. If you were implementing dropout at test time, that would just add noise to your predictions. In theory, one thing you could do is run the prediction process many times, with different hidden units randomly dropped out, and average across them. But that's computationally inefficient and will give you roughly the same result, very, very similar results, to this procedure. And just to mention, the inverted dropout thing: remember the step on the previous slide where we divided by keep_prob. The effect of that was to ensure that, even when you don't implement dropout at test time and don't do the scaling, the expected value of these activations doesn't change. So you don't need to add in an extra funny scaling parameter at test time that's different from what you had at training time. So that's dropout. And when you implement this in this week's programming exercise, you'll gain more firsthand experience with it as well. But why does it really work? What I want to do in the next video is give you some better intuition about what dropout really is doing. Let's go on to the next video.
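Here is a minimal sketch of the inverted dropout step for layer 3, following the steps described above; the function wrapper and return values are my own choices for illustration, not the exact course notebook code.

```python
import numpy as np

def inverted_dropout(a3, keep_prob=0.8):
    """Apply inverted dropout to the layer-3 activations a3 of shape (n3, m)."""
    # d3: boolean mask, True with probability keep_prob for each unit and example
    d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob
    a3 = a3 * d3          # zero out the dropped units
    a3 = a3 / keep_prob   # scale back up so the expected value of a3 is unchanged
    return a3, d3         # keep d3 so the same mask can be applied in backprop

# At test time you do NOT apply dropout, and no extra scaling is needed,
# precisely because of the division by keep_prob during training.
```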
Understanding Dropout - 7m
0:00
Dropout does this seemingly crazy thing of randomly knocking out units in your network. Why does it work so well as a regularizer? Let's gain some better intuition. In the previous video, I gave this intuition that dropout randomly knocks out units in your network, so it's as if on every iteration you're working with a smaller neural network, and using a smaller neural network seems like it should have a regularizing effect. Here's a second intuition, which is, let's look at it from the perspective of a single unit. Let's say this one. Now, for this unit to do its job, it has four inputs, and it needs to generate some meaningful output. Now with dropout, the inputs can get randomly eliminated. Sometimes those two units will get eliminated, sometimes a different unit will get eliminated. So, what this means is that this unit, which I'm circling in purple, can't rely on any one feature, because any one feature could go away at random, or any one of its own inputs could go away at random. So in particular, it would be reluctant to put all of its bets on, say, just this input, right? It would be reluctant to put too much weight on any one input, because it can go away. So this unit will be more motivated to spread out its weights and give a little bit of weight to each of the four inputs to this unit. And by spreading out the weights, this will tend to have the effect of shrinking the squared norm of the weights. And so, similar to what we saw with L2 regularization, the effect of implementing dropout is that it shrinks the weights, and does something similar to L2 regularization that helps prevent overfitting. It turns out that dropout can formally be shown to be an adaptive form of L2 regularization, where the L2 penalty on different weights is different, depending on the size of the activations being multiplied by those weights. But to summarize, it is possible to show that dropout has a similar effect to L2 regularization, only the L2 regularization applied to different weights can be a little bit different, and so it is even more adaptive to the scale of different inputs. One more detail for when you're implementing dropout. Here's a network where you have three input features. There are seven hidden units here, then seven, three, two, one. So, one of the parameters we had to choose was keep_prob, which is the chance of keeping a unit in each layer. It is also feasible to vary keep_prob by layer. So for the first layer, your matrix W1 will be three by seven, your second weight matrix will be seven by seven, W3 will be seven by three, and so on. And so W2 is actually the biggest weight matrix, because the largest set of parameters is in W2, which is seven by seven. So to reduce overfitting of that matrix, maybe for this layer (I guess this is layer two) you might have a keep_prob that's relatively low, say 0.5, whereas for different layers, where you might worry less about overfitting, you could have a higher keep_prob, maybe 0.7. And for layers where we don't worry about overfitting at all, you can have a keep_prob of 1.0. For clarity, these are the numbers I'm drawing in the purple boxes; these could be different keep_probs for different layers. Notice that a keep_prob of 1.0 means that you're keeping every unit, and so you're really not using dropout for that layer.
But for layers where you're more worried about overfitting, really the layers with a lot of parameters, you can set keep_prob to be smaller to apply a more powerful form of dropout. It's a bit like cranking up the regularization parameter lambda of L2 regularization, where you try to regularize some layers more than others. And technically, you can also apply dropout to the input layer, where you have some chance of just zeroing out one or more of the input features, although in practice you usually don't do that very often. And so a keep_prob of 1.0 is quite common for the input layer. You could also use a very high value, maybe 0.9, but it's much less likely that you want to eliminate half of the input features. So usually keep_prob for the input layer, if you apply dropout at all there, will be a number close to 1. So just to summarize, if you're more worried about some layers overfitting than others, you can set a lower keep_prob for some layers than others. The downside is that this gives you even more hyperparameters to search over using cross-validation. One other alternative might be to have some layers where you apply dropout and some layers where you don't, and then just have one hyperparameter, which is the keep_prob for the layers to which you do apply dropout. And before we wrap up, just a couple of implementational tips. Many of the first successful implementations of dropout were in computer vision. In computer vision, the input size is so big, inputting all these pixels, that you almost never have enough data. And so dropout is very frequently used in computer vision, and there are some computer vision researchers who pretty much always use it, almost as a default. But really the thing to remember is that dropout is a regularization technique; it helps prevent overfitting. And so, unless my algorithm is overfitting, I wouldn't actually bother to use dropout. So it's used somewhat less often in other application areas. It's just that with computer vision, you usually don't have enough data, so you're almost always overfitting, which is why there tend to be some computer vision researchers who swear by dropout. But their intuition doesn't always generalize, I think, to other disciplines. One big downside of dropout is that the cost function J is no longer well-defined. On every iteration, you are randomly killing off a bunch of nodes. And so, if you are double-checking the performance of gradient descent, it's actually harder to double-check that you have a well-defined cost function J that is going downhill on every iteration, because the cost function J that you're optimizing is actually less well defined, or is certainly hard to calculate. So you lose this debugging tool of plotting a graph like this. So what I usually do is turn off dropout (you would set keep_prob equal to 1), run my code, and make sure that J is monotonically decreasing; then turn on dropout and hope that I didn't introduce bugs into my code during dropout. Because you then need other ways, I guess, besides plotting these figures, to make sure that your code is working with gradient descent, and that it's working even with dropout. So with that, there are still a few more regularization techniques that are worth knowing. Let's talk about a few more such techniques in the next video.
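One way to express the per-layer idea above is simply to keep one keep_prob per layer. A tiny sketch follows; the layer sizes and values mirror the example above, but the dictionary structure is my own illustration.

```python
# Network from the example: 3 inputs -> 7 -> 7 -> 3 -> 2 -> 1
# keep_prob per hidden layer: regularize the big 7x7 weight matrix (layer 2) hardest.
keep_probs = {1: 1.0,   # first layer: little worry about overfitting here
              2: 0.5,   # layer with the most parameters: strongest dropout
              3: 0.7,
              4: 1.0,   # keep_prob of 1.0 means no dropout for this layer
              5: 1.0}

# Debugging tip from the lecture: first set every keep_prob to 1.0,
# check that the cost J decreases monotonically, then turn dropout back on.
debug_keep_probs = {layer: 1.0 for layer in keep_probs}
```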
Other regularization methods - 8m
0:00
In addition to L2 regularization and dropout regularization, there are a few other techniques for reducing overfitting in your neural network. Let's take a look. Let's say you're fitting a cat classifier. If you are overfitting, getting more training data can help, but getting more training data can be expensive, and sometimes you just can't get more data. But what you can do is augment your training set by taking an image like this and, for example, flipping it horizontally and adding that also to your training set. So now, instead of just this one example in your training set, you can add this as a training example as well. So by flipping the images horizontally, you could double the size of your training set. Because your training set is now a bit redundant, this isn't as good as if you had collected an additional set of brand new independent examples, but you could do this without needing to pay the expense of going out to take more pictures of cats. And then, other than flipping horizontally, you can also take random crops of the image. So here we've rotated and sort of randomly zoomed into the image, and this still looks like a cat. So by taking random distortions and translations of the image, you could augment your data set and make additional fake training examples. Again, these extra fake training examples don't add as much information as if you went out and got a brand new independent example of a cat. But because you can do this almost for free, other than some computational cost, this can be an inexpensive way to give your algorithm more data and therefore sort of regularize it and reduce overfitting. And by synthesizing examples like this, what you're really telling your algorithm is that if something is a cat, then flipping it horizontally is still a cat. Notice I didn't flip it vertically, because maybe we don't want upside-down cats, right? And then also maybe randomly zooming in to part of the image, it's probably still a cat. For optical character recognition, you can also augment your data set by taking digits and imposing random rotations and distortions on them. So if you add these things to your training set, these are still digit fours.
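A minimal sketch of the horizontal-flip and random-crop augmentations described above, for images stored as NumPy arrays of shape (height, width, channels); the function name and the 90% crop size are arbitrary choices for illustration, not from the course.

```python
import numpy as np

def augment(image, crop_frac=0.9, seed=None):
    """Return a horizontally flipped copy and a random crop of `image`."""
    rng = np.random.default_rng(seed)

    flipped = image[:, ::-1, :]                      # mirror left-right: still a cat

    h, w = image.shape[:2]
    ch, cw = int(h * crop_frac), int(w * crop_frac)  # crop size
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    cropped = image[top:top + ch, left:left + cw, :] # random zoom into the image

    return flipped, cropped
```

You would not flip vertically here, for the reason given above: an upside-down cat is probably not what your users will upload.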
2:14
For illustration I applied a very strong distortion, so this four looks very wavy. In practice you don‘t need to distort the four quite as aggressively; I just made the distortion stronger here to make the example clearer for you. A more subtle distortion is usually used in practice, because these look like really warped fours. So data augmentation can be used as a regularization technique, in fact something similar to regularization. There‘s one other technique that is often used, called early stopping. What you‘re going to do is, as you run gradient descent, you plot either the training error, using 0-1 classification error on the training set, or just the cost function J that you‘re optimizing, and that should decrease monotonically, like so, right? Because as you train, hopefully, your cost function J should decrease. So with early stopping, what you do is plot this, and you also plot your dev set error.
3:17
And again, this could be a classification error on the dev set, or something like the cost function, such as the logistic loss or the log loss on the dev set. Now, what you find is that your dev set error will usually go down for a while and then increase from there. So what early stopping does is say, well, it looks like your neural network was doing best around that iteration, so let‘s just stop training your neural network halfway and take whatever value achieved this dev set error. Why does this work? Well, when you haven‘t run many iterations for your neural network yet, your parameters w will be close to zero, because with random initialization you probably initialize w to small random values. So before you train for a long time, w is still quite small. And as you iterate, as you train, w will get bigger and bigger until, over here, you maybe have a much larger value of the parameters w for your neural network. So what early stopping does is, by stopping halfway, you have only a mid-size value of w. And so, similar to L2 regularization, by picking a neural network with a smaller norm for your parameters w, hopefully your neural network is overfitting less. And the term early stopping refers to the fact that you‘re just stopping the training of your neural network earlier. I sometimes use early stopping when training a neural network, but it does have one downside; let me explain. I think of the machine learning process as comprising several different steps. One is that you want an algorithm to optimize the cost function J, and we have various tools to do that, such as gradient descent. And then we‘ll talk later about other algorithms, like momentum and RMSprop and Adam and so on. But after optimizing the cost function J, you also want to not overfit, and we have some tools to do that, such as regularization, getting more data, and so on. Now, in machine learning we already have so many hyperparameters to search over; it‘s already very complicated to choose among the space of possible algorithms. And so I find machine learning easier to think about when you have one set of tools for optimizing the cost function J. When you‘re focusing on optimizing the cost function J, all you care about is finding w and b so that J(w,b) is as small as possible; you just don‘t think about anything else other than reducing this. And then it‘s a completely separate task to not overfit, in other words, to reduce variance, and when you‘re doing that, you have a separate set of tools for doing it. This principle is sometimes called orthogonalization, and it‘s the idea that you want to be able to think about one task at a time. I‘ll say more about orthogonalization in a later video, so if you don‘t fully get the concept yet, don‘t worry about it. But to me, the main downside of early stopping is that it couples these two tasks, so you no longer can work on the two problems independently. Because by stopping gradient descent early, you‘re sort of breaking whatever you‘re doing to optimize the cost function J; now you‘re not doing as great a job of reducing the cost function J, and you‘re also simultaneously trying to not overfit. So instead of using different tools to solve the two problems, you‘re using one that kind of mixes the two. And this just makes the set of
6:52
things you could try more complicated to think about. Rather than using early stopping, one alternative is to just use L2 regularization; then you can train the neural network as long as possible. I find that this makes the search space of hyperparameters easier to decompose and easier to search over. But the downside of this is that you might have to try a lot of values of the regularization parameter lambda, and this makes searching over many values of lambda more computationally expensive. The advantage of early stopping is that by running the gradient descent process just once, you get to try out values of small w, mid-size w, and large w, without needing to try a lot of values of the L2 regularization hyperparameter lambda. If this concept doesn‘t completely make sense to you yet, don‘t worry about it; we‘re going to talk about orthogonalization in greater detail in a later video, and I think it will make a bit more sense then. Despite its disadvantages, many people do use early stopping. I personally prefer to just use L2 regularization and try different values of lambda, assuming you can afford the computation to do so. But early stopping does let you get a similar effect without needing to explicitly try lots of different values of lambda. So you‘ve now seen how to use data augmentation, as well as, if you wish, early stopping, in order to reduce variance or prevent overfitting in your neural network. Next, let‘s talk about some techniques for setting up your optimization problem to make your training go quickly.
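Here is a minimal sketch of what early stopping might look like in code: keep the parameters that gave the lowest dev-set error so far and stop once the dev error has not improved for a while. The helpers train_one_epoch and eval_dev_error, and the patience parameter, are assumptions for illustration, not functions from the course.

```python
import copy

def train_with_early_stopping(params, train_one_epoch, eval_dev_error,
                              max_epochs=100, patience=5):
    """Stop training once the dev-set error has not improved for `patience` epochs."""
    best_error = float("inf")
    best_params = copy.deepcopy(params)
    epochs_without_improvement = 0
    for _ in range(max_epochs):
        params = train_one_epoch(params)          # one pass of gradient descent over the data
        dev_error = eval_dev_error(params)        # e.g. classification error on the dev set
        if dev_error < best_error:
            best_error = dev_error
            best_params = copy.deepcopy(params)   # remember the best-so-far parameters
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                             # dev error has stopped improving: stop early
    return best_params
```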
Normalizing inputs - 5m
0:00
When training a neural network, one of the techniques that will speed up your training is normalizing your inputs. Let‘s see what that means. Suppose you have a training set with two input features, so the input features x are two dimensional, and here‘s a scatter plot of your training set. Normalizing your inputs corresponds to two steps. The first is to subtract out, or zero out, the mean: you set mu = 1/m times the sum over i of x(i). This is a vector, and then x gets set to x minus mu for every training example, which means you just move the training set until it has zero mean. The second step is to normalize the variances. Notice here that the feature x1 has a much larger variance than the feature x2. So what we do is set sigma squared = 1/m times the sum over i of x(i) squared, where the squaring is element-wise. So sigma squared is a vector with the variances of each of the features, and notice that we‘ve already subtracted out the mean, so the element-wise square of x(i) gives exactly the variances. Then you take each example and divide it element-wise by sigma, the square root of these variances. And so, in pictures, you end up with this, where now the variances of x1 and x2 are both equal to one.
1:35
And one tip: if you use this to scale your training data, then use the same mu and sigma squared to normalize your test set, right? In particular, you don‘t want to normalize the training set and the test set differently. Whatever this value is and whatever this value is, use them in these two formulas so that you scale your test set in exactly the same way, rather than estimating mu and sigma squared separately on your training set and test set. Because you want your data, both training and test examples, to go through the same transformation defined by the same mu and sigma squared calculated on your training data. So, why do we do this? Why do we want to normalize the input features? Recall that the cost function is defined as written on the top right. It turns out that if you use unnormalized input features, it‘s more likely that your cost function will look like this, a very squished-out bowl, a very elongated cost function, where the minimum you‘re trying to find is maybe over there. If your features are on very different scales, say the feature x1 ranges from 1 to 1,000 and the feature x2 ranges from 0 to 1, then it turns out that the range of values for the parameters w1 and w2 will end up being very different. And so maybe these axes should be w1 and w2, but I‘ll plot w and b; your cost function can be a very elongated bowl like that. So if you plot the contours of this function, you can have a very elongated function like that. Whereas if you normalize the features, then your cost function will on average look more symmetric. If you‘re running gradient descent on a cost function like the one on the left, then you might have to use a very small learning rate, because if you start here, gradient descent might need a lot of steps to oscillate back and forth before it finally finds its way to the minimum. Whereas if you have more spherical contours, then wherever you start, gradient descent can pretty much go straight to the minimum. You can take much larger steps with gradient descent rather than needing to oscillate around like the picture on the left. Of course, in practice w is a high-dimensional vector, so trying to plot this in 2D doesn‘t convey all the intuitions correctly. But the rough intuition is that your cost function will be more round and easier to optimize when your features are all on similar scales. Not from 1 to 1,000 and from 0 to 1, but mostly from minus one to one, or with about similar variances to each other. That just makes your cost function J easier and faster to optimize. In practice, if one feature, say x1, ranges from zero to one, x2 ranges from minus one to one, and x3 ranges from one to two, these are fairly similar ranges, so this will work just fine. It‘s when they‘re on dramatically different ranges, like one from 1 to 1,000 and another from 0 to 1, that it really hurts your optimization algorithm. But by just setting all of them to zero mean and, say, variance one, like we did in the last slide, you guarantee that all your features are on a similar scale, which will usually help your learning algorithm run faster. So, if your input features came from very different scales, maybe some features from 0 to 1, some from 1 to 1,000, then it‘s important to normalize your features. If your features came in on similar scales, then this step is less important.
Although performing this type of normalization pretty much never does any harm, I‘ll often do it anyway even if I‘m not sure whether or not it will help speed up training for your algorithm.
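A minimal numpy sketch of the two normalization steps, assuming the course‘s convention that X has shape (n_x, m) with one column per example; note that the same mu and sigma squared computed on the training set are reused for the test set.

```python
import numpy as np

def normalize_inputs(X_train, X_test):
    """Zero-mean, unit-variance normalization; X has shape (n_x, m), one column per example."""
    mu = np.mean(X_train, axis=1, keepdims=True)                   # per-feature mean
    sigma2 = np.mean((X_train - mu) ** 2, axis=1, keepdims=True)   # per-feature variance
    X_train_norm = (X_train - mu) / np.sqrt(sigma2 + 1e-8)         # small constant avoids divide-by-zero
    X_test_norm = (X_test - mu) / np.sqrt(sigma2 + 1e-8)           # reuse the SAME mu and sigma2
    return X_train_norm, X_test_norm, mu, sigma2
```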
5:22
So that‘s it for normalizing your input features. Next, let‘s keep talking about ways to speed up the training of your neural network.
Vanishing / Exploding gradients - 6m
0:00
One of the problems of training neural networks, especially very deep neural networks, is that of vanishing and exploding gradients. What that means is that when you‘re training a very deep network, your derivatives or your slopes can sometimes get either very, very big or very, very small, maybe even exponentially small, and this makes training difficult. In this video you‘ll see what this problem of exploding and vanishing gradients really means, as well as how you can use careful choices of the random weight initialization to significantly reduce this problem. Say you‘re training a very deep neural network like this; to save space on the slide, I‘ve drawn it as if you have only two hidden units per layer, but it could be more as well. This neural network will have parameters W[1], W[2], W[3], and so on up to W[L]. For the sake of simplicity, let‘s say we‘re using an activation function g(z) = z, a linear activation function, and let‘s ignore b, say b[l] equals zero for every layer. In that case you can show that the output will be W[L] times W[L-1] times W[L-2], dot, dot, dot, down to W[3] W[2] W[1] times x. If you want to just check my math: W[1] times x is going to be z[1], because b is equal to zero, so z[1] is equal to W[1] times x plus b, which is zero. Then a[1] is equal to g(z[1]), but because we use a linear activation function, this is just equal to z[1]. So this first term, W[1] x, is equal to a[1]. And by similar reasoning you can figure out that W[2] times W[1] times x is equal to a[2], because that‘s going to be g(z[2]), which is g of W[2] times a[1], and you can plug a[1] in here. So this thing is equal to a[2], then this thing is a[3], and so on, until the product of all these matrices gives you y-hat, not y. Now, let‘s say that each of your weight matrices W[l] is just a little bit larger than the identity, say 1.5 times the identity, the matrix [[1.5, 0], [0, 1.5]]. Technically, the last one has different dimensions, so let‘s say this applies just to the rest of these weight matrices. Then y-hat will be, ignoring this last matrix with different dimensions, this [[1.5, 0], [0, 1.5]] matrix raised to the power of L minus 1, times x, because we assume that each one of these matrices is equal to 1.5 times the identity. So y-hat will essentially be 1.5 to the power of L minus 1, times x, and if L is large, as for a very deep neural network, y-hat will be very large. In fact, it grows exponentially; it grows like 1.5 to the number of layers. So if you have a very deep neural network, the value of y-hat will explode. Now, conversely, if we replace this with 0.5, something less than 1, then this becomes 0.5 to the power of L minus 1 times x, again ignoring W[L]. So if each of your matrices is less than the identity, then, say x1 and x2 were both one, the activations will be one half, one half, then one fourth, one fourth, one eighth, one eighth, and so on, until this becomes one over two to the L. So the activation values will decrease exponentially as a function of the depth, as a function of the number of layers L of the network. So in a very deep network, the activations end up decreasing exponentially. The intuition I hope you take away from this is that if the weights W are all just a little bit bigger than one, or just a little bit bigger than the identity matrix, then with a very deep network the activations can explode. And if W is just a little bit less than the identity,
so maybe this is 0.9 times the identity, then with a very deep network the activations will decrease exponentially. And even though I went through this argument in terms of activations increasing or decreasing exponentially as a function of L, a similar argument can be used to show that the derivatives, or the gradients you compute in gradient descent, will also increase or decrease exponentially as a function of the number of layers. With some of the modern neural networks, L can be around 150; Microsoft recently got great results with a 152-layer neural network. But with such a deep neural network, if your activations or gradients increase or decrease exponentially as a function of L, then these values can get really big or really small, and this makes training difficult. Especially if your gradients are exponentially small as a function of L, gradient descent will take tiny little steps, and it will take a long time for gradient descent to learn anything. To summarize, you‘ve seen how deep networks suffer from the problems of vanishing or exploding gradients. In fact, for a long time this problem was a huge barrier to training deep neural networks. It turns out there‘s a partial solution that doesn‘t completely solve this problem but helps a lot, which is a careful choice of how you initialize the weights. To see that, let‘s go on to the next video.
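As a quick numeric illustration of the argument above, here is a small sketch assuming linear activations and weight matrices that are a constant times the 2 by 2 identity; the depth of 50 layers is just an illustrative choice.

```python
import numpy as np

x = np.ones((2, 1))                    # a toy input with x1 = x2 = 1
for scale in (1.5, 0.5):
    W = scale * np.eye(2)              # every layer's weight matrix is scale * identity
    a = x
    for _ in range(50):                # L = 50 layers, linear activation: a[l] = W a[l-1]
        a = W @ a
    print(scale, a[0, 0])              # roughly scale**50: ~6e8 for 1.5, ~9e-16 for 0.5
```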
Weight Initialization for Deep Networks - 6m
0:00
In the last video you saw how very deep neural networks can have the problems of vanishing and exploding gradients. It turns out that a partial solution to this, which doesn‘t solve it entirely but helps a lot, is a better or more careful choice of the random initialization for your neural network. To understand this, let‘s start with the example of initializing the weights for a single neuron, and then we‘ll generalize this to a deep network. So a single neuron might take, say, four input features x1 through x4, then you have some a = g(z), and it ends up outputting some y. Later on, for a deeper net, these inputs would actually be activations from some layer a[l], but for now let‘s just call them x. So z is going to be equal to w1x1 + w2x2 + ... + wnxn, and let‘s set b = 0, so let‘s just ignore b for now. In order to make z not blow up and not become too small, you notice that the larger n is, the smaller you want each wi to be, right? Because z is the sum of the wixi, and if you‘re adding up a lot of these terms, you want each of them to be smaller. One reasonable thing to do would be to set the variance of wi to be equal to 1 over n, where n is the number of input features going into the neuron. In practice, what you can do is set the weight matrix W for a certain layer to be np.random.randn, with whatever the shape of the matrix is, times the square root of 1 over the number of features fed into each neuron, which here is n[l-1], because that‘s the number of units feeding into each of the units in layer l. It turns out that if you‘re using a ReLU activation function, then rather than 1 over n, setting the variance to 2 over n works a little bit better. So you often see that initialization, especially if you‘re using a ReLU activation function, so if g[l](z) is ReLU(z). And, depending on how familiar you are with random variables, it turns out that taking a Gaussian random variable and multiplying it by the square root of this sets the variance to be 2 over n. The reason I went from n to n superscript l-1 is that in this example with a single neuron we had n input features, but in the more general case, layer l would have n[l-1] inputs to each of its units. So if the input features or activations are roughly mean 0 and variance 1, then this would cause z to also take on a similar scale. This doesn‘t solve, but it definitely helps reduce, the vanishing and exploding gradients problem, because it tries to set each of the weight matrices W so that it‘s not too much bigger than 1 and not too much less than 1, so activations don‘t explode or vanish too quickly. Let me just mention some other variants. The version we just described assumes a ReLU activation function, and comes from a paper by [inaudible]. A few other variants: if you are using a tanh activation function, then there‘s a paper that shows that instead of using the constant 2 it‘s better to use the constant 1, so 1 over n[l-1] instead of 2 over n[l-1], and you multiply by the square root of that. So that square root term replaces this one, and you use it if you‘re using a tanh activation function. This is called Xavier initialization.
And another version, proposed by Yoshua Bengio and his colleagues, which you might see in some papers, is to use this formula, which has some other theoretical justification. But I would say, if you‘re using a ReLU activation function, which is really the most common activation function, I would use this formula; if you‘re using tanh, you could try the Xavier version instead, and some authors also use the Bengio variant. In practice, I think all of these formulas just give you a starting point; they give you a default value to use for the variance of the initialization of your weight matrices. If you wish, this variance parameter could be another thing that you tune among your hyperparameters, so you could have another multiplier on this formula and tune that multiplier as part of your hyperparameter search. Sometimes tuning this hyperparameter has a modest-sized effect. It‘s not one of the first hyperparameters I would usually try to tune, but I‘ve also seen some problems where tuning it helps a reasonable amount; it‘s usually lower down for me in terms of how important it is relative to the other hyperparameters you can tune. So I hope that gives you some intuition about the problem of vanishing or exploding gradients, as well as how to choose a reasonable scaling for initializing the weights. Hopefully that makes your weights not explode too quickly and not decay to zero too quickly, so you can train a reasonably deep network without the weights or the gradients exploding or vanishing too much. When you train deep networks, this is another trick that will help you make your neural networks train much faster.
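Here is a sketch of the initialization rules just discussed, using np.random.randn as in the lecture; the function name and the activation-based switch are illustrative assumptions, and the variance formulas are the 2 / n^[l-1] (ReLU), 1 / n^[l-1] (tanh, Xavier), and 2 / (n^[l-1] + n^[l]) variants mentioned above.

```python
import numpy as np

def initialize_weights(n_curr, n_prev, activation="relu"):
    """Initialize W[l] of shape (n[l], n[l-1]) with one of the variance rules discussed above."""
    if activation == "relu":
        scale = np.sqrt(2.0 / n_prev)                  # variance 2 / n^[l-1]
    elif activation == "tanh":
        scale = np.sqrt(1.0 / n_prev)                  # Xavier: variance 1 / n^[l-1]
    else:
        scale = np.sqrt(2.0 / (n_prev + n_curr))       # the other variant mentioned above
    W = np.random.randn(n_curr, n_prev) * scale
    b = np.zeros((n_curr, 1))
    return W, b
```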
Numerical approximation of gradients - 6m
0:00
When you implement back propagation, you‘ll find that there‘s a test called gradient checking that can really help you make sure that your implementation of backprop is correct. Because sometimes you write all these equations and you‘re just not 100% sure if you‘ve got all the details right in your implementation of back propagation. So in order to build up to gradient checking, let‘s first talk about how to numerically approximate computations of gradients, and in the next video we‘ll talk about how you can implement gradient checking to make sure your implementation of backprop is correct. So let‘s take the function f and re-plot it here, and remember this is f of theta equals theta cubed, and let‘s again start off with some value of theta, say theta equals 1. Now, instead of just nudging theta to the right to get theta plus epsilon, we‘re going to nudge it both to the right and to the left, to get theta minus epsilon as well as theta plus epsilon. So this is 1, this is 1.01, this is 0.99, where, again, epsilon is the same as before, 0.01. It turns out that rather than taking this little triangle and computing the height over the width, you can get a much better estimate of the gradient if you take this point, f of theta minus epsilon, and this point, f of theta plus epsilon, and instead compute the height over the width of this bigger triangle. For technical reasons which I won‘t go into, the height over the width of this bigger green triangle gives you a much better approximation to the derivative at theta. Whereas before we used only the one triangle in the upper right, it‘s now as if you have two triangles, this one on the upper right and this one on the lower left, and you‘re kind of taking both of them into account by using this bigger green triangle. So rather than a one-sided difference, you‘re taking a two-sided difference. So let‘s work out the math. This point here is f of theta plus epsilon. This point here is f of theta minus epsilon. So the height of this big green triangle is f of theta plus epsilon minus f of theta minus epsilon. And the width: this is one epsilon, this is two epsilon, so the width of this green triangle is 2 epsilon. So the height over the width is going to be f of theta plus epsilon minus f of theta minus epsilon, divided by the width, which is 2 epsilon, as written down here.
2:38
And this should hopefully be close to g of theta. So plug in the values: remember f of theta is theta cubed, so theta plus epsilon is 1.01, take the cube of that, minus the cube of 0.99, divided by 2 times 0.01. Feel free to pause the video and check this on a calculator; you should get that this is 3.0001. Whereas from the previous slide we saw that g of theta, which was 3 theta squared, is 3 when theta is 1, so these two values are actually very close to each other. The approximation error is now 0.0001, whereas on the previous slide, when we took the one-sided difference using just theta and theta plus epsilon, we had gotten 3.0301, so the approximation error was 0.03 rather than 0.0001. So with this two-sided difference way of approximating the derivative, you find that it is extremely close to 3, and this gives you much greater confidence that g of theta is probably a correct implementation of the derivative of f.
3:58
When you use this method for gradient checking in back propagation, it turns out to run about twice as slowly as using a one-sided difference. In practice, I think it‘s worth it to use this method because it‘s just much more accurate. A little bit of optional theory for those of you who are a bit more familiar with calculus, and it‘s okay if you don‘t get what I‘m about to say here: it turns out that the formal definition of a derivative is, for very small values of epsilon, f of theta plus epsilon minus f of theta minus epsilon over 2 epsilon; the derivative is the limit of exactly that formula as epsilon goes to 0. The definition of a limit is something you would have learned if you took a calculus class, but I won‘t go into that here. It turns out that for a non-zero value of epsilon, you can show that the error of this approximation is on the order of epsilon squared, and remember epsilon is a very small number. So if epsilon is 0.01, which it is here, then epsilon squared is 0.0001. The big-O notation means the error is actually some constant times this, but this is exactly our approximation error, so the big-O constant happens to be 1. Whereas in contrast, if we were to use the other formula, the one-sided difference, then the error is on the order of epsilon. And again, when epsilon is a number less than 1, epsilon is actually much bigger than epsilon squared, which is why that formula is a much less accurate approximation than the formula on the left. That is why, when doing gradient checking, we‘d rather use the two-sided difference, where you compute f of theta plus epsilon minus f of theta minus epsilon and then divide by 2 epsilon, rather than just the one-sided difference, which is less accurate.
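As a tiny check of the arithmetic above, here is a sketch comparing the two-sided and one-sided differences for f(theta) = theta cubed at theta = 1 with epsilon = 0.01.

```python
def f(theta):
    return theta ** 3

def g(theta):
    return 3 * theta ** 2              # the analytic derivative

theta, eps = 1.0, 0.01
two_sided = (f(theta + eps) - f(theta - eps)) / (2 * eps)   # ~3.0001, error on the order of eps**2
one_sided = (f(theta + eps) - f(theta)) / eps               # ~3.0301, error on the order of eps
print(two_sided, one_sided, g(theta))
```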
5:53
If you didn‘t understand my last two comments, don‘t worry about it. That‘s really more for those of you who are a bit more familiar with calculus and with numerical approximations. But the takeaway is that the two-sided difference formula is much more accurate, and that‘s what we‘re going to use when we do gradient checking in the next video.
6:13
So you‘ve seen how, by taking a two-sided difference, you can numerically verify whether or not a function g, a g of theta that someone else gives you, is a correct implementation of the derivative of a function f. Let‘s now see how we can use this to verify whether or not your back propagation implementation is correct, or whether there might be a bug in there that you need to go and tease out.
Gradient checking - 6m
0:00
Gradient checking is a technique that‘s helped me save tons of time and helped me find bugs in my implementations of back propagation many times. Let‘s see how you can use it too, to debug, or to verify that your implementation of backprop is correct. So your neural network will have some set of parameters, W[1], b[1], and so on up to W[L], b[L]. To implement gradient checking, the first thing you should do is take all your parameters and reshape them into a giant vector theta. So what you do is take W[1], which is a matrix, and reshape it into a vector; you take all of the Ws and reshape them into vectors, and then concatenate all of these things, so that you have a giant vector theta. So where we said the cost function J was a function of the Ws and bs, you now have the cost function J being just a function of theta. Next, with W and b ordered the same way, you can also take dW[1], db[1], and so on, and concatenate them into a big, giant vector d theta of the same dimension as theta. Same as before, you reshape dW[1], which is a matrix, into a vector; db[1] is already a vector; and you reshape dW[L] and all of the dWs, which are matrices. Remember, dW[1] has the same dimension as W[1], and db[1] has the same dimension as b[1]. So with the same sort of reshaping and concatenation operation, you can reshape all of these derivatives into a giant vector d theta, which has the same dimension as theta. So the question now is: is d theta the gradient, or the slope, of the cost function J? Here‘s how you implement gradient checking, which we often abbreviate to grad check. First, remember that J is now a function of the giant parameter vector theta, so it expands to J being a function of theta 1, theta 2, theta 3, and so on.
2:06
Whatever the dimension of this giant parameter vector theta is. So to implement grad check, what you‘re going to do is implement a loop, so that for each i, that is, for each component of theta, you compute d theta approx i as a two-sided difference. So I‘ll take J of theta 1, theta 2, up to theta i, and we‘re going to nudge theta i, adding epsilon to just that component while keeping everything else the same. And because we‘re taking a two-sided difference, we do the same on the other side, with theta i minus epsilon and all the other elements of theta left alone. Then we take the difference and divide it by 2 epsilon. What we saw in the previous video is that this should be approximately equal to d theta i, which is supposed to be the partial derivative of J with respect to theta i, if d theta is the gradient of the cost function J. So you‘re going to compute this for every value of i, and at the end you end up with two vectors: d theta approx, which is the same dimension as d theta, and both of these are in turn the same dimension as theta. What you want to do is check whether these vectors are approximately equal to each other. So, in detail, how do you define whether two vectors are reasonably close to each other? What I do is the following: I compute the distance between these two vectors, d theta approx minus d theta, taking the L2 norm of this. Notice there‘s no square on top, so this is the square root of the sum of squares of the elements of the difference, the Euclidean distance. And then, to normalize by the lengths of these vectors, divide by the norm of d theta approx plus the norm of d theta, just the Euclidean lengths of these vectors. The role of the denominator is just in case any of these vectors are really small or really large; the denominator turns this formula into a ratio. To implement this in practice, I use epsilon equal to maybe 10 to the minus 7. With this value of epsilon, if you find that this formula gives you a value like 10 to the minus 7 or smaller, then that‘s great; it means your derivative approximation is very likely correct, since this is just a very small value. If it‘s maybe on the order of 10 to the minus 5, I would take a careful look; maybe this is okay, but I might double-check the components of this vector and make sure that none of them are too large. If some of the components of this difference are very large, then maybe you have a bug somewhere. And if this formula gives a value on the order of 10 to the minus 3, then I would be much more concerned that maybe there‘s a bug somewhere. You should really be getting values much smaller than 10 to the minus 3; if it‘s any bigger than that, I would be quite concerned, seriously worried that there might be a bug. You should then look at the individual components of theta to see if there‘s a specific value of i for which d theta approx i is very different from d theta i, and use that to try to track down whether or not some of your derivative computations might be incorrect. And if, after some amount of debugging, it finally ends up being this kind of very small value, then you probably have a correct implementation. So when implementing a neural network, what often happens is I‘ll implement forward prop and implement backprop.
And then I might find that this grad check gives a relatively big value. Then I‘ll suspect that there must be a bug, and go in and debug, debug, debug. And after debugging for a while, if I find that it passes grad check with a small value, then I can be much more confident that it‘s correct. So you now know how gradient checking works. This has helped me find lots of bugs in my implementations of neural nets, and I hope it‘ll help you too. In the next video, I want to share with you some tips, or some notes, on how to actually implement gradient checking. Let‘s go on to the next video.
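Here is a minimal sketch of the grad check loop described above, assuming the parameters have already been reshaped into a flat vector theta and that cost_fn(theta) and the backprop gradient d_theta are available; both of those names are assumptions for illustration.

```python
import numpy as np

def gradient_check(cost_fn, theta, d_theta, epsilon=1e-7):
    """Compare the two-sided numerical gradient with the backprop gradient d_theta."""
    d_theta_approx = np.zeros_like(theta)
    for i in range(theta.shape[0]):
        theta_plus = np.copy(theta)
        theta_minus = np.copy(theta)
        theta_plus[i] += epsilon                   # nudge only component i up
        theta_minus[i] -= epsilon                  # and down
        d_theta_approx[i] = (cost_fn(theta_plus) - cost_fn(theta_minus)) / (2 * epsilon)
    numerator = np.linalg.norm(d_theta_approx - d_theta)
    denominator = np.linalg.norm(d_theta_approx) + np.linalg.norm(d_theta)
    difference = numerator / denominator
    return difference   # ~1e-7: great; ~1e-5: take a careful look; >= 1e-3: probably a bug
```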
Gradient Checking Implementation Notes - 5m
0:00
In the last video you learned about gradient checking. In this video, I want to share with you some practical tips, or some notes, on how to actually go about implementing it for your neural network. First, don‘t use grad check in training, only to debug. What I mean is that computing d theta approx i for all the values of i is a very slow computation. So to implement gradient descent, you‘d just use backprop to compute d theta, and it‘s only when you‘re debugging that you would compute d theta approx to make sure it‘s close to d theta. Once you‘ve done that, you would turn off grad check and not run it during every iteration of gradient descent, because that‘s just much too slow. Second, if an algorithm fails grad check, look at the individual components and try to identify the bug. What I mean by that is, if d theta approx is very far from d theta, look at the different values of i to see which values of d theta approx are really very different from the values of d theta. For example, if you find that the components that are very far off all correspond to db[l] for some layer or layers, but the components for dW are quite close, then, remembering that different components of theta correspond to different components of b and W, you might find that the bug is in how you‘re computing db, the derivative with respect to the parameters b. And similarly, vice versa: if you find that the components of d theta approx that are very far from d theta all came from dW, or from dW in a certain layer, that might help you hone in on the location of the bug. This doesn‘t always let you identify the bug right away, but sometimes it helps give you some guesses about where to track it down.
1:56
Next, when doing grad check, remember your regularization term if you‘re using regularization. So if your cost function is J of theta equals 1 over m times the sum of your losses, plus the regularization term lambda over 2m times the sum over l of the squared norm of W[l], then that is the definition of J, and d theta should be the gradient of J with respect to theta, including this regularization term. So just remember to include that term. Next, grad check doesn‘t work with dropout, because on every iteration dropout randomly eliminates different subsets of the hidden units, so there isn‘t an easy-to-compute cost function J that dropout is doing gradient descent on. It turns out that dropout can be viewed as optimizing some cost function J, but it‘s a cost function defined by summing over all the exponentially many subsets of nodes that could be eliminated on any iteration. So that cost function J is very difficult to compute, and you‘re just sampling it every time you eliminate a different random subset of nodes when you use dropout. So it‘s difficult to use grad check to double-check your computation with dropout. What I usually do is implement grad check without dropout: if you want, you can set keep_prob in dropout equal to 1.0, do grad check, and then turn on dropout and hope that my implementation of dropout was correct.
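As a small sketch of the point about regularization: whatever cost function you hand to grad check has to include the L2 term, since backprop‘s gradients include its derivative. The helper compute_loss and the list of weight matrices Ws are assumptions for illustration, not course code.

```python
import numpy as np

def cost_with_regularization(AL, Y, Ws, lambd):
    """Cost used for grad check: cross-entropy plus the L2 regularization term."""
    m = Y.shape[1]
    cross_entropy = compute_loss(AL, Y)           # assumed helper: (1/m) * sum of the losses
    l2_term = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in Ws)
    return cross_entropy + l2_term
```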
3:30
There are some other things you could do, like fixing the pattern of nodes dropped and verifying that grad check for that pattern of [INAUDIBLE] is correct, but in practice I don‘t usually do that. So my recommendation is to turn off dropout, use grad check to double-check that your algorithm is at least correct without dropout, and then turn on dropout. Finally, and this is a subtlety: it rarely happens, but it‘s not impossible that your implementation of gradient descent is correct when w and b are close to 0, at random initialization, but that as you run gradient descent and w and b become bigger, your implementation of backprop becomes less accurate; maybe it is correct only when w and b are close to 0. So one thing you could do, though I don‘t do this very often, is run grad check at random initialization, then train the network for a while so that w and b have some time to wander away from 0, from your small random initial values, and then run grad check again after you‘ve trained for some number of iterations. So that‘s it for gradient checking. And congratulations on coming to the end of this week‘s materials. In this week, you‘ve learned how to set up your train, dev, and test sets, how to analyze bias and variance, and what to do if you have high bias versus high variance versus maybe high bias and high variance. You also saw how to apply different forms of regularization, like L2 regularization and dropout, to your neural network, some tricks for speeding up the training of your neural network, and finally, gradient checking. So I think you‘ve seen a lot this week, and you‘ll get to exercise a lot of these ideas in this week‘s programming exercise. Best of luck with that, and I look forward to seeing you in the week two materials.
(Optional) Heroes of Deep Learning - Yoshua Bengio interview - 25m
0:03
Hi, Yoshua, I‘m really glad you could join us here today. >> I‘m very glad, too. >> Today you‘re not just a researcher or engineer in deep learning. You‘ve become one of the institutions and one of the icons of deep learning, but I‘d really like to hear the story of how it started. So how did you end up getting into deep learning, and then pursuing this journey? >> Right, well, actually, it started when I was a kid, adolescent, reading a lot of science fiction, like, I guess, many of us. And when I started my graduate studies in 1985, I started reading neural net papers, and that‘s where I got all excited, and it became really a passion. >> And actually, what was that like in, what, mid 80s, right, 1985, reading these papers, do you remember? >> Yeah.
0:59
Well, coming from the courses I had taken in classical AI with expert systems, and suddenly discovering that there was all this world of thinking about how humans might be learning, and human intelligence, and how we might draw connections between that and artificial intelligence and computers. That was really exciting for me when I discovered this literature, and I started reading the connectionists, of course, so the papers from Geoff Hinton, [INAUDIBLE], and so on. And I worked on recurrent nets, I worked on speech recognition, I worked on HMMs, so graphical models. And then quickly, I moved to AT&T Bell Labs and MIT, where I did postdocs, and that‘s where I discovered some of the issues with the long-term dependencies in training neural nets. And then shortly after, I got recruited at UdeM back in Montreal, where I had spent most of my adolescent years.
2:08
So as someone who‘s been there for the last several decades and seen it all, certainly seen a lot of it, tell me a bit about how you‘re thinking about deep learning, about neural networks has evolved over this time?
2:22
We start with experiments, with intuitions, and theory sort of comes later. We now understand a lot better, for example, why backprop is working so well and why depth is so important. And for these kinds of notions, we didn‘t have any solid justification in those days. When we started working on deep nets in the early 2000s, we had the intuition that it made a lot of sense that a deeper network should be more powerful, but we didn‘t know how to take that and prove it, and of course, our experiments, initially, didn‘t work.
2:59
And actually, what were the most important things that you think turned out to be right? And what were the biggest surprises of what turned out to be wrong, compared to what we knew 30 years ago?
3:11
Sure, so one of the biggest mistakes I made was to think, like everyone else in the 90s, that you needed smooth nonlinearities in order for backprop to work. Because I thought that if we had something like rectifying nonlinearities, where you have a flat part, it would be really hard to train, because the derivative would be zero in so many places. And when we started experimenting with ReLU, with deep nets around 2010, I was obsessed with the idea that we should be careful about whether neurons would saturate too much on the zero part. But in the end, it turned out that the ReLU was working a lot better than the sigmoid and the tanh, and that was a big surprise. We explored this because of the biological connection, actually, not because we thought it would be easier to optimize, but it turned out to work better, whereas I thought it would be harder to train. >> So let me ask you, what is the relationship between deep learning and the brain? There‘s the obvious answer, but I‘m curious what‘s your answer to that? >> Well, the initial insight that really got me excited with neural nets was this idea from the connectionists that information is distributed across the activation of many neurons, rather than being represented by sort of the grandmother cell, as they were calling it, a symbolic representation. That was the traditional view in classical AI. And I still believe this is a really important thing, and I see people rediscovering the importance of that, even recently. So that was really a foundation. The depth thing is something that came later, in the early 2000s, but it wasn‘t something I was thinking about in the 90s, for example. >> Right, right, and I remember you built a lot of relatively shallow, but very distributed representations for the word embeddings, right, very early on. >> Right, that‘s right, yeah, that‘s one of the things that I got really excited about in the late 90s. Actually, my brother, Samy, and I worked on the idea that we could use neural nets to tackle the curse of dimensionality, which was believed to be one of the central issues with statistical learning. And the fact that we could have these distributed representations could be used to represent joint distributions over many random variables in a very efficient way. It turned out to work quite well, and then I extended this to joint distributions over sequences of words, and this is how the word embeddings were born. Because I thought, this will allow generalization across words that have similar semantic meaning and so on. >> So over the last couple decades, your research group has invented more ideas than anyone can summarize in a few minutes. So I‘m curious, what are the inventions or ideas you‘re most proud of from your group? >> Right, so I think I mentioned long-term dependencies, the study of that. I think people still don‘t understand it well enough. Then there‘s the story I mentioned about the curse of dimensionality, joint distributions with neural nets, which became, more recently, the work that Hugo Larochelle did. And then, as I said, that gave rise to all sorts of work on learning word embeddings for joint distributions of words. Then came, I think, probably the best-known parts of the work we did with deep learning, with stacks of autoencoders and stacks of RBMs.
7:09
One thing then was the work on better understanding the difficulties of training deep nets, with the initialization ideas, and also the vanishing gradient in deep nets. That work was actually the one which gave rise to the experiments showing the importance of piecewise linear activation functions. Then I would say some of the most important work regards the work we did with unsupervised learning: the denoising auto-encoders, the GANs, which are very popular these days, the generative adversarial networks, and the work we did with neural machine translation using attention, which turned out to be really important for making translation work and is currently used in industrial systems, like Google Translate. But this attention thing actually really changed my views on neural nets. We used to think of neural nets as machines that can map a vector to a vector. But really, with attention mechanisms, you can now handle any kind of data structure, and this is really opening up a lot of interesting avenues. In the direction of actually connecting to biology, one thing that I‘ve been working on in the last couple of years is how we could come up with something like backprop but that brains could implement. We have a few papers in that direction that seem to be interesting for the neuroscience people, and we‘re continuing in that direction, of course.
8:47
One of the topics that I know you‘ve been thinking a lot about is the relationship between deep learning and the brain, can you tell us a bit more about that? >> The biological thing is something I‘ve been thinking about for a while actually and having a lot of, I would say daydreaming about. Because I think of it like a puzzle. So we have these pieces of evidence from what we know from the brain and from learning in the brain like spike timing dependent plasticity. And on the other hand, we have all of these concepts from machine learning. The idea of globally training the whole system with respect to an objective function, and the idea of backprop. And what does backprop mean? Like, what does credit assignment really mean? When I started thinking about how brains could do something like backprop, it prompted me to think about, well, maybe there‘s some more general concepts behind backprop which make it so efficient which allow us to be efficient with backprop. And maybe there‘s a larger family of ways to do credit assignment, and that connects to questions that people in reinforcement learning have been asking. So it‘s interesting how sometimes asking a simple question leads you to thinking about so many different things, and forces you to think about so many elements that you like to bring together like a big puzzle. So this has gone for a number of years. And I need to say that this whole endeavor, like many of the ones that I have followed, has been highly inspired by Jeff Hinton‘s thoughts. So in particular, he gave this talk in 2007 I think, the first deep learning workshop on what he thought was the way that the brain is working.
10:52
How kind of temporal code could be used for potentially doing some of the job of backprop. And that led to a lot of the ideas that I‘ve explored in recent years with this.
11:07
Yeah, so it‘s kind of an interesting story that has been
11:13
running for a decade now, basically.
11:17
One of the topics I‘ve heard you speak about multiple times as well is unsupervised learning. Can you share your perspective on that?
11:26
Yes, yes, so unsupervised learning is really important. Right now, our industrial systems are based on supervised learning, which essentially requires humans to define what the important concepts are for the problem and to label those concepts in the data. And we build all these amazing toys and services and systems using this. But humans are able to do much more. They are able to explore and discover new concepts by observation and interaction with the world. A two year old is able to understand intuitive physics. In other words, she understands gravity, she understands pressure, she understands inertia. She understands liquid, solids. And of course, her parents never told her about any of this stuff, right? So how did she figure it out? So that‘s the kind of question that unsupervised learning is trying to answer. It‘s not just about we have labels or we don‘t have labels. It‘s about actually building a mental construction that explains how the world works by observation. And more recently, I‘ve been combining the ideas in unsupervised learning with the ideas in reinforcement learning. Because I believe that there is a very strong indication about the important underlying concepts that we‘re trying to disentangle, we‘re trying to separate from each other.
12:58
That a human or machine can get by interacting with the world, by exploring the world and trying things and trying to control things. So these are I think tightly coupled to the original ideas of unsupervised learning. So my take on unsupervised learning, 15 years ago when we started doing the the and the RBMs and so on was very focused on the idea of learning good representations. And I still think this is an essential question. But the thing we don‘t know is how and what is a good representation? How do we figure out an objective function, for example? So we‘ve tried many things over the years. And that‘s actually one of the cool things about unsupervised learning research, that there are so many different ideas, so different ways that this problem can be attacked. And that‘s just, maybe there‘s another one we‘ll discover next year that‘s completely different and maybe the brain is using something else completely different. So it‘s not incremental research, it‘s something that in itself is very exploratory.
14:07
We don‘t have a good definition of what‘s the right objective function to even measure that a system is doing a good job on unsupervised learning. So of course, it‘s challenging, but at the same time, it leaves open a wide field of possibilities, which is what researchers really love, at least that‘s something that appeals to me.
14:28
So today, there‘s so much going on in deep learning. And I think we‘ve passed the point where it‘s possible for any one human to read every single deep learning paper being published.
14:38
So I‘m curious, what in deep learning today excites you the most? >> So I‘m very ambitious, and I feel like the current state of the science of deep learning is far from where I‘d like to see it. And I have the impression that our systems right now make the kind of mistakes that suggest they have a very superficial understanding of the world.
15:06
So what excites me the most now is a sort of direction of research where we‘re not trying to build systems that are going to do something useful. We‘re just going back to principles about how a computer can observe the world, interact with the world, and discover how that world works, even if that world is simple, something that we can program as a kind of video game. We don‘t know how to do that well. And that‘s cool, because I don‘t have to compete with Google, and Facebook, and Baidu, and so on, right? Because this is a kind of basic research that can be done by anyone in their garage and could change the world. There are, of course, many directions to attack this, but I see a lot of the fruitful interactions between ideas in deep learning and reinforcement learning being really important there. And I‘m really excited that progress in this direction could have a huge impact on practical applications, actually. Because if you look at some of the big challenges that we have in applications, like how we deal with new domains, or categories on which we have too few examples, these are cases where humans are very good at solving those problems. So these transfer learning and generalization issues would become much easier to tackle if we had systems that had a better understanding of how the world works. A deeper understanding, right? What is actually going on? What are the causes of what I‘m seeing? And how could I influence what I‘m seeing by my actions? So these are the kinds of questions I‘m really excited about these days. I think this also connects the deep learning research that has evolved over the last couple of decades with even older questions in AI. Because a lot of the success in deep learning has been with perception. So what‘s left, right? What‘s left is sort of high-level cognition, which is about understanding at an abstract level how things work. Our program of understanding high-level abstractions, I think, has not reached those high levels of abstraction, so we have to get there. We have to think about reasoning, about sequential processing of information. We have to think about how causality works and how machines can discover all these things by themselves, potentially guided by humans, but as much as possible in an autonomous way. >> And it sounds like, from part of what you said, that you‘re a fan of research approaches where you experiment on, I‘m going to use the term toy problem, not in a disparaging way. >> Right. >> But on the small problem. And you‘re optimistic that that transfers to bigger problems later. >> Yes, yes, it transfers in a way. Of course, we‘re going to have to do some work to scale up and address those problems. But my main motivation for going for those toy problems is that we can understand better our failures, and we can reduce the problem to something we can intuitively sort of manipulate and understand more easily. So it‘s sort of a classical divide-and-conquer science approach. And also, I think, something people don‘t think about enough is that the research cycle can be much faster, right? So if I can do an experiment in a few hours, I can progress much faster. If I have to try out a huge model that tries to capture the whole of common sense and everything in general knowledge, which eventually we‘ll do, each experiment just takes too much time with current hardware. So while our hardware friends are building machines that are going to be a thousand or a million times faster, I‘m doing those toy experiments.
[LAUGH] >> You know, I‘ve also heard you speak about the science of deep learning, not just as an engineering discipline, but about doing more work to understand what‘s really going on. Do you want to share your thoughts on that? >> Yeah, absolutely. I fear that a lot of the work that we‘re doing is sort of like blind people trying to find their way. [LAUGH] And you can get a lot of luck and find interesting things that way. But really, we should stop a little bit and try to understand what we‘re doing in a way that‘s transferable, by going down to principles, to theory. And when I say theory, I don‘t necessarily mean math. Of course I like math and so on, but I don‘t think that we need everything to be formalized mathematically, but rather formalized logically, in the sense that I can convince somebody that this should work, that this makes sense. This is the most important aspect. And then math allows us to make that stronger and tighter. But really it‘s more about understanding. And it‘s also about doing our research not to be the next baseline, or benchmark, or to beat the other guys in the other lab or the other company. It‘s more about what kind of question should we ask that would allow us to understand better the phenomena of interest. What makes, for example, training deeper networks harder, or recurrent nets harder? We have some ideas, but a lot of things we don‘t understand yet.
20:49
So we can maybe design experiments whose goal is not to have a better algorithm, but just to understand better the algorithms we currently have or what circumstances make the particular algorithm work better and why. It‘s the why that really matters. That‘s what‘s science is about. It‘s why. >> Right. Today there are a lot of people that want to enter the field. And I‘m sure you‘ve answered this a lot in one-on-one settings, but with all the people watching this on video, what advice would you have for people that want to get into AI, get into deep learning? >> Right, so first of all, there are different motivations and different things you can do. What you need to become a deep learning researcher may not be the same as if you want to be an engineer who‘s going to use deep learning to build products. There‘s a different level of understanding that‘s needed in both cases. But in any case in both cases, practice. So to really master a subject like deep learning, of course you have to read a lot. You have to practice programming the things yourself.
21:58
Very often I interview students who have used software, and these days there‘s so much good software around that you can just plug and play and understand nothing of what you‘re doing, or understand it at such a superficial level that it becomes hard to figure out when it doesn‘t work and what‘s going wrong. So actually trying to implement things yourself, even if it‘s inefficient, just to make sure you really understand what is going on, is really useful, as is trying things yourself. >> So don‘t just use one of the programming frameworks where you can do everything in a few lines of code, but you don‘t really know what just happened. >> Exactly, exactly, and I would say even more than that. Trying to derive the thing yourself from first principles, if you can; that really helps. But yeah, the usual things you have to do, like reading, looking at other people‘s code, writing your own code, doing lots of experiments, making sure you understand everything you do. So especially for the science part of it, trying to ask: why am I doing this, why are people doing this? Maybe the answer is somewhere in a book and you have to read more.
23:11
But it‘s even better if you can actually figure it out by yourself.
23:15
Yeah, cool, yeah. And in fact, of the things to read, you and Ian Goodfellow and Aaron Courville wrote a highly regarded book. >> Thank you, thank you. Yes, it's selling a lot. It's a bit crazy. I feel like there are more people reading this book than people who can read it [LAUGH] right now. But yeah, also the proceedings of the ICLR conference are probably the best concentrated place of good papers. Of course there are really good papers at NIPS and ICML and other conferences. But if you really want to go through a lot of good papers, just read the last few ICLR proceedings, and that will give you a really good view of the field. >> Cool, yeah. Any other thoughts? When people ask you for advice, how does someone become good at deep learning? >> Well, it depends on where you come from. Don't be afraid of the math. Just develop the intuitions, and then the math becomes much easier to understand once you get the hang of what's going on at the intuitive level. And one piece of good news is that you don't need five years of PhD to become proficient at deep learning. You can actually learn pretty quickly. If you have a good background in computer science and math, you can learn enough to use it, build things, and start research experiments in just a few months. Something like six months for people with the right training. Maybe they don't know anything about machine learning, but if they're good at math and computer science, it can be very fast. And of course, that means you need to have the right training in math and computer science. Sometimes what you learn in just computer science courses is not enough. You need some continuous math, especially. So this is probability, algebra and optimization, for example. >> I see. And calculus. >> And calculus, yeah. >> Thanks a lot, Yoshua, for sharing all of the comments and insights and advice. Even though I've known you for a long time, there are many details of your early history that I didn't know until now, so thank you. >> Well, thank you, Andrew, for doing this special recording and what you're doing. I hope it's going to be used by a lot of people.
Optimization algorithms
Mini-batch gradient descent - 11m
0:00
Hello, and welcome back. In this week, you'll learn about optimization algorithms that will enable you to train your neural network much faster. You've heard me say before that applying machine learning is a highly empirical process, a highly iterative process, in which you just have to train a lot of models to find one that works really well. So it really helps to be able to train models quickly. One thing that makes this harder is that deep learning tends to work best in the regime of big data. We are able to train neural networks on huge data sets, and training on a large data set is just slow. So what you find is that having fast, good optimization algorithms can really speed up the efficiency of you and your team. So let's get started by talking about mini-batch gradient descent. You've learned previously that vectorization allows you to efficiently compute on all m examples; it allows you to process your whole training set without an explicit for loop. That's why we would take our training examples and stack them into this huge matrix, capital X: x(1), x(2), x(3), and so on up to x(m) for m training examples. And similarly for Y: y(1), y(2), y(3), and so on up to y(m). So the dimension of X was (n_x, m), and the dimension of Y was (1, m). Vectorization allows you to process all m examples relatively quickly, but if m is very large it can still be slow. For example, what if m were 5 million or 50 million or even bigger? With the implementation of gradient descent on your whole training set, what you have to do is process your entire training set before you take one little step of gradient descent. And then you have to process your entire training set of 5 million training examples again before you take another little step of gradient descent. So it turns out that you can get a faster algorithm if you let gradient descent start to make some progress even before you finish processing your entire, giant training set of 5 million examples. In particular, here's what you can do. Let's say that you split up your training set into smaller, little baby training sets, and these baby training sets are called mini-batches. And let's say each of your baby training sets has just 1,000 examples. So you take x(1) through x(1,000) and you call that your first little baby training set, also called a mini-batch. And then you take the next 1,000 examples, x(1,001) through x(2,000), and then the next 1,000 examples, and so on. I'm going to introduce a new notation: I'm going to call the first one X superscript with curly braces 1, and the second one X superscript with curly braces 2. Now, if you have 5 million training examples total and each of these little mini-batches has a thousand examples, that means you have 5,000 of these, because 5,000 times 1,000 equals 5 million. Altogether you would have 5,000 of these mini-batches, so it ends with X superscript curly braces 5,000. And then similarly you do the same thing for Y: you split up your training data for Y accordingly. So you call the first one Y{1}, then y(1,001) through y(2,000) is called Y{2}, and so on until you have Y{5,000}. Now, mini-batch number t is going to be comprised of X{t} and Y{t}, and that is a thousand training examples with the corresponding input-output pairs. Before moving on, just to make sure my notation is clear: we have previously used the superscript round brackets (i) to index into the training set, so x(i) is the i-th training example.
We use superscript square brackets [l] to index into the different layers of the neural network, so z[l] is the z value for layer l of the neural network. And here we are introducing the curly brackets {t} to index into different mini-batches. So you have X{t}, Y{t}. And to check your understanding: what are the dimensions of X{t} and Y{t}? Well, X is (n_x, m). So if X{1} is the x values for a thousand examples, then its dimension should be (n_x, 1,000), and X{2} should also be (n_x, 1,000), and so on. So all of the X{t} should have dimension (n_x, 1,000), and the Y{t} should have dimension (1, 1,000). To explain the name of this algorithm: batch gradient descent refers to the gradient descent algorithm we have been talking about previously, where you process your entire training set all at the same time. And the name comes from viewing that as processing your entire batch of training examples all at the same time. I know it's not a great name, but that's just what it's called. Mini-batch gradient descent, in contrast, refers to the algorithm which we'll talk about on the next slide, in which you process a single mini-batch X{t}, Y{t} at a time, rather than processing your entire training set X, Y at the same time. So let's see how mini-batch gradient descent works. To run mini-batch gradient descent on your training set, you run for t equals 1 to 5,000, because we had 5,000 mini-batches of 1,000 examples each. What you are going to do inside the for loop is basically implement one step of gradient descent using X{t}, Y{t}. It is as if you had a training set of size 1,000 examples, and you implement the algorithm you are already familiar with, but just on this little training set of size m equals 1,000. Rather than having an explicit for loop over all 1,000 examples, you use vectorization to process all 1,000 examples at the same time. Let's write this out. First, you implement forward prop on the inputs, so just on X{t}, and you do that by computing Z[1] = W[1] X{t} + b[1]. Previously we would just have X there, right? But now you are not processing the entire training set, you are just processing the first mini-batch, so it becomes X{t} when you're processing mini-batch t. Then you have A[1] = g[1](Z[1]), with a capital Z since this is a vectorized implementation, and so on, until you end up with A[L] = g[L](Z[L]), and then this is your prediction. And you notice that here you should use a vectorized implementation. It's just that this vectorized implementation processes 1,000 examples at a time rather than 5 million examples. Next you compute the cost function J, which I'm going to write as one over 1,000, since here 1,000 is the size of your little training set, times the sum from i equals 1 through 1,000 of the loss of y-hat(i), y(i), where this notation, for clarity, refers to examples from the mini-batch X{t}, Y{t}. And if you're using regularization, you can also have the regularization term: lambda over 2 times 1,000, times the sum over l of the Frobenius norm of the weight matrices, squared. Because this is really the cost on just one mini-batch, I'm going to index this cost as J with a superscript {t} in curly braces. You notice that everything we are doing is exactly the same as when we were previously implementing gradient descent, except that instead of doing it on X, Y, you're now doing it on X{t}, Y{t}.
Next, you implement backprop to compute gradients with respect to J{t}, still using only X{t}, Y{t}, and then you update the weights: W[l] gets updated as W[l] minus alpha dW[l], and similarly for b. This is one pass through your training set using mini-batch gradient descent. The code I have written down here is also called doing one epoch of training, and epoch is a word that means a single pass through the training set. Whereas with batch gradient descent, a single pass through the training set allows you to take only one gradient descent step, with mini-batch gradient descent a single pass through the training set, that is one epoch, allows you to take 5,000 gradient descent steps. Now of course you want to take multiple passes through the training set, which you usually do; you might want another for loop or a while loop around this. So you keep taking passes through the training set until hopefully you converge, or approximately converge. When you have a large training set, mini-batch gradient descent runs much faster than batch gradient descent, and that's pretty much what everyone in deep learning will use when training on a large data set. In the next video, let's delve deeper into mini-batch gradient descent so you can get a better understanding of what it is doing and why it works so well.
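To make the procedure above concrete, here is a minimal, runnable Python/NumPy sketch of mini-batch gradient descent. It uses logistic regression as a stand-in for the forward and backward propagation of a deep network; the synthetic data, the mini-batch size of 1,000, and the learning rate are illustrative choices, not values prescribed by the course.

import numpy as np

# Toy data standing in for a real training set: X is (n_x, m), Y is (1, m).
np.random.seed(0)
n_x, m = 10, 5000
X = np.random.randn(n_x, m)
Y = (np.random.rand(1, m) > 0.5) * 1.0

W = np.zeros((1, n_x))
b = 0.0
alpha, batch_size = 0.1, 1000
num_batches = m // batch_size          # 5 mini-batches of 1,000 examples each

for epoch in range(10):                # multiple passes (epochs) over the data
    for t in range(num_batches):
        # Mini-batch t: X{t}, Y{t}
        X_t = X[:, t * batch_size:(t + 1) * batch_size]
        Y_t = Y[:, t * batch_size:(t + 1) * batch_size]

        # Vectorized forward prop on just these 1,000 examples
        Z = W @ X_t + b
        A = 1.0 / (1.0 + np.exp(-Z))

        # Cost J{t}, computed on this mini-batch only (cross-entropy)
        cost_t = -np.mean(Y_t * np.log(A + 1e-8) + (1 - Y_t) * np.log(1 - A + 1e-8))

        # Backprop and one gradient descent step per mini-batch
        dZ = A - Y_t
        dW = dZ @ X_t.T / batch_size
        db = np.mean(dZ)
        W -= alpha * dW
        b -= alpha * db

With a real deep network, the forward prop, cost, and backprop lines would be replaced by your own layer-by-layer computations, but the slicing of X{t}, Y{t} and the per-mini-batch parameter update would stay the same.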
Understanding mini-batch gradient descent - 11m
0:00
In the previous video, you saw how you can use mini-batch gradient descent to start making progress and start taking gradient descent steps, even when you‘re just partway through processing your training set even for the first time. In this video, you learn more details of how to implement gradient descent and gain a better understanding of what it‘s doing and why it works. With batch gradient descent on every iteration you go through the entire training set and you‘d expect the cost to go down on every single iteration.
0:30
So if you plot the cost function J as a function of different iterations, it should decrease on every single iteration. And if it ever goes up, even on one iteration, then something is wrong. Maybe your learning rate is too big. With mini-batch gradient descent, though, if you plot progress on your cost function, then it may not decrease on every iteration. In particular, on every iteration you're processing some X{t}, Y{t}, and so if you plot the cost function J{t}, which is computed using just X{t}, Y{t}, then it's as if on every iteration you're training on a different training set, or really training on a different mini-batch. So if you plot the cost function J{t}, you're more likely to see something that looks like this. It should trend downwards, but it's also going to be a little bit noisier.
1:30
So if you plot J{t} as you're training mini-batch gradient descent, maybe over multiple epochs, you might expect to see a curve like this. So it's okay if it doesn't go down on every iteration. But it should trend downwards, and the reason it'll be a little bit noisy is that maybe X{1}, Y{1} happens to be a relatively easy mini-batch, so your cost might be a bit lower, but then maybe just by chance X{2}, Y{2} is a harder mini-batch. Maybe it has some mislabeled examples in it, in which case the cost will be a bit higher, and so on. So that's why you get these oscillations as you plot the cost when you're running mini-batch gradient descent. Now, one of the parameters you need to choose is the size of your mini-batch. So m was the training set size. On one extreme, if the mini-batch size,
2:26
= m, then you just end up with batch gradient descent.
2:36
All right, so in this extreme you would just have one mini-batch X{1}, Y{1}, and this mini-batch is equal to your entire training set. So setting the mini-batch size to m just gives you batch gradient descent. The other extreme would be if your mini-batch size were = 1.
2:59
This gives you an algorithm called stochastic gradient descent.
3:07
And here every example is its own mini-batch.
3:18
So what you do in this case is you look at the first mini-batch, X{1}, Y{1}, but when your mini-batch size is one, this is just your first training example, and you take a gradient descent step using just that first training example. And then you next take a look at your second mini-batch, which is just your second training example, and take your gradient descent step with that, and then you do it with the third training example, and so on, looking at just one single training example at a time.
3:50
So let's look at what these two extremes will do when optimizing this cost function. If these are the contours of the cost function you're trying to minimize, so your minimum is there, then batch gradient descent might start somewhere and be able to take relatively low-noise, relatively large steps, and you can just keep marching toward the minimum. In contrast, with stochastic gradient descent, if you start somewhere, let's pick a different starting point, then on every iteration you're taking a gradient descent step with just a single training example. So most of the time you head toward the global minimum, but sometimes you head in the wrong direction, if that one example happens to point you in a bad direction. So stochastic gradient descent can be extremely noisy. And on average, it'll take you in a good direction, but sometimes it'll head in the wrong direction as well. Also, stochastic gradient descent won't ever converge; it'll always just kind of oscillate and wander around the region of the minimum, but it won't ever just head to the minimum and stay there. In practice, the mini-batch size you use will be somewhere in between.
5:07
Somewhere between 1 and m; 1 and m are respectively too small and too large. And here's why. If you use batch gradient descent, so this is your mini-batch size equals m,
5:30
Then you're processing a huge training set on every iteration. So the main disadvantage of this is that it takes too long per iteration, assuming you have a very large training set. If you have a small training set then batch gradient descent is fine. If you go to the opposite extreme and use stochastic gradient descent,
5:54
Then it's nice that you get to make progress after processing just one example; that's actually not a problem. And the noisiness can be ameliorated, or reduced, by just using a smaller learning rate. But a huge disadvantage of stochastic gradient descent is that you lose almost all your speed-up from vectorization.
6:18
Because here you're processing a single training example at a time, the way you process each example is going to be very inefficient. So what works best in practice is something in between, where you have some,
6:36
Mini-batch size, not too big or too small.
6:44
And this gives you in practice the fastest learning.
6:51
And you notice that this has two good things going for it. One is that you do get a lot of vectorization. So in the example we used in the previous video, if your mini-batch size was 1,000 examples, then you might be able to vectorize across 1,000 examples, which is going to be much faster than processing the examples one at a time.
7:13
And second, you can also make progress,
7:22
Without needing to wait until you process the entire training set.
7:32
So again, using the numbers we had from the previous video, each epoch, each pass through your training set, allows you to take 5,000 gradient descent steps.
7:41
So in practice there'll be some in-between mini-batch size that works best. And so with mini-batch gradient descent we'll start here: maybe one iteration does this, two iterations, three, four. And it's not guaranteed to always head toward the minimum, but it tends to head more consistently in the direction of the minimum than stochastic gradient descent. And then it doesn't always exactly converge; it may oscillate in a very small region. If that's an issue, you can always reduce the learning rate slowly. We'll talk more about learning rate decay, or how to reduce the learning rate, in a later video. So if the mini-batch size should not be m and should not be 1, but should be something in between, how do you go about choosing it? Well, here are some guidelines. First, if you have a small training set, just use batch gradient descent.
8:36
If you have a small training set, there's no point using mini-batch gradient descent; you can process the whole training set quite fast, so you might as well use batch gradient descent. As for what a small training set means, I would say if it's less than maybe 2,000 examples, it'd be perfectly fine to just use batch gradient descent. Otherwise, if you have a bigger training set, typical mini-batch sizes would be,
9:03
Anything from 64 up to maybe 512 is quite typical. And because of the way computer memory is laid out and accessed, sometimes your code runs faster if your mini-batch size is a power of 2. All right, so 64 is 2 to the 6th, 128 is 2 to the 7th, 256 is 2 to the 8th, 512 is 2 to the 9th, so often I'll implement my mini-batch size to be a power of 2. I know that in a previous video I used a mini-batch size of 1,000; if you really wanted to do that I would recommend you just use 1,024 instead, which is 2 to the power of 10. And you do see mini-batch sizes of 1,024, though it is a bit more rare. This range of mini-batch sizes is a little bit more common. One last tip is to make sure that your mini-batch,
9:57
All of your X{t}, Y{t}, actually fits in CPU/GPU memory.
10:08
And this really depends on your application and how large a single training example is. But if you ever process a mini-batch that doesn't actually fit in CPU or GPU memory, whichever processor you're using to process the data, then you'll find that the performance suddenly falls off a cliff and is suddenly much worse. So I hope this gives you a sense of the typical range of mini-batch sizes that people use. In practice, of course, the mini-batch size is another hyperparameter that you might do a quick search over to try to figure out which one is most efficient at reducing the cost function J. So what I would do is just try several different values, try a few different powers of two, and then see if you can pick one that makes your gradient descent optimization algorithm as efficient as possible. But hopefully this gives you a set of guidelines for how to get started with that hyperparameter search. You now know how to implement mini-batch gradient descent and make your algorithm run much faster, especially when you're training on a large training set. But it turns out there are even more efficient algorithms than gradient descent or mini-batch gradient descent. Let's start talking about them in the next few videos.
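As a rough illustration of these guidelines, here is a small helper, sketched in Python/NumPy under the assumption that X has shape (n_x, m) and Y has shape (1, m), that splits the training set into mini-batches whose size is a power of 2 (64 by default). The random shuffle before slicing is a common refinement rather than something stated in this video.

import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    # Shuffle the training set, then slice it into mini-batches of size batch_size.
    # Returns a list of (X{t}, Y{t}) pairs; the last mini-batch may be smaller
    # if m is not an exact multiple of batch_size.
    np.random.seed(seed)
    m = X.shape[1]
    perm = np.random.permutation(m)          # shuffle examples before slicing
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]

    mini_batches = []
    num_full = m // batch_size
    for t in range(num_full):
        mini_batches.append((X_shuf[:, t * batch_size:(t + 1) * batch_size],
                             Y_shuf[:, t * batch_size:(t + 1) * batch_size]))
    if m % batch_size != 0:                  # leftover examples form a final, smaller batch
        mini_batches.append((X_shuf[:, num_full * batch_size:],
                             Y_shuf[:, num_full * batch_size:]))
    return mini_batches

You would then loop over the returned list inside each epoch and take one gradient descent step per mini-batch, as in the previous video.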
Exponentially weighted averages - 5m
0:00
I want to show you a few optimization algorithms that are faster than gradient descent. In order to understand those algorithms, you need to be able to use something called exponentially weighted averages, also called exponentially weighted moving averages in statistics. Let's first talk about that, and then we'll use this to build up to more sophisticated optimization algorithms. So, even though I now live in the United States, I was born in London. So for this example I got the daily temperature from London from last year. So on January 1, the temperature was 40 degrees Fahrenheit. Now, I know most of the world uses the Celsius system, but I live in the United States, which uses Fahrenheit. So that's four degrees Celsius. And on January 2, it was nine degrees Celsius, and so on. And then about halfway through the year, a year has 365 days, so day number 180 or so would be sometime in late May, I guess. It was 60 degrees Fahrenheit, which is 15 degrees Celsius, and so on. So it starts to get warmer towards summer, and it was colder in January. So if you plot the data, you end up with this, where day one is sometime in January, this here is the middle of the year approaching summer, and that's the end of the year, kind of late December. So this data looks a little bit noisy, and if you want to compute the trends, the local average or a moving average of the temperature, here's what you can do. Let's initialize V0 equals zero. And then, on every day, we're going to take a weighted average, with a weight of 0.9 on the previous value plus 0.1 times that day's temperature. So V1 equals 0.9 times V0 plus 0.1 times theta 1, where theta 1 is the temperature from the first day. And on the second day, we're again going to take a weighted average: 0.9 times the previous value plus 0.1 times today's temperature, so V2 equals 0.9 times V1 plus 0.1 times theta 2, and then V3 equals 0.9 times V2 plus 0.1 times theta 3, and so on. And the more general formula is: V on a given day is 0.9 times V from the previous day, plus 0.1 times the temperature of that day. So if you compute this and plot it in red, this is what you get. You get a moving average, what's called an exponentially weighted average, of the daily temperature. So let's look at the equation we had from the previous slide: it was VT equals, where previously we had 0.9, we'll now turn that into a parameter beta, so beta times VT minus 1, plus, where previously it was 0.1, I'm going to turn that into 1 minus beta, times theta T. So previously you had beta equals 0.9. It turns out that, for reasons we'll go into later, when you compute this you can think of VT as approximately averaging over something like 1 over (1 minus beta) days' temperature. So, for example, when beta equals 0.9 you can think of this as averaging over the last 10 days' temperature. And that was the red line. Now, let's try something else. Let's set beta to be very close to one, let's say 0.98. Then, if you look at 1 over (1 minus 0.98), this is equal to 50. So you can think of this as averaging over roughly the last 50 days' temperature. And if you plot that you get this green line. So notice a couple of things with this very high value of beta. The plot you get is much smoother, because you're now averaging over more days of temperature.
So the curve is just, you know, less wavy, it's now smoother, but on the flip side the curve has now shifted further to the right, because you're now averaging over a much larger window of temperatures. And by averaging over a larger window, this exponentially weighted average formula adapts more slowly when the temperature changes. So there's just a bit more latency. And the reason for that is, when beta is 0.98, it's giving a lot of weight to the previous value and a much smaller weight, just 0.02, to whatever you're seeing right now. So when the temperature changes, when the temperature goes up or down, this exponentially weighted average just adapts more slowly when beta is so large. Now, let's try another value. If you set beta to the other extreme, let's say 0.5, then by the formula we have on the right, this is something like averaging over just two days' temperature, and if you plot that you get this yellow line. By averaging only over two days' temperature, it's as if you're averaging over a much shorter window, so you're much more noisy, much more susceptible to outliers, but it adapts much more quickly to temperature changes. So this formula is how you implement an exponentially weighted average. Again, it's called an exponentially weighted moving average in the statistics literature; we're going to call it an exponentially weighted average for short. And by varying this parameter, or as we'll see later, this hyperparameter of your learning algorithm, you can get slightly different effects, and there will usually be some value in between that works best, that gives you the red curve, which maybe looks like a better average of the temperature than either the green or the yellow curve. You now know the basics of how to compute exponentially weighted averages. In the next video, let's get a bit more intuition about what it's doing.
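Here is a short Python/NumPy sketch of the update just described, V = beta * V + (1 - beta) * theta_t, applied to a made-up year of daily temperatures (the real London data isn't reproduced here); the three beta values correspond to the red, green, and yellow curves in the video.

import numpy as np

# Synthetic stand-in for a year of daily temperatures (noisy seasonal pattern).
np.random.seed(1)
theta = 10 + 10 * np.sin(np.linspace(0, 2 * np.pi, 365)) + np.random.randn(365)

def exponentially_weighted_average(theta, beta):
    v = 0.0                                   # V0 = 0
    history = []
    for theta_t in theta:
        v = beta * v + (1 - beta) * theta_t   # V_t = beta*V_{t-1} + (1-beta)*theta_t
        history.append(v)
    return np.array(history)

v_red    = exponentially_weighted_average(theta, 0.9)   # roughly a 10-day average (red curve)
v_green  = exponentially_weighted_average(theta, 0.98)  # roughly a 50-day average: smoother, but lags
v_yellow = exponentially_weighted_average(theta, 0.5)   # roughly a 2-day average: noisy, adapts fast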
Understanding exponentially weighted averages - 9m
0:00
In the last video, we talked about exponentially weighted averages. This will turn out to be a key component of several optimization algorithms that you'll use to train your neural networks. So in this video, I want to delve a little bit deeper into intuitions for what this algorithm is really doing. Recall that this is the key equation for implementing exponentially weighted averages. And so, if beta equals 0.9 you get the red line; if it's much closer to one, say 0.98, you get the green line; and if it's much smaller, maybe 0.5, you get the yellow line. Let's look a bit more into how this is computing averages of the daily temperature. So here's that equation again. Let's set beta equals 0.9 and write out a few equations that this corresponds to. So whereas, when you're implementing it, you have t going from zero to one, to two, to three, increasing values of t, to analyze it I've written it with decreasing values of t. And this goes on. So let's take this first equation here and understand what V100 really is. V100 is going to be, let me reverse these two terms, 0.1 times theta 100, plus 0.9 times whatever the value was on the previous day. Now, what is V99? Well, we'll just plug it in from this equation. So this is just going to be 0.1 times theta 99, and again I've reversed these two terms, plus 0.9 times V98. But then what is V98? Well, you just get that from here. So you can just plug in 0.1 times theta 98, plus 0.9 times V97, and so on. And if you multiply all of these terms out, you can show that V100 is 0.1 times theta 100 plus, now let's look at the coefficient on theta 99, it's going to be 0.1 times 0.9, times theta 99. Now let's look at the coefficient on theta 98: there's a 0.1 here times 0.9, times 0.9. So if we expand out the algebra, this becomes 0.1 times 0.9 squared, times theta 98. And if you keep expanding this out, you find that this becomes 0.1 times 0.9 cubed, times theta 97, plus 0.1 times 0.9 to the fourth, times theta 96, plus dot dot dot. So this is really a weighted sum, a weighted average of theta 100, which is the current day's temperature, from the perspective of V100, which you calculate on the 100th day of the year: a weighted sum of theta 100, theta 99, theta 98, theta 97, theta 96, and so on. So one way to draw this in pictures: let's say we plot some number of days of temperature, so this axis is theta and this axis is t. So theta 100 will be some value, then theta 99 will be some value, theta 98, and so on; this is t equals 100, 99, 98, and so on, for some number of days of temperature. And what we have then is an exponentially decaying function: starting from 0.1, then 0.9 times 0.1, then 0.9 squared times 0.1, and so on. So you have this exponentially decaying function. And the way you compute V100 is you take the element-wise product between these two functions and sum it up. So you take this value, theta 100, times 0.1, plus this value, theta 99, times 0.1 times 0.9, that's the second term, and so on. So it's really taking the daily temperature, multiplying it by this exponentially decaying function, and then summing it up. And this becomes your V100. It turns out that, up to details we'll get to later, all of these coefficients add up to one, or very close to one, up to a detail called bias correction which we'll talk about in the next video. But because of that, this really is an exponentially weighted average.
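Written out as one formula, the expansion being described is (for beta = 0.9):

V_{100} = 0.1\,\theta_{100} + 0.1(0.9)\,\theta_{99} + 0.1(0.9)^2\,\theta_{98} + 0.1(0.9)^3\,\theta_{97} + \cdots

and, more generally, starting from V_0 = 0,

V_t = (1 - \beta)\sum_{k=0}^{t-1} \beta^{k}\,\theta_{t-k}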
And finally, you might wonder: how many days' temperature is this averaging over? Well, it turns out that 0.9 to the power of 10 is about 0.35, and this turns out to be about 1 over e, e being the base of natural logarithms. And more generally, if you have 1 minus epsilon (so in this example, epsilon would be 0.1, since this was 0.9), then (1 minus epsilon) to the power of (1 over epsilon) is about 1 over e, about 0.34, 0.35. And so, in other words, it takes about 10 days for the height of this to decay to around 1/3, or 1 over e, of the peak. So it's because of this that when beta equals 0.9, we say that this is as if you're computing an exponentially weighted average that focuses on just the last 10 days' temperature, because it's after 10 days that the weight decays to less than about a third of the weight of the current day. Whereas, in contrast, if beta were equal to 0.98, then what power do you need to raise 0.98 to in order for it to become really small? It turns out that 0.98 to the power of 50 is approximately equal to 1 over e. So the weight will be fairly big, bigger than 1 over e, for roughly the first 50 days, and then it decays quite rapidly after that. So intuitively, and this isn't a hard and fast thing, you can think of this as averaging over about 50 days' temperature, because in this example, to use the notation here on the left, it's as if epsilon is equal to 0.02, so 1 over epsilon is 50. And this, by the way, is how we get the rule of thumb that we're averaging over roughly 1 over (1 minus beta) days, with epsilon here playing the role of 1 minus beta. It tells you, up to some constant, roughly how many days' temperature you should think of this as averaging over. But this is just a rule of thumb for how to think about it, not a formal mathematical statement. Finally, let's talk about how you actually implement this. Recall that we start with V0 initialized as zero, then compute V1 on the first day, V2, and so on. Now, to explain the algorithm, it was useful to write down V0, V1, V2, and so on as distinct variables. But if you're implementing this in practice, this is what you do: you initialize V to be equal to zero, and then on day one you would set V equals beta times V plus (1 minus beta) times theta 1. And then on the next day, you would update V to be beta times V plus (1 minus beta) times theta 2, and so on. And sometimes people use the notation V subscript theta to denote that V is computing this exponentially weighted average of the parameter theta. So just to say this again, but in a new format: you set V theta equals zero, and then repeatedly, for each day, you get the next theta T, and then V theta gets updated as beta times the old value of V theta, plus (1 minus beta) times the current value of theta T. So one of the advantages of this exponentially weighted average formula is that it takes very little memory. You just need to keep one real number in computer memory, and you keep on overwriting it with this formula based on the latest values that you get. And it's really for this reason, the efficiency, that it's used: it takes basically just one line of code, and storage and memory for a single real number, to compute this exponentially weighted average. It's really not the best way, not the most accurate way, to compute an average. If you were to compute a moving window, where you explicitly sum over the last 10 days' or the last 50 days' temperature and just divide by 10 or by 50, that usually gives you a better estimate.
But the disadvantage of that, of explicitly keeping all the temperatures around and summing over the last 10 days, is that it requires more memory, it's more complicated to implement, and it's computationally more expensive. So for the applications we'll see in the next few videos, where you need to compute averages of a lot of variables, this is a very efficient way to do so, both from a computation and a memory-efficiency point of view, which is why it's used in a lot of machine learning. Not to mention that it's just one line of code, which is maybe another advantage. So now you know how to implement exponentially weighted averages. There's one more technical detail that's worth knowing about, called bias correction. Let's see that in the next video, and then after that you will use this to build a better optimization algorithm than straightforward gradient descent.
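As a quick numerical check of the 1 over e rule of thumb from earlier in this video:

(1 - \epsilon)^{1/\epsilon} \approx 1/e \approx 0.368, \qquad 0.9^{10} \approx 0.349, \qquad 0.98^{50} \approx 0.364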
Bias correction in exponentially weighted averages - 4m
0:00
You've learned how to implement exponentially weighted averages. There's one technical detail called bias correction that can make your computation of these averages more accurate. Let's see how that works. In a previous video, you saw this figure for beta = 0.9 and this figure for beta = 0.98. But it turns out that if you implement the formula as written here, you won't actually get the green curve when, say, beta = 0.98; you actually get the purple curve here. And you notice that the purple curve starts off really low. So let's see how to fix that. When you're implementing a moving average, you initialize it with V0 = 0, and then V1 = 0.98 V0 + 0.02 theta 1. But V0 is equal to 0, so that term just goes away, and V1 is just 0.02 times theta 1. So that's why, if the first day's temperature is, say, 40 degrees Fahrenheit, then V1 will be 0.02 times 40, which is 0.8. So you get a much lower value down here, and it's not a very good estimate of the first day's temperature. V2 will be 0.98 times V1 + 0.02 times theta 2. And if you plug in V1, which is this down here, and multiply it out, then you find that V2 is actually equal to 0.98 times 0.02 times theta 1, plus 0.02 times theta 2, and that's 0.0196 theta 1 + 0.02 theta 2. So again, assuming theta 1 and theta 2 are positive numbers, when you compute this, V2 will be much less than theta 1 or theta 2, so V2 isn't a very good estimate of the first two days' temperature of the year. So it turns out that there is a way to modify this estimate that makes it much better, that makes it more accurate, especially during this initial phase of your estimate. Which is that, instead of taking Vt, you take Vt divided by (1 minus beta to the power of t), where t is the current day you're on. So let's take a concrete example. When t = 2, 1 minus beta to the power of t is 1 minus 0.98 squared, and it turns out that this is 0.0396. And so your estimate of the temperature on day 2 becomes V2 divided by 0.0396, which is (0.0196 theta 1 + 0.02 theta 2) divided by 0.0396. You notice that these two coefficients add up to the denominator, 0.0396. And so this becomes a weighted average of theta 1 and theta 2, and this removes the bias. So you notice that as t becomes large, beta to the t will
3:11
approach 0, which is why when t is large enough, the bias correction makes almost no difference. This is why when t is large, the purple line and the green line pretty much overlap. But during this initial phase of learning, when you're still warming up your estimates, the bias correction can help you obtain a better estimate of the temperature. And it is this bias correction that helps you go from the purple line to the green line. So in machine learning, for most implementations of the exponentially weighted average, people don't often bother to implement bias correction, because most people would rather just wait out that initial period and have a slightly more biased estimate and go from there. But if you are concerned about the bias during this initial phase, while your exponentially weighted moving average is still warming up, then bias correction can help you get a better estimate early on. So you now know how to implement exponentially weighted moving averages. Let's go on and use this to build some better optimization algorithms.
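Here is a tiny Python sketch of bias correction with beta = 0.98; the four early-January temperatures are made up purely for illustration.

import numpy as np

beta = 0.98
theta = np.array([40.0, 42.0, 45.0, 41.0])   # made-up early-January temperatures (Fahrenheit)

v = 0.0
for t, theta_t in enumerate(theta, start=1):
    v = beta * v + (1 - beta) * theta_t
    v_corrected = v / (1 - beta ** t)        # divide by (1 - beta^t) to remove the startup bias
    print(t, round(v, 4), round(v_corrected, 4))

On day 1 the corrected value is exactly theta 1 (40), and on day 2 it is about 41.0, the weighted average worked out above, whereas the uncorrected values 0.8 and 1.624 are far too low.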
Gradient descent with momentum - 9m
0:00
There's an algorithm called momentum, or gradient descent with momentum, that almost always works faster than the standard gradient descent algorithm. In one sentence, the basic idea is to compute an exponentially weighted average of your gradients, and then use that gradient to update your weights instead. In this video, let's unpack that one-sentence description and see how you can actually implement this. As an example, let's say that you're trying to optimize a cost function which has contours like this, so the red dot denotes the position of the minimum. Maybe you start gradient descent here, and if you take one iteration of gradient descent, maybe you end up heading there. But now you're on the other side of this ellipse, and if you take another step of gradient descent, maybe you end up doing that. And then another step, another step, and so on. And you see that gradient descent will sort of take a lot of steps, right? Just slowly oscillating toward the minimum. And these up-and-down oscillations slow down gradient descent and prevent you from using a much larger learning rate. In particular, if you were to use a much larger learning rate, you might end up overshooting and diverging like so. And so the need to prevent the oscillations from getting too big forces you to use a learning rate that's not too large. Another way of viewing this problem is that on the vertical axis you want your learning to be a bit slower, because you don't want those oscillations. But on the horizontal axis, you want faster learning.
1:45
Right, because you want it to aggressively move from left to right, toward that minimum, toward that red dot. So here‘s what you can do if you implement gradient descent with momentum.
1:58
On each iteration, or more specifically, during iteration t, you would compute the usual derivatives dW, db. I'll omit the superscript square bracket [l]'s, but you compute dW, db on the current mini-batch. And if you're using batch gradient descent, then the current mini-batch would be just your whole batch; this works as well with batch gradient descent, so if your current mini-batch is your entire training set, this works fine as well. And then what you do is you compute VdW to be beta VdW plus (1 minus beta) dW. So this is similar to when we were previously computing V theta equals beta V theta plus (1 minus beta) theta t.
2:57
Right, so it's computing a moving average of the derivatives for W that you're getting. And then you similarly compute Vdb equals beta Vdb plus (1 minus beta) times db. And then you update your weights: W gets updated as W minus the learning rate times, instead of dW, the derivative, you update it with VdW. And similarly, b gets updated as b minus alpha times Vdb. So what this does is smooth out the steps of gradient descent.
3:41
For example, let's say that the last few derivatives you computed were this, this, this, this, this.
3:48
If you average out these gradients, you find that the oscillations in the vertical direction will tend to average out to something closer to zero. So, in the vertical direction, where you want to slow things down, this will average out positive and negative numbers, so the average will be close to zero. Whereas in the horizontal direction, all the derivatives are pointing to the right, so the average in the horizontal direction will still be pretty big. So that's why with this algorithm, after a few iterations, you find that gradient descent with momentum ends up taking steps with much smaller oscillations in the vertical direction, but that are more directed at moving quickly in the horizontal direction. And so this allows your algorithm to take a more straightforward path, or to damp out the oscillations in its path to the minimum. One intuition for this momentum, which works for some people but not everyone, is that if you're trying to minimize this bowl-shaped function, right? These are really the contours of a bowl, I guess I'm not very good at drawing. If you're trying to minimize this type of bowl-shaped function, then these derivative terms you can think of as providing acceleration to a ball that you're rolling downhill, and these momentum terms you can think of as representing the velocity.
5:20
And so imagine that you have a bowl, and you take a ball, and the derivative imparts acceleration to this little ball as it's rolling down this hill, right? And so it rolls faster and faster because of the acceleration. And beta, because this number is a little bit less than one, plays the role of friction, and it prevents your ball from speeding up without limit. So rather than gradient descent taking every single step independently of all previous steps, now your little ball can roll downhill and gain momentum; it can accelerate down this bowl and therefore gain momentum. I find that this ball-rolling-down-a-bowl analogy seems to work for some people who enjoy physics intuitions, but it doesn't work for everyone, so if this analogy of a ball rolling down a bowl doesn't work for you, don't worry about it. Finally, let's look at some details of how you implement this. Here's the algorithm, and so you now have two
6:22
hyperparameters: the learning rate alpha, as well as this parameter beta, which controls your exponentially weighted average. The most common value for beta is 0.9. We were averaging over the last ten days' temperature, so this is averaging over the last ten iterations' gradients. And in practice, beta equals 0.9 works very well. Feel free to try different values and do some hyperparameter search, but 0.9 appears to be a pretty robust value. Well, and how about bias correction? So do you want to take VdW and Vdb and divide them by 1 minus beta to the t? In practice, people don't usually do this, because after just ten iterations your moving average will have warmed up and is no longer a biased estimate. So in practice, I don't really see people bothering with bias correction when implementing gradient descent with momentum. And of course, this process initializes VdW to 0. Note that this is a matrix of zeros with the same dimension as dW, which has the same dimension as W. And Vdb is also initialized to a vector of zeros, with the same dimension as db, which in turn has the same dimension as b. Finally, I just want to mention that if you read the literature on gradient descent with momentum, you often see it with the 1 minus beta term omitted. So you end up with VdW equals beta VdW plus dW. And the net effect of using this version in purple is that VdW ends up being scaled by a factor of 1 over (1 minus beta). And so when you're performing these gradient descent updates, alpha just needs to change by a corresponding factor of 1 over (1 minus beta). In practice, both of these will work just fine; it just affects what's the best value of the learning rate alpha. But I find that this particular formulation is a little less intuitive, because one impact of it is that if you end up tuning the hyperparameter beta, then this affects the scaling of VdW and Vdb as well, and so you may end up needing to retune the learning rate alpha as well. So I personally prefer the formulation that I have written here on the left, with the 1 minus beta term. But both versions, with beta equal to 0.9, are a common choice of hyperparameter; it's just that alpha, the learning rate, would need to be tuned differently for these two different versions. So that's it for gradient descent with momentum. This will almost always work better than the straightforward gradient descent algorithm without momentum. But there are still other things we could do to speed up your learning algorithm. Let's continue talking about these in the next couple of videos.
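The update just described can be sketched in Python/NumPy as follows; the dictionary layout (parameters keyed "W1", "b1", ..., gradients keyed "dW1", "db1", ...) is an assumed convention for illustration, and the version shown keeps the 1 minus beta term from the formulation preferred above.

import numpy as np

def initialize_velocity(params):
    # One zero array of the same shape as each parameter: v["dW1"], v["db1"], ...
    return {"d" + key: np.zeros_like(val) for key, val in params.items()}

def update_with_momentum(params, grads, v, alpha=0.01, beta=0.9):
    # One momentum update; grads holds dW1, db1, ... computed on the current mini-batch.
    for key in params:                       # key is "W1", "b1", "W2", ...
        # Moving average of the gradients
        v["d" + key] = beta * v["d" + key] + (1 - beta) * grads["d" + key]
        # Update the parameter with the velocity instead of the raw gradient
        params[key] = params[key] - alpha * v["d" + key]
    return params, v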
RMSprop - 7m
0:00
You've seen how using momentum can speed up gradient descent. There's another algorithm called RMSprop, which stands for root mean square prop, that can also speed up gradient descent. Let's see how it works. Recall our example from before: if you implement gradient descent, you can end up with huge oscillations in the vertical direction, even while it's trying to make progress in the horizontal direction. In order to provide intuition for this example, let's say that the vertical axis is the parameter b and the horizontal axis is the parameter w. It could really be w1 and w2 or some other parameters, but I'll call them b and w for the sake of intuition. And so, you want to slow down the learning in the b direction, or the vertical direction, and speed up learning, or at least not slow it down, in the horizontal direction. So this is what the RMSprop algorithm does to accomplish this. On iteration t, it will compute as usual the derivatives dW, db on the current mini-batch.
1:15
So I'm again going to keep an exponentially weighted average, but instead of VdW, I'm going to use the new notation SdW. So SdW is equal to beta times its previous value plus (1 minus beta) times dW squared. Sometimes [INAUDIBLE] as dW squared. So for clarity, this squaring operation is an element-wise squaring operation. So what this is doing is really keeping an exponentially weighted average of the squares of the derivatives. And similarly, we also have Sdb equals beta Sdb plus (1 minus beta) db squared. And again, the squaring is an element-wise operation. Next, RMSprop then updates the parameters as follows. W gets updated as W minus the learning rate times, whereas previously we had alpha times dW, now it's dW divided by the square root of SdW. And b gets updated as b minus the learning rate times db, which instead of just being the gradient is now divided by the square root of Sdb.
2:39
So let's gain some intuition about how this works. Recall that in the horizontal direction, or in this example in the w direction, we want learning to go pretty fast, whereas in the vertical direction, or in this example in the b direction, we want to slow down all the oscillations. So with these terms SdW and Sdb, what we're hoping is that SdW will be relatively small, so that here we're dividing by a relatively small number, whereas Sdb will be relatively large, so that here we're dividing by a relatively large number, in order to slow down the updates in the vertical dimension. And indeed, if you look at the derivatives, these derivatives are much larger in the vertical direction than in the horizontal direction. So the slope is very large in the b direction, right? So with derivatives like this, this is a very large db and a relatively small dW, because the function is sloped much more steeply in the vertical direction, the b direction, than in the w direction, the horizontal direction. And so db squared will be relatively large, so Sdb will be relatively large, whereas compared to that dW will be smaller, or dW squared will be smaller, and so SdW will be smaller. So the net effect of this is that your updates in the vertical direction are divided by a much larger number, and that helps damp out the oscillations, whereas the updates in the horizontal direction are divided by a smaller number. So the net impact of using RMSprop is that your updates will end up looking more like this.
4:22
That is, your updates are damped in the vertical direction, while in the horizontal direction you can keep going. And one effect of this is also that you can therefore use a larger learning rate alpha and get faster learning without diverging in the vertical direction. Now, just for the sake of clarity, I've been calling the vertical and horizontal directions b and w just to illustrate this. In practice, you're in a very high-dimensional space of parameters, so maybe the vertical dimensions where you're trying to damp the oscillations are some set of parameters, w1, w2, w17, and the horizontal dimensions might be w3, w4, and so on, right? So this separation into b and w is just for illustration. In practice, dW is a very high-dimensional parameter vector, and db is also a very high-dimensional parameter vector. But your intuition is that in the dimensions where you're getting these oscillations, you end up computing a larger sum, a larger weighted average of these squared derivatives, and so you end up damping out the directions in which there are these oscillations. So that's RMSprop, and it stands for root mean square prop, because here you're squaring the derivatives, and then you take the square root at the end. So finally, just a couple of last details on this algorithm before we move on.
5:49
In the next video, we're actually going to combine RMSprop together with momentum. So rather than using the hyperparameter beta, which we had used for momentum, I'm going to call this hyperparameter beta 2, just so we don't clash by using the same hyperparameter for both momentum and RMSprop. And also, to make sure that your algorithm doesn't divide by 0: what if the square root of SdW is very close to 0? Then things could blow up. Just to ensure numerical stability, when you implement this in practice you add a very, very small epsilon to the denominator. It doesn't really matter what epsilon is used; 10 to the -8 would be a reasonable default, but this just ensures slightly greater numerical stability, so that for numerical round-off or whatever reason, you don't end up dividing by a very, very small number. So that's RMSprop, and similar to momentum, it has the effect of damping out the oscillations in gradient descent, in mini-batch gradient descent, and allowing you to maybe use a larger learning rate alpha, and certainly speeding up the learning speed of your algorithm. So now you know how to implement RMSprop, and this will be another way for you to speed up your learning algorithm. One fun fact about RMSprop: it was actually first proposed not in an academic research paper, but in a Coursera course that Geoff Hinton had taught on Coursera many years ago. I guess Coursera wasn't intended to be a platform for dissemination of novel academic research, but it worked out pretty well in that case. And it was really from that Coursera course that RMSprop started to become widely known and really took off. We talked about momentum. We talked about RMSprop. It turns out that if you put them together, you can get an even better optimization algorithm. Let's talk about that in the next video.
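Here is a minimal Python/NumPy sketch of the RMSprop update described above, with the epsilon term added to the denominator for numerical stability; the dictionary naming convention and the default beta 2 = 0.999 are assumptions for illustration, not values fixed by this video.

import numpy as np

def update_with_rmsprop(params, grads, s, alpha=0.01, beta2=0.999, epsilon=1e-8):
    # One RMSprop update; the squares and divisions are element-wise, as in the video.
    for key in params:                                   # key is "W1", "b1", ...
        g = grads["d" + key]
        # Exponentially weighted average of the squared derivatives
        s["d" + key] = beta2 * s["d" + key] + (1 - beta2) * (g ** 2)
        # Divide the update by sqrt(S) + epsilon to damp the oscillating directions
        params[key] = params[key] - alpha * g / (np.sqrt(s["d" + key]) + epsilon)
    return params, s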
Adam optimization algorithm - 7m
0:00
During the history of deep learning, many researchers, including some very well-known researchers, sometimes proposed optimization algorithms and showed that they worked well on a few problems. But those optimization algorithms were subsequently shown not to really generalize that well to the wide range of neural networks you might want to train. So over time, I think the deep learning community actually developed some amount of skepticism about new optimization algorithms. And a lot of people felt that gradient descent with momentum really works well, and it was difficult to propose things that work much better. So RMSprop and the Adam optimization algorithm, which we'll talk about in this video, are among those rare algorithms that have really stood up, and have been shown to work well across a wide range of deep learning architectures. So this is one of the algorithms that I wouldn't hesitate to recommend you try, because many people have tried it and seen it work well on many problems. And the Adam optimization algorithm is basically taking momentum and RMSprop and putting them together. So let's see how that works. To implement Adam, you would initialize VdW = 0, SdW = 0, and similarly Vdb = 0, Sdb = 0. And then on iteration t, you would compute the derivatives: compute dW, db using the current mini-batch. So usually you do this with mini-batch gradient descent. And then you do the momentum exponentially weighted average: VdW = beta 1 VdW plus (1 minus beta 1) dW. I'm now calling this beta 1 to distinguish it from the hyperparameter beta 2 we'll use for the RMSprop portion of this. So this is exactly what we had when we were implementing momentum, except the hyperparameter is now called beta 1 instead of beta. And similarly, you have Vdb = beta 1 Vdb plus (1 minus beta 1) times db. And then you do the RMSprop update as well. So now you have a different hyperparameter beta 2: SdW = beta 2 SdW plus (1 minus beta 2) dW squared. And again, the squaring there is an element-wise squaring of your derivatives dW. And then Sdb is equal to beta 2 Sdb plus (1 minus beta 2) times db squared. So this is the momentum-like update with hyperparameter beta 1, and this is the RMSprop-like update with hyperparameter beta 2. In the typical implementation of Adam, you do implement bias correction. So you're going to have V corrected, where corrected means after bias correction: VdW corrected = VdW divided by (1 minus beta 1 to the power of t), if you've done t iterations. And similarly, Vdb corrected equals Vdb divided by (1 minus beta 1 to the power of t). And then similarly, you implement this bias correction on S as well: SdW corrected is SdW divided by (1 minus beta 2 to the t), and Sdb corrected equals Sdb divided by (1 minus beta 2 to the t). Finally, you perform the update. So W gets updated as W minus alpha times: if you were just implementing momentum you'd use VdW, or maybe VdW corrected, but now we add in the RMSprop portion of this, so we also divide by the square root of SdW corrected plus epsilon. And similarly, b gets updated by a similar formula: Vdb corrected, divided by the square root of Sdb corrected, plus epsilon. And so this algorithm combines the effect of gradient descent with momentum together with gradient descent with RMSprop. And this is a commonly used learning algorithm that has proven to be very effective for many different neural networks of a very wide variety of architectures. So this algorithm has a number of hyperparameters. The learning rate hyperparameter alpha is still important and usually needs to be tuned, so you just have to try a range of values and see what works. A common choice, really the default choice, for beta 1 is 0.9.
So this is the moving average, the weighted average, of dW; this is the momentum-like term. For the hyperparameter beta 2, the authors of the Adam paper, the inventors of the Adam algorithm, recommend 0.999. Again, this is computing the moving weighted average of dW squared as well as db squared. And then epsilon: the choice of epsilon doesn't matter very much, but the authors of the Adam paper recommend 10 to the minus 8. You really don't need to set this parameter, and it doesn't affect performance much at all. But when implementing Adam, what people usually do is just use the default values for beta 1 and beta 2, as well as epsilon. I don't think anyone ever really tunes epsilon. And then try a range of values of alpha to see what works best. You could also tune beta 1 and beta 2, but it's not done that often among the practitioners I know. So, where does the term "Adam" come from? Adam stands for Adaptive Moment Estimation. So beta 1 is computing the mean of the derivatives; this is called the first moment. And beta 2 is used to compute an exponentially weighted average of the squares, and that's called the second moment. So that gives rise to the name adaptive moment estimation. But everyone just calls it the Adam optimization algorithm. And, by the way, one of my long-time friends and collaborators is called Adam Coates. As far as I know, this algorithm doesn't have anything to do with him, except for the fact that I think he uses it sometimes. But sometimes I get asked that question, so just in case you're wondering. So that's it for the Adam optimization algorithm. With it, I think you will be able to train your neural networks much more quickly. But before we wrap up for this week, let's keep talking about hyperparameter tuning, as well as gaining some more intuitions about what the optimization problem for neural networks looks like. In the next video, we'll talk about learning rate decay.
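Putting the pieces together, here is a sketch in Python/NumPy of one Adam update with bias correction, using the default hyperparameters mentioned in the video (beta 1 = 0.9, beta 2 = 0.999, epsilon = 10 to the -8); the dictionary layout for parameters and gradients, and the default alpha, are assumed conventions for illustration.

import numpy as np

def update_with_adam(params, grads, v, s, t, alpha=0.001,
                     beta1=0.9, beta2=0.999, epsilon=1e-8):
    # One Adam update on iteration t (t starts at 1), combining momentum and RMSprop
    # with bias correction. v and s hold zero-initialized arrays shaped like the gradients.
    # alpha usually needs to be tuned; the other defaults are rarely changed.
    for key in params:                                   # key is "W1", "b1", ...
        g = grads["d" + key]
        # Momentum-like first moment
        v["d" + key] = beta1 * v["d" + key] + (1 - beta1) * g
        # RMSprop-like second moment (element-wise square)
        s["d" + key] = beta2 * s["d" + key] + (1 - beta2) * (g ** 2)
        # Bias correction
        v_corr = v["d" + key] / (1 - beta1 ** t)
        s_corr = s["d" + key] / (1 - beta2 ** t)
        # Combined update
        params[key] = params[key] - alpha * v_corr / (np.sqrt(s_corr) + epsilon)
    return params, v, s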
Learning rate decay - 6m
0:00
One of the things that might help speed up your learning algorithm is to slowly reduce your learning rate over time. We call this learning rate decay. Let's see how you can implement this. Let's start with an example of why you might want to implement learning rate decay. Suppose you're implementing mini-batch gradient descent with a reasonably small mini-batch; maybe a mini-batch has just 64 or 128 examples. Then as you iterate, your steps will be a little bit noisy, and they will tend towards this minimum over here, but they won't exactly converge. Your algorithm might just end up wandering around and never really converge, because you're using some fixed value for alpha, and there's just some noise in your different mini-batches. But if you were to slowly reduce your learning rate alpha, then during the initial phases, while your learning rate alpha is still large, you can still have relatively fast learning. But then as alpha gets smaller, the steps you take will be slower and smaller, and so you end up oscillating in a tighter region around this minimum, rather than wandering far away, even as training goes on and on. So the intuition behind slowly reducing alpha is that maybe during the initial steps of learning you could afford to take much bigger steps, but as learning approaches convergence, having a slower learning rate allows you to take smaller steps. So here's how you can implement learning rate decay. Recall that one epoch is one pass,
1:42
Through the data, right? So if you have a training set as follows, maybe you break it up into different mini-batches. Then the first pass through the training set is called the first epoch, and then the second pass is the second epoch, and so on. So one thing you could do is set your learning rate alpha equal to 1 over (1 plus a parameter, which I'm going to call the decay rate,
2:18
times the epoch-num), and all of this times some initial learning rate alpha 0. That is, alpha = alpha_0 / (1 + decay_rate * epoch_num). Note that the decay rate here becomes another hyper-parameter, which you might need to tune. So here's a concrete example.
2:35
If you take several epochs, so several passes through your data, and if alpha 0 = 0.2 and the decay-rate = 1, then during your first epoch, alpha will be 1 / (1 + 1) * alpha 0. So your learning rate will be 0.1. That's just evaluating this formula when the decay-rate is equal to 1 and the epoch-num is 1. On the second epoch, your learning rate decays to about 0.067. On the third, 0.05, on the fourth, 0.04, and so on. And feel free to evaluate more of these values yourself. And get a sense that, as a function of your epoch number, your learning rate gradually decreases, right, according to this formula up on top. So if you wish to use learning rate decay, what you can do is try a variety of values of both the hyper-parameter alpha 0 as well as this decay rate hyper-parameter, and then try to find the value that works well. Other than this formula for learning rate decay, there are a few other ways that people use. For example, this is called exponential decay, where alpha is equal to some number less than 1, such as 0.95, raised to the power of epoch-num, times alpha 0; that is, alpha = 0.95^epoch_num * alpha_0. So this will exponentially quickly decay your learning rate. Other formulas that people use are things like alpha = some constant over the square root of epoch-num, times alpha 0. Or some constant k, another hyper-parameter, over the square root of the mini-batch number t, times alpha 0. And sometimes you also see people use a learning rate that decreases in discrete steps, where for some number of steps you have some learning rate, and then after a while you decrease it by one half, after a while by one half again, and so on. So this is a discrete staircase.
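As an illustration of these schedules (a sketch I'm adding, not code from the lecture), here is a small Python snippet that prints the learning rate over the first few epochs for the 1 / (1 + decay_rate * epoch_num), exponential, and 1 / sqrt(epoch_num) formulas, using the numbers from the worked example above.

```python
alpha0, decay_rate = 0.2, 1.0

def inverse_decay(epoch):       # alpha_0 / (1 + decay_rate * epoch_num)
    return alpha0 / (1 + decay_rate * epoch)

def exponential_decay(epoch):   # 0.95^epoch_num * alpha_0
    return (0.95 ** epoch) * alpha0

def sqrt_decay(epoch, k=1.0):   # k / sqrt(epoch_num) * alpha_0
    return k / (epoch ** 0.5) * alpha0

for epoch in range(1, 5):
    print(epoch, round(inverse_decay(epoch), 3),
          round(exponential_decay(epoch), 3), round(sqrt_decay(epoch), 3))
# inverse_decay gives 0.1, 0.067, 0.05, 0.04 -- matching the worked example above.
```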
4:55
So so far, we‘ve talked about using some formula to govern how alpha, the learning rate, changes over time. One other thing that people sometimes do, is manual decay. And so if you‘re training just one model at a time, and if your model takes many hours, or even many days to train. What some people will do, is just watch your model as it‘s training over a large number of days. And then manually say, it looks like the learning rate slowed down, I‘m going to decrease alpha a little bit. Of course this works, this manually controlling alpha, really tuning alpha by hand, hour by hour, or day by day. This works only if you‘re training only a small number of models, but sometimes people do that as well. So now you have a few more options for how to control the learning rate alpha. Now, in case you‘re thinking, wow, this is a lot of hyper-parameters. How do I select amongst all these different options? I would say, don‘t worry about it for now. In next week, we‘ll talk more about how to systematically choose hyper-parameters. For me, I would say that learning rate decay is usually lower down on the list of things I try. Setting alpha, just a fixed value of alpha, and getting that to be well tuned, has a huge impact. Learning rate decay does help. Sometimes it can really help speed up training, but it is a little bit lower down my list in terms of the things I would try. But next week, when we talk about hyper-parameter tuning, you see more systematic ways to organize all of these hyper-parameters. And how to efficiently search amongst them. So that‘s it for learning rate decay. Finally, I was also going to talk a little bit about local optima, and saddle points, in neural networks. So you can have a little bit better intuition about the types of optimization problems your optimization algorithm is trying to solve, when you‘re trying to train these neural networks. Let‘s go on to the next video to see that.
The problem of local optima - 5m
0:00
In the early days of deep learning, people used to worry a lot about the optimization algorithm getting stuck in bad local optima. But as the theory of deep learning has advanced, our understanding of local optima is also changing. Let me show you how we now think about local optima and about the optimization problem in deep learning. This was a picture people used to have in mind when they worried about local optima. Maybe you are trying to optimize some set of parameters, call them W1 and W2, and the height of the surface is the cost function. In this picture, it looks like there are a lot of local optima in all those places, and it'd be easy for gradient descent, or one of the other algorithms, to get stuck in a local optimum rather than find its way to a global optimum. It turns out that if you are plotting a figure like this in two dimensions, then it's easy to create plots like this with a lot of different local optima. And these very low-dimensional plots used to guide people's intuition. But this intuition isn't actually correct. It turns out that if you create a neural network, most points of zero gradient are not local optima like the points in this picture. Instead, most points of zero gradient in a cost function are saddle points. So that's a point where the gradient is zero; again, the axes are maybe W1 and W2, and the height is the value of the cost function J. Informally, for a function in a very high dimensional space, if the gradient is zero, then in each direction it can either be a convex-like function or a concave-like function. And if you are in, say, a 20,000 dimensional space, then for it to be a local optimum, all 20,000 directions need to look like this. And so the chance of that happening is maybe very small, maybe two to the minus 20,000. Instead you're much more likely to get some directions where the curve bends up like so, as well as some directions where the function is bending down, rather than have them all bend upwards. So that's why in very high-dimensional spaces you're actually much more likely to run into a saddle point like that shown on the right, than a local optimum. As for why the surface is called a saddle point, if you can picture, maybe this is a sort of saddle you put on a horse, right? Maybe this is a horse. This is the head of the horse, this is the eye of the horse. Well, not a good drawing of a horse, but you get the idea. Then you, the rider, will sit here in the saddle. That's why this point here, where the derivative is zero, is called a saddle point. It's really the point on this saddle where you would sit, I guess, and that happens to have derivative zero. And so, one of the lessons we learned in the history of deep learning is that a lot of our intuitions about low-dimensional spaces, like what you can plot on the left, really don't transfer to the very high-dimensional spaces that our learning algorithms are operating over. Because if you have 20,000 parameters, then J is a function over a 20,000-dimensional vector, and you're much more likely to see saddle points than local optima. If local optima aren't a problem, then what is a problem? It turns out that plateaus can really slow down learning, and a plateau is a region where the derivative is close to zero for a long time. So if you're here, then gradient descent will move down the surface, and because the gradient is zero or near zero, the surface is quite flat. You can actually take a very long time, you know, to slowly find your way to maybe this point on the plateau.
And then because of a random perturbation to the left or right, maybe then finally (and I'm going to switch pen colors for clarity) your algorithm can find its way off the plateau. But it can take this very long time on the flat part before it finds its way here and gets off this plateau. So the takeaways from this video are, first, you're actually pretty unlikely to get stuck in bad local optima so long as you're training a reasonably large neural network with a lot of parameters, and the cost function J is defined over a relatively high-dimensional space. But second, plateaus are a problem, and they can actually make learning pretty slow. And this is where algorithms like momentum or RMSprop or Adam can really help your learning algorithm; these are scenarios where more sophisticated optimization algorithms, such as Adam, can actually speed up the rate at which you move down the plateau and then get off the plateau. So because your network is solving optimization problems over such high-dimensional spaces, to be honest, I don't think anyone has great intuitions about what these spaces really look like, and our understanding of them is still evolving. But I hope this gives you some better intuition about the challenges that the optimization algorithms may face. So, congratulations on coming to the end of this week's content. Please take a look at this week's quiz as well as the programming exercise. I hope you enjoy practicing some of these ideas in this week's programming exercise, and I look forward to seeing you at the start of next week's videos.
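To make the saddle-point picture concrete (an illustration I'm adding, not part of the lecture), here is a tiny NumPy sketch using the cost J(w1, w2) = w1^2 - w2^2: the gradient at the origin is zero, but the curvature bends up along w1 and down along w2, so the origin is a saddle point rather than a local optimum.

```python
import numpy as np

def J(w):
    # A simple 2-D cost with a saddle at the origin: convex along w1, concave along w2.
    return w[0] ** 2 - w[1] ** 2

def grad_J(w):
    return np.array([2 * w[0], -2 * w[1]])

origin = np.zeros(2)
print(grad_J(origin))                  # [0. 0.] -- zero gradient at the origin

# Second derivatives: +2 along w1 (bends up) and -2 along w2 (bends down),
# so the origin is a saddle point, not a minimum.
hessian_diag = np.array([2.0, -2.0])
print(hessian_diag)
```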
(Optional) Heroes of Deep Learning - Yuanqing Lin interview - 13m
0:02
Welcome, Yuanqing. I‘m really glad you could join us today. Sure. Today you are the head of IT research and when the Chinese government, the government of China, was looking for someone to start up and build a National Deep Learning Research Lab, they tapped you to help start this thing. So, you know, arguably, I think maybe you‘re the number one deep learning person in the entire country of China. I‘d like to ask you a lot questions about your work, but before that, I want to hear about your personal story. So how did you end up getting to do this work that you do? Yeah, so, actually, before my Ph.D. program, my major was in Optics, it‘s more like in Physics. I think I had a fairly good kind of background, a very good background, on maths. After I came to the US, then I was thinking, what kind of major can I take for my Ph.D. program? I was thinking, well, I guess I could go for Optics or go for something else. Back to like early 2000, I think nanotechnology was very very hot. But I was thinking that probably I should look at something even more exciting. And that leaves a good chance that I was taking some classes at UPenn and I got to know Dan Lee. So later, he become my Ph.D. adviser. I was thinking, machine learning was a great thing to do. And I got really excited and I changed my major. So, I did my Ph.D at UPenn. I majored in machine learning. I was there for five years and it was kind of very exciting and I learned a lot of things from scratch and lots of algorithms, even like PCAs. I didn‘t know all those before. I feel like I was learning new things every day. So it was a very, very exciting experience for me. This was one of those things of a lot of starts. Although, you know, you just did a lot of work and it was an underappreciated for its time. Right, that‘s right. Yes. So I think NEC was exciting place, and I was there at the beginning as a researcher. Again, I also like to feel like, whoa. I learned lots of things. Actually, later at NEC, I started working on computer visions. I actually started working on computer vision very late, relatively late. And the first thing I did was I participated in ImageNet Challenge. That‘s the first year of ImageNet Challenge. I was kind of managing a team to work on a project. It was lucky, we were quite lucky that we were quite strong, and in the end, we actually got the number one place. Overwhelming, number one place in the contest. So you are the first ever in the world ImageNet Competition winner? Right. Yeah, and I was the person that did a presentation at that workshop. It was a really nice experience for me, and that actually get me into Lisa [inaudible] computer vision tasks. I had been working on Liza [inaudible] probably since then. When New York Times head paper came out, and also later on AlexNet came out, it really blow my mind. I thought, wow, deep learning is so powerful. Since then, I put a lot of effort to work on those. So, as a head of China‘s National Lab, National Research Lab on deep learning, there must be a lot of exciting activities going on there. So, for the global audience, who are watching this, what should they know about what‘s happening with this National Lab? The mission of this National Engineering Lab is to build a really large deep learning platform, and hopefully be the biggest one, or at least the biggest one in China. And this platform would offer people deep learning framework like Pelo Pelo. 
And we offer people really large scale computing resource, and we also offer people really large, really big kind of data, and if people are able to develop a research or develop good technology on these platform, we also offer them big applications. So for example, the technology can be proved into some big applications in Baidu so the level of technology could get integrate and improve it. So, we believe that combining those resources altogether, I think, is going to be a really powerful platform. You can also get one example on each. For example, right now, if we publish a paper, and someone want to reproduce it, the best thing to do is to provide a code to somewhere, and you could download the code to your computer and that you also try and find that the data sets somewhere. And you probably also need to get good [inaudible] for your computing resources to run smoothly. So this will easily take you some effort at least. National Lab things will become much easier. So if someone‘s using these platform to write the paper, to do that work and write a paper, and the lab who have the code on these platform and the computing structure is already set up for this code, and data‘s allowed too. So basically, you just need a common line to lift the database up. So, this is a huge relief for loss of reproducibility issues in combination with science. So you easily, just few seconds, you can start learning something that you see in a paper. Yeah. So this is extremely powerful. So this is just one example that we are working on to make sure that we are providing a really powerful platform to the community and to the industry. That‘s amazing. That really speeds up deep learning research. Yeah. Can you give a sense of how much resources the Chinese government is putting to back you for this Deep Learning National Lab? So, I think that for this National Engineering Lab, I think government can invest some funding here to build up infrastructure. But I think more importantly, these are going to be a flagship in China that are going to lead a lots of deepening the efforts, including like national project, and the laws of policies, and things. So this is actually extremely powerful, and I think by doing that, we are really honored to get this lab. You are somewhat at the heart of deep learning in China. So, there‘s a lot of activity in China that the global audience isn‘t aware of yet and hasn‘t seen yet, so what should people outside China know about deep learning in China? Yeah. I think in China, especially in the past few years, I think deepening empowered a product, so it‘s really booming, ranging from search engines, to, like, a phrase recognition, surveillance, to the e-commerce, lots of place. I think that they are investing big effort into deep learning and also really make use of technology to make the business much more powerful. And this actually is very important for developing a high technology in general. I think for myself, and also lots of people share this, we believe that actually it‘s really important to this, what‘s often called [inaudible] loop. For example, when we start out to think of building some technologies that will have some initial data, and which we try to do with some initial algorithm, it will launches our initial product for that service. Then, after, later, that will get the data for users. And the others will get more data, so we would develop a better algorithms. Because we see more data, we know what would be the better algorithm. 
So we have more data and a better algorithm, we will be able to have better technology for product service, and then definitely we hope that we will be able attract more users using the product. The technology is better. And then we will get more data. So this is a very good, positive move. And this is very special, especially for AI related technologies, for traditional technology like a laser. I was working on that before. So the course of technology is going to be more linear. But before, this AI technology, because of this positive loop, you can imagine that definitely, [inaudible] come with really fast growth of technology. And [inaudible] is actually super important when we design a research into them, when we design our ND. We work on the direction that we‘re able to get to this quick improvement period. But if our whole business were not able to fund these positive loop, if we are not able to fund this strong positive loop, this will not work out because someone else will have a strong vision to fund this strong loop and they will get to this place much more quickly than you are. So this [inaudible] an important logic for us when we‘re looking at, say, hey, you need a company, what direction should we work on, and what direction we should not work on. This is definitely a really important factor to look at. Today, both in China, in the U.S., and globally, there are a lot of people wanting to enter deep learning and wanting to enter AI. What advice would you have for someone that wants to get into this field? So now, there‘s definitely people who will start with open source frameworks. I think that‘s extremely powerful to many starters. I think when I was studying my deep learning study, there was not much open source resources. I think nowadays, in AI, especially in Deep learning, it‘s a very good community, and there are multiple really good people in the frameworks. It‘s always [inaudible] , a cafe. Now they call it cafe too, right? In China we have a really good Pelo-pelo. And even in most [inaudible] online, they have lots of courses to teach you how to use those. And also, nowadays, there‘s also many publically available benchmark and the people will see, hey, really skillful, really experienced people like, how well they could do on those benchmark. So basically, time to get familiar with deep learning. I think those are really good starting point. How did you gain that understanding? Actually, I do it the opposite way. I learned PCA, I learned LDA, all of those before I learned deep learning actually. So, basically, it‘s also a good way, I feel. We kind of lay down lots of foundations. We learned graphic model. So these are all important. Although right now, deep learning is beyond [inaudible]. But knowing laws, will actually give you very good intuition about how deep learning works. And then one day, probably leads connection of deep learning into laws like a framework or approach. I think there are already lots of connections. And the laws actually make deep learning richer. I mean, there are richer ways of doing Deep Learning. Yeah. So, I feel like it‘s good to start with lots of open source, and those are extremely powerful resource. I will also suggest that you also do learn those basic things about machine learning. So, thank you. That was fascinating. Even though I‘ve known you for a long time, there are a lot of details you‘re thinking that I didn‘t realize until now. So, thank you very much. Thanks so much for having me.
Hyperparameter tuning, Batch Normalization and Programming Frameworks
Tuning process - 7m
0:00
Hi, and welcome back. You‘ve seen by now that changing neural nets can involve setting a lot of different hyperparameters. Now, how do you go about finding a good setting for these hyperparameters? In this video, I want to share with you some guidelines, some tips for how to systematically organize your hyperparameter tuning process, which hopefully will make it more efficient for you to converge on a good setting of the hyperparameters. One of the painful things about training deepness is the sheer number of hyperparameters you have to deal with, ranging from the learning rate alpha to the momentum term beta, if using momentum, or the hyperparameters for the Adam Optimization Algorithm which are beta one, beta two, and epsilon. Maybe you have to pick the number of layers, maybe you have to pick the number of hidden units for the different layers, and maybe you want to use learning rate decay, so you don‘t just use a single learning rate alpha. And then of course, you might need to choose the mini-batch size. So it turns out, some of these hyperparameters are more important than others. The most learning applications I would say, alpha, the learning rate is the most important hyperparameter to tune. Other than alpha, a few other hyperparameters I tend to would maybe tune next, would be maybe the momentum term, say, 0.9 is a good default. I‘d also tune the mini-batch size to make sure that the optimization algorithm is running efficiently. Often I also fiddle around with the hidden units. Of the ones I‘ve circled in orange, these are really the three that I would consider second in importance to the learning rate alpha and then third in importance after fiddling around with the others, the number of layers can sometimes make a huge difference, and so can learning rate decay. And then when using the Adam algorithm I actually pretty much never tuned beta one, beta two, and epsilon. Pretty much I always use 0.9, 0.999 and tenth minus eight although you can try tuning those as well if you wish. But hopefully it does give you some rough sense of what hyperparameters might be more important than others, alpha, most important, for sure, followed maybe by the ones I‘ve circle in orange, followed maybe by the ones I circled in purple. But this isn‘t a hard and fast rule and I think other deep learning practitioners may well disagree with me or have different intuitions on these. Now, if you‘re trying to tune some set of hyperparameters, how do you select a set of values to explore? In earlier generations of machine learning algorithms, if you had two hyperparameters, which I‘m calling hyperparameter one and hyperparameter two here, it was common practice to sample the points in a grid like so and systematically explore these values. Here I am placing down a five by five grid. In practice, it could be more or less than the five by five grid but you try out in this example all 25 points and then pick whichever hyperparameter works best. And this practice works okay when the number of hyperparameters was relatively small. In deep learning, what we tend to do, and what I recommend you do instead, is choose the points at random. So go ahead and choose maybe of same number of points, right? 25 points and then try out the hyperparameters on this randomly chosen set of points. And the reason you do that is that it‘s difficult to know in advance which hyperparameters are going to be the most important for your problem. 
And as you saw in the previous slide, some hyperparameters are actually much more important than others. So to take an example, let‘s say hyperparameter one turns out to be alpha, the learning rate. And to take an extreme example, let‘s say that hyperparameter two was that value epsilon that you have in the denominator of the Adam algorithm. So your choice of alpha matters a lot and your choice of epsilon hardly matters. So if you sample in the grid then you‘ve really tried out five values of alpha and you might find that all of the different values of epsilon give you essentially the same answer. So you‘ve now trained 25 models and only got into trial five values for the learning rate alpha, which I think is really important. Whereas in contrast, if you were to sample at random, then you will have tried out 25 distinct values of the learning rate alpha and therefore you be more likely to find a value that works really well. I‘ve explained this example, using just two hyperparameters. In practice, you might be searching over many more hyperparameters than these, so if you have, say, three hyperparameters, I guess instead of searching over a square, you‘re searching over a cube where this third dimension is hyperparameter three and then by sampling within this three-dimensional cube you get to try out a lot more values of each of your three hyperparameters. And in practice you might be searching over even more hyperparameters than three and sometimes it‘s just hard to know in advance which ones turn out to be the really important hyperparameters for your application and sampling at random rather than in the grid shows that you are more richly exploring set of possible values for the most important hyperparameters, whatever they turn out to be. When you sample hyperparameters, another common practice is to use a coarse to fine sampling scheme. So let‘s say in this two-dimensional example that you sample these points, and maybe you found that this point work the best and maybe a few other points around it tended to work really well, then in the course of the final scheme what you might do is zoom in to a smaller region of the hyperparameters and then sample more density within this space. Or maybe again at random, but to then focus more resources on searching within this blue square if you‘re suspecting that the best setting, the hyperparameters, may be in this region. So after doing a coarse sample of this entire square, that tells you to then focus on a smaller square. You can then sample more densely into smaller square. So this type of a coarse to fine search is also frequently used. And by trying out these different values of the hyperparameters you can then pick whatever value allows you to do best on your training set objective or does best on your development set or whatever you‘re trying to optimize in your hyperparameter search process. So I hope this gives you a way to more systematically organize your hyperparameter search process. The two key takeaways are, use random sampling and adequate search and optionally consider implementing a coarse to fine search process. But there‘s even more to hyperparameter search than this. Let‘s talk more in the next video about how to choose the right scale on which to sample your hyperparameters.
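As a rough illustration of random sampling versus a grid, and of the coarse-to-fine idea described above (a sketch I'm adding, with made-up hyperparameter ranges, not the lecture's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Grid search: a 5 x 5 grid only ever tries 5 distinct values of each hyperparameter.
grid_alpha = np.linspace(0.001, 0.1, 5)
grid_hidden_units = np.linspace(50, 100, 5)

# Random search: 25 trials give 25 distinct values of each hyperparameter.
random_trials = [{"alpha": rng.uniform(0.001, 0.1),
                  "hidden_units": int(rng.integers(50, 101))}
                 for _ in range(25)]

# Coarse-to-fine: suppose the best random trial landed near alpha ~ 0.02 and
# hidden_units ~ 80; sample more densely inside a smaller box around that point.
fine_trials = [{"alpha": rng.uniform(0.01, 0.03),
                "hidden_units": int(rng.integers(70, 91))}
               for _ in range(25)]

print(random_trials[0], fine_trials[0])
```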
Using an appropriate scale to pick hyperparameters - 8m
0:00
In the last video, you saw how sampling at random, over the range of hyperparameters, can allow you to search over the space of hyperparameters more efficiently. But it turns out that sampling at random doesn‘t mean sampling uniformly at random, over the range of valid values. Instead, it‘s important to pick the appropriate scale on which to explore the hyperparamaters. In this video, I want to show you how to do that. Let‘s say that you‘re trying to choose the number of hidden units, n[l], for a given layer l. And let‘s say that you think a good range of values is somewhere from 50 to 100. In that case, if you look at the number line from 50 to 100, maybe picking some number values at random within this number line. There‘s a pretty visible way to search for this particular hyperparameter. Or if you‘re trying to decide on the number of layers in your neural network, we‘re calling that capital L. Maybe you think the total number of layers should be somewhere between 2 to 4. Then sampling uniformly at random, along 2, 3 and 4, might be reasonable. Or even using a grid search, where you explicitly evaluate the values 2, 3 and 4 might be reasonable. So these were a couple examples where sampling uniformly at random over the range you‘re contemplating, might be a reasonable thing to do. But this is not true for all hyperparameters. Let‘s look at another example. Say your searching for the hyperparameter alpha, the learning rate. And let‘s say that you suspect 0.0001 might be on the low end, or maybe it could be as high as 1. Now if you draw the number line from 0.0001 to 1, and sample values uniformly at random over this number line. Well about 90% of the values you sample would be between 0.1 and 1. So you‘re using 90% of the resources to search between 0.1 and 1, and only 10% of the resources to search between 0.0001 and 0.1. So that doesn‘t seem right. Instead, it seems more reasonable to search for hyperparameters on a log scale. Where instead of using a linear scale, you‘d have 0.0001 here, and then 0.001, 0.01, 0.1, and then 1. And you instead sample uniformly, at random, on this type of logarithmic scale. Now you have more resources dedicated to searching between 0.0001 and 0.001, and between 0.001 and 0.01, and so on. So in Python, the way you implement this,
2:55
is let r = -4 * np.random.rand(). And then a randomly chosen value of alpha, would be alpha = 10 to the power of r.
3:08
So after this first line, r will be a random number between -4 and 0. And so alpha here will be between 10 to the -4 and 10 to the 0. So 10 to the -4 is this left thing, this 10 to the -4. And 1 is 10 to the 0. In a more general case, if you‘re trying to sample between 10 to the a, to 10 to the b, on the log scale. And in this example, this is 10 to the a. And you can figure out what a is by taking the log base 10 of 0.0001, which is going to tell you a is -4. And this value on the right, this is 10 to the b. And you can figure out what b is, by taking log base 10 of 1, which tells you b is equal to 0.
3:58
So what you do, is then sample r uniformly, at random, between a and b. So in this case, r would be between -4 and 0. And you can set alpha, your randomly sampled hyperparameter value, as 10 to the r, okay? So just to recap, to sample on the log scale, you take the low value, take logs to figure out what a is. Take the high value, take a log to figure out what b is. So now you're trying to sample from 10 to the a to 10 to the b, on a log scale. You set r uniformly, at random, between a and b. And then you set the hyperparameter to be 10 to the r. So that's how you implement sampling on this logarithmic scale. Finally, one other tricky case is sampling the hyperparameter beta, used for computing exponentially weighted averages. So let's say you suspect that beta should be somewhere between 0.9 and 0.999. Maybe this is the range of values you want to search over. So remember that, when computing exponentially weighted averages, using 0.9 is like averaging over the last 10 values, kind of like taking the average of 10 days' temperature, whereas using 0.999 is like averaging over the last 1,000 values. So similar to what we saw on the last slide, if you want to search between 0.9 and 0.999, it doesn't make sense to sample on the linear scale, right? Uniformly, at random, between 0.9 and 0.999. So the best way to think about this is that we want to explore the range of values for 1 minus beta, which is going to range from 0.1 to 0.001. And so we'll sample values of 1 minus beta, ranging from 0.1 down to 0.001. So using the method we figured out on the previous slide, this is 10 to the -1, this is 10 to the -3. Notice that on the previous slide, we had the small value on the left and the large value on the right, but here we have it reversed: the large value on the left and the small value on the right. So what you do is you sample r uniformly, at random, from -3 to -1. And you set 1 - beta = 10 to the r, and so beta = 1 - 10 to the r. And this becomes your randomly sampled value of your hyperparameter, chosen on the appropriate scale. And hopefully this makes sense, in that this way, you spend as much resources exploring the range 0.9 to 0.99 as you would exploring 0.99 to 0.999. So the more formal mathematical justification for why we're doing this, for why it's such a bad idea to sample on a linear scale, is that when beta is close to 1, the sensitivity of the results you get changes, even with very small changes to beta. So if beta goes from 0.9 to 0.9005, it's no big deal; this is hardly any change in your results, and both are averaging over roughly the last 10 values. But if beta goes from 0.999 to 0.9995, this will have a huge impact on exactly what your algorithm is doing, right? Here it's gone from an exponentially weighted average over about the last 1,000 examples, to now, the last 2,000 examples. And it's because that formula we have, 1 / (1 - beta), is very sensitive to small changes in beta when beta is close to 1. So what this whole sampling process does is cause you to sample more densely in the region where beta is close to 1.
7:59
Or, alternatively, when 1- beta is close to 0. So that you can be more efficient in terms of how you distribute the samples, to explore the space of possible outcomes more efficiently. So I hope this helps you select the right scale on which to sample the hyperparameters. In case you don‘t end up making the right scaling decision on some hyperparameter choice, don‘t worry to much about it. Even if you sample on the uniform scale, where sum of the scale would have been superior, you might still get okay results. Especially if you use a coarse to fine search, so that in later iterations, you focus in more on the most useful range of hyperparameter values to sample. I hope this helps you in your hyperparameter search. In the next video, I also want to share with you some thoughts of how to organize your hyperparameter search process. That I hope will make your workflow a bit more efficient.
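Putting the last two slides together, here is a short Python sketch of sampling alpha on a log scale and sampling beta through 1 - beta. It is my generalization of the r = -4 * np.random.rand() example from the video to arbitrary endpoints, not code shown in the lecture.

```python
import numpy as np

def sample_log_scale(low, high):
    """Sample uniformly on a log scale between low and high (e.g. 0.0001 and 1)."""
    a, b = np.log10(low), np.log10(high)   # exponents of the endpoints
    r = np.random.uniform(a, b)            # r is uniform between a and b
    return 10 ** r

# Learning rate alpha between 10^-4 and 1.
alpha = sample_log_scale(1e-4, 1.0)

# Beta between 0.9 and 0.999: sample 1 - beta between 0.001 and 0.1 on a log scale.
one_minus_beta = sample_log_scale(0.001, 0.1)
beta = 1 - one_minus_beta

print(alpha, beta)
```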
Hyperparameters tuning in practice: Pandas vs. Caviar - 6m
0:00
You have now heard a lot about how to search for good hyperparameters. Before wrapping up our discussion on hyperparameter search, I want to share with you just a couple of final tips and tricks for how to organize your hyperparameter search process. Deep learning today is applied to many different application areas and that intuitions about hyperparameter settings from one application area may or may not transfer to a different one. There is a lot of cross-fertilization among different applications‘ domains, so for example, I‘ve seen ideas developed in the computer vision community, such as Confonets or ResNets, which we‘ll talk about in a later course, successfully applied to speech. I‘ve seen ideas that were first developed in speech successfully applied in NLP, and so on. So one nice development in deep learning is that people from different application domains do read increasingly research papers from other application domains to look for inspiration for cross-fertilization. In terms of your settings for the hyperparameters, though, I‘ve seen that intuitions do get stale. So even if you work on just one problem, say logistics, you might have found a good setting for the hyperparameters and kept on developing your algorithm, or maybe seen your data gradually change over the course of several months, or maybe just upgraded servers in your data center. And because of those changes, the best setting of your hyperparameters can get stale. So I recommend maybe just retesting or reevaluating your hyperparameters at least once every several months to make sure that you‘re still happy with the values you have. Finally, in terms of how people go about searching for hyperparameters, I see maybe two major schools of thought, or maybe two major different ways in which people go about it. One way is if you babysit one model. And usually you do this if you have maybe a huge data set but not a lot of computational resources, not a lot of CPUs and GPUs, so you can basically afford to train only one model or a very small number of models at a time. In that case you might gradually babysit that model even as it‘s training. So, for example, on Day 0 you might initialize your parameter as random and then start training. And you gradually watch your learning curve, maybe the cost function J or your dataset error or something else, gradually decrease over the first day. Then at the end of day one, you might say, gee, looks it‘s learning quite well, I‘m going to try increasing the learning rate a little bit and see how it does. And then maybe it does better. And then that‘s your Day 2 performance. And after two days you say, okay, it‘s still doing quite well. Maybe I‘ll fill the momentum term a bit or decrease the learning variable a bit now, and then you‘re now into Day 3. And every day you kind of look at it and try nudging up and down your parameters. And maybe on one day you found your learning rate was too big. So you might go back to the previous day‘s model, and so on. But you‘re kind of babysitting the model one day at a time even as it‘s training over a course of many days or over the course of several different weeks. So that‘s one approach, and people that babysit one model, that is watching performance and patiently nudging the learning rate up or down. But that‘s usually what happens if you don‘t have enough computational capacity to train a lot of models at the same time. The other approach would be if you train many models in parallel. 
So you might have some setting of the hyperparameters and just let it run by itself ,either for a day or even for multiple days, and then you get some learning curve like that; and this could be a plot of the cost function J or cost of your training error or cost of your dataset error, but some metric in your tracking. And then at the same time you might start up a different model with a different setting of the hyperparameters. And so, your second model might generate a different learning curve, maybe one that looks like that. I will say that one looks better. And at the same time, you might train a third model, which might generate a learning curve that looks like that, and another one that, maybe this one diverges so it looks like that, and so on. Or you might train many different models in parallel, where these orange lines are different models, right, and so this way you can try a lot of different hyperparameter settings and then just maybe quickly at the end pick the one that works best. Looks like in this example it was, maybe this curve that look best. So to make an analogy, I‘m going to call the approach on the left the panda approach. When pandas have children, they have very few children, usually one child at a time, and then they really put a lot of effort into making sure that the baby panda survives. So that‘s really babysitting. One model or one baby panda. Whereas the approach on the right is more like what fish do. I‘m going to call this the caviar strategy. There‘s some fish that lay over 100 million eggs in one mating season. But the way fish reproduce is they lay a lot of eggs and don‘t pay too much attention to any one of them but just see that hopefully one of them, or maybe a bunch of them, will do well. So I guess, this is really the difference between how mammals reproduce versus how fish and a lot of reptiles reproduce. But I‘m going to call it the panda approach versus the caviar approach, since that‘s more fun and memorable. So the way to choose between these two approaches is really a function of how much computational resources you have. If you have enough computers to train a lot of models in parallel,
5:31
then by all means take the caviar approach and try a lot of different hyperparameters and see what works. But in some application domains, I see this in some online advertising settings as well as in some computer vision applications, where there‘s just so much data and the models you want to train are so big that it‘s difficult to train a lot of models at the same time. It‘s really application dependent of course, but I‘ve seen those communities use the panda approach a little bit more, where you are kind of babying a single model along and nudging the parameters up and down and trying to make this one model work. Although, of course, even the panda approach, having trained one model and then seen it work or not work, maybe in the second week or the third week, maybe I should initialize a different model and then baby that one along just like even pandas, I guess, can have multiple children in their lifetime, even if they have only one, or a very small number of children, at any one time. So hopefully this gives you a good sense of how to go about the hyperparameter search process. Now, it turns out that there‘s one other technique that can make your neural network much more robust to the choice of hyperparameters. It doesn‘t work for all neural networks, but when it does, it can make the hyperparameter search much easier and also make training go much faster. Let‘s talk about this technique in the next video.
Normalizing activations in a network - 8m
0:00
In the rise of deep learning, one of the most important ideas has been an algorithm called batch normalization, created by two researchers, Sergey Ioffe and Christian Szegedy. Batch normalization makes your hyperparameter search problem much easier and makes your neural network much more robust to the choice of hyperparameters; a much bigger range of hyperparameters works well. It will also enable you to much more easily train even very deep networks. Let's see how batch normalization works. When training a model such as logistic regression, you might remember that normalizing the input features can speed up learning: you compute the means, subtract off the means from your training set, and compute the variances.
0:44
The sum of xi squared. This is an element-wise squaring.
0:49
And then normalize your data set according to the variances. And we saw in an earlier video how this can turn the contours of your learning problem from something that might be very elongated to something that is more round, and easier for an algorithm like gradient descent to optimize. So this works in terms of normalizing the input feature values to a neural network, or to logistic regression. Now, how about a deeper model? You have not just input features x, but in this layer you have activations a1, in this layer you have activations a2, and so on. So if you want to train the parameters, say w3, b3, then
1:32
wouldn‘t it be nice if you can normalize the mean and variance of a2 to make the training of w3, b3 more efficient?
1:43
In the case of logistic regression, we saw how normalizing x1, x2, x3 maybe helps you train w and b more efficiently. So here, the question is, for any hidden layer, can we normalize,
1:57
The values of a, let‘s say a2, in this example but really any hidden layer, so as to train w3 b3 faster, right? Since a2 is the input to the next layer, that therefore affects your training of w3 and b3.
2:20
So this is what batch norm does, batch normalization, or batch norm for short, does. Although technically, we‘ll actually normalize the values of not a2 but z2. There are some debates in the deep learning literature about whether you should normalize the value before the activation function, so z2, or whether you should normalize the value after applying the activation function, a2. In practice, normalizing z2 is done much more often. So that‘s the version I‘ll present and what I would recommend you use as a default choice. So here is how you will implement batch norm. Given some intermediate values, In your neural net,
3:09
Let‘s say that you have some hidden unit values z1 up to zm, and this is really from some hidden layer, so it‘d be more accurate to write this as z for some hidden layer i for i equals 1 through m. But to reduce writing, I‘m going to omit this [l], just to simplify the notation on this line. So given these values, what you do is compute the mean as follows. Okay, and all this is specific to some layer l, but I‘m omitting the [l]. And then you compute the variance using pretty much the formula you would expect and then you would take each the zis and normalize it. So you get zi normalized by subtracting off the mean and dividing by the standard deviation. For numerical stability, we usually add epsilon to the denominator like that just in case sigma squared turns out to be zero in some estimate. And so now we‘ve taken these values z and normalized them to have mean 0 and standard unit variance. So every component of z has mean 0 and variance 1. But we don‘t want the hidden units to always have mean 0 and variance 1. Maybe it makes sense for hidden units to have a different distribution, so what we‘ll do instead is compute, I‘m going to call this z tilde = gamma zi norm + beta. And here, gamma and beta are learnable parameters of your model.
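As a concrete sketch of the equations just described (my illustration, not the lecture's code), here is the batch norm forward computation for one layer's pre-activations z over a mini-batch, in NumPy:

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Z has shape (n_units, m) for a mini-batch of m examples.
    gamma and beta have shape (n_units, 1) and are learnable parameters."""
    mu = np.mean(Z, axis=1, keepdims=True)        # per-unit mean over the mini-batch
    sigma2 = np.var(Z, axis=1, keepdims=True)     # per-unit variance over the mini-batch
    Z_norm = (Z - mu) / np.sqrt(sigma2 + eps)     # mean 0, variance 1 for every unit
    Z_tilde = gamma * Z_norm + beta               # rescale to a learnable mean and variance
    return Z_tilde

# Tiny usage example: 3 hidden units, mini-batch of 5 examples.
Z = np.random.randn(3, 5) * 4 + 7
gamma = np.ones((3, 1))
beta = np.zeros((3, 1))
print(batch_norm_forward(Z, gamma, beta).mean(axis=1))  # approximately 0 for each unit
```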
4:58
So using gradient descent, or some other algorithm, like gradient descent with momentum, or RMSprop, or Adam, you would update the parameters gamma and beta just as you would update the weights of your neural network. Now, notice that the effect of gamma and beta is that they allow you to set the mean of z tilde to be whatever you want it to be. In fact, if gamma equals the square root of sigma squared
5:28
plus epsilon, so if gamma were equal to this denominator term. And if beta were equal to mu, so this value up here, then the effect of gamma z norm plus beta is that it would exactly invert this equation. So if this is true, then actually z tilde i is equal to zi. And so by an appropriate setting of the parameters gamma and beta, this normalization step, that is, these four equations is just computing essentially the identity function. But by choosing other values of gamma and beta, this allows you to make the hidden unit values have other means and variances as well. And so the way you fit this into your neural network is, whereas previously you were using these values z1, z2, and so on, you would now use z tilde i, Instead of zi for the later computations in your neural network. And you want to put back in this [l] to explicitly denote which layer it is in, you can put it back there. So the intuition I hope you‘ll take away from this is that we saw how normalizing the input features x can help learning in a neural network. And what batch norm does is it applies that normalization process not just to the input layer, but to the values even deep in some hidden layer in the neural network. So it will apply this type of normalization to normalize the mean and variance of some of your hidden units‘ values, z. But one difference between the training input and these hidden unit values is you might not want your hidden unit values be forced to have mean 0 and variance 1. For example, if you have a sigmoid activation function, you don‘t want your values to always be clustered here. You might want them to have a larger variance or have a mean that‘s different than 0, in order to better take advantage of the nonlinearity of the sigmoid function rather than have all your values be in just this linear regime. So that‘s why with the parameters gamma and beta, you can now make sure that your zi values have the range of values that you want. But what it does really is it then shows that your hidden units have standardized mean and variance, where the mean and variance are controlled by two explicit parameters gamma and beta which the learning algorithm can set to whatever it wants. So what it really does is it normalizes in mean and variance of these hidden unit values, really the zis, to have some fixed mean and variance. And that mean and variance could be 0 and 1, or it could be some other value, and it‘s controlled by these parameters gamma and beta. So I hope that gives you a sense of the mechanics of how to implement batch norm, at least for a single layer in the neural network. In the next video, I‘m going to show you how to fit batch norm into a neural network, even a deep neural network, and how to make it work for the many different layers of a neural network. And after that, we‘ll get some more intuition about why batch norm could help you train your neural network. So in case why it works still seems a little bit mysterious, stay with me, and I think in two videos from now we‘ll really make that clearer.
Fitting Batch Norm into a neural network - 12m
0:00
So you have seen the equations for how to implement Batch Norm for maybe a single hidden layer. Let's see how it fits into the training of a deep network. So, let's say you have a neural network like this. You've seen me say before that you can view each of the units as computing two things: first, it computes Z, and then it applies the activation function to compute A. And so we can think of each of these circles as representing a two-step computation. And similarly for the next layer, that is, Z[2]1 and A[2]1, and so on. So, if you were not applying Batch Norm, you would have an input X fed into the first hidden layer, and then first compute Z1, and this is governed by the parameters W1 and B1. And then ordinarily, you would feed Z1 into the activation function to compute A1. But what we do with Batch Norm is take this value Z1, and apply Batch Norm, sometimes abbreviated BN, to it, and that's going to be governed by parameters Beta 1 and Gamma 1, and this will give you this new normalized value Z tilde 1. And then you feed that to the activation function to get A1, which is G1 applied to Z tilde 1. Now, you've done the computation for the first layer, where this Batch Norm step really occurs in between the computation of Z and A. Next, you take this value A1 and use it to compute Z2, and this is now governed by W2, B2. And similar to what you did for the first layer, you would take Z2 and apply it through Batch Norm, which we abbreviate to BN now. This is governed by Batch Norm parameters specific to the next layer, so Beta 2, Gamma 2, and now this gives you Z tilde 2, and you use that to compute A2 by applying the activation function, and so on. So once again, the Batch Norm step happens between computing Z and computing A. And the intuition is that, instead of using the un-normalized value Z1, you can use the normalized value Z tilde 1; that's the first layer. For the second layer as well, instead of using the un-normalized value Z2, you can use the mean- and variance-normalized value Z tilde 2. So the parameters of your network are going to be W1, B1, and so on. It turns out we'll get rid of some of these parameters, but we'll see why on the next slide. For now, imagine the parameters are the usual W1, B1, ..., WL, BL, and we have added to this new network additional parameters Beta 1, Gamma 1, Beta 2, Gamma 2, and so on, for each layer in which you are applying Batch Norm. For clarity, note that these Betas here have nothing to do with the hyperparameter beta that we had for momentum and for computing the various exponentially weighted averages. The authors of the Adam paper use Beta in their paper to denote that hyperparameter, and the authors of the Batch Norm paper used Beta to denote this parameter, but these are two completely different Betas. I decided to stick with Beta in both cases, in case you read the original papers. But the Beta 1, Beta 2, and so on, that Batch Norm tries to learn is a different Beta than the hyperparameter Beta used in momentum and the Adam and RMSprop algorithms. So now that these are the new parameters of your algorithm, you would then use whatever optimization you want, such as gradient descent, to train them. For example, you might compute d Beta L for a given layer, and then update the parameter Beta as Beta minus the learning rate times d Beta L. And you can also use Adam or RMSprop or momentum to update the parameters Beta and Gamma, not just gradient descent.
And even though in the previous video I explained what the Batch Norm operation does, computing means and variances and subtracting and dividing by them, if you are using a deep learning programming framework, usually you won't have to implement the Batch Norm step or the Batch Norm layer yourself. In the programming frameworks, this can be just one line of code. So for example, in the TensorFlow framework, you can implement Batch Normalization with a single built-in function. We'll talk more about programming frameworks later, but in practice you might not end up needing to implement all these details yourself; it's still worth knowing how it works so that you can get a better understanding of what your code is doing. But implementing Batch Norm is often one line of code in the deep learning frameworks. Now, so far, we've talked about Batch Norm as if you were training on your entire training set at a time, as if you were using batch gradient descent. In practice, Batch Norm is usually applied with mini-batches of your training set. So the way you actually apply Batch Norm is you take your first mini-batch and compute Z1, same as we did on the previous slide, using the parameters W1, B1. And then you take just this mini-batch and compute the mean and variance of the Z1's on just this mini-batch, and then Batch Norm would subtract the mean and divide by the standard deviation and then re-scale by Beta 1, Gamma 1, to give you Z tilde 1, and all this is on the first mini-batch. Then you apply the activation function to get A1, and then you compute Z2 using W2, B2, and so on. So you do all this in order to perform one step of gradient descent on the first mini-batch, and then you go to the second mini-batch X2, and you do something similar, where you will now compute Z1 on the second mini-batch and then use Batch Norm to compute Z tilde 1. And here in this Batch Norm step, you would be normalizing using just the data in your second mini-batch; so the Batch Norm step here looks at the examples in your second mini-batch, computing the mean and variances of the Z1's on just that mini-batch, and re-scaling by Beta and Gamma to get Z tilde, and so on. And you do this with a third mini-batch, and keep training. Now, there's one detail to the parameterization that I want to clean up, which is that previously, I said that the parameters were WL, BL for each layer, as well as Beta L and Gamma L. Now notice that the way Z was computed is as follows: Z[l] = W[l] a[l-1] + b[l]. But what Batch Norm does is it looks at the mini-batch and normalizes Z[l] to have mean 0 and unit variance, and then rescales by Beta and Gamma. But what that means is that, whatever the value of b[l] is, it's actually going to just get subtracted out, because during that Batch Normalization step, you are going to compute the means of the Z[l]'s and subtract the mean. And so adding any constant to all of the examples in the mini-batch doesn't change anything, because any constant you add will get cancelled out by the mean subtraction step. So, if you're using Batch Norm, you can actually eliminate that parameter, or if you want, think of it as setting it permanently to 0. So then the parameterization becomes Z[l] = W[l] a[l-1]. You then compute Z[l] normalized, and compute Z tilde [l] = Gamma[l] Z[l]_norm + Beta[l]. So you end up using this parameter Beta[l] in order to decide what the mean of Z tilde [l] should be in this layer.
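To check the claim just made, that any constant b[l] added across the mini-batch gets cancelled by the mean-subtraction step, here is a small NumPy verification sketch (my illustration, not the lecture's code):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 3))        # weights for a layer with 4 units and 3 inputs
b = rng.standard_normal((4, 1))        # a bias we claim is redundant under batch norm
A_prev = rng.standard_normal((3, 8))   # previous-layer activations, mini-batch of 8

def normalize(Z, eps=1e-8):
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    return (Z - mu) / np.sqrt(var + eps)

Z_with_bias = W @ A_prev + b
Z_without_bias = W @ A_prev

# The normalized values are identical: the per-unit constant b is removed
# when the per-unit mean is subtracted.
print(np.allclose(normalize(Z_with_bias), normalize(Z_without_bias)))  # True
```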
So just to recap, because Batch Norm zeroes out the mean of these Z[l] values in the layer, there's no point having the parameter b[l], and so you can just get rid of it; it's instead sort of replaced by Beta[l], which is the parameter that ends up affecting the shift or the bias term. Finally, remember that the dimension of Z[l], if you're doing this on one example, is going to be n[l] by 1, and so b[l] had dimension n[l] by 1, if n[l] is the number of hidden units in layer l. And so the dimension of Beta[l] and Gamma[l] is also going to be n[l] by 1, because that's the number of hidden units you have. You have n[l] hidden units, and so Beta[l] and Gamma[l] are used to scale the mean and variance of each of the hidden units to whatever the network wants to set them to. So, let's pull it all together and describe how you can implement gradient descent using Batch Norm. Assuming you're using mini-batch gradient descent, you iterate, for t = 1 to the number of mini-batches. You would implement forward prop on mini-batch X{t}, and in doing forward prop, in each hidden layer use Batch Norm to replace Z[l] with Z tilde [l]. This ensures that within that mini-batch, the values Z end up with a normalized mean and variance, and the normalized version is Z tilde [l]. And then, you use back prop to compute dW and db for all the values of l, as well as dBeta and dGamma. Although technically, since you've gotten rid of b, db actually now goes away. And then finally, you update the parameters. So, W gets updated as W minus the learning rate times dW, as usual; Beta gets updated as Beta minus the learning rate times dBeta, and similarly for Gamma. And if you have computed the gradients as follows, you could use gradient descent. That's what I've written down here, but this also works with gradient descent with momentum, or RMSprop, or Adam, where instead of taking this gradient descent update, you could use the updates given by these other algorithms as we discussed in the previous week's videos. Some of these other optimization algorithms can also be used to update the parameters Beta and Gamma that Batch Norm added to the algorithm. So, I hope that gives you a sense of how you could implement Batch Norm from scratch if you wanted to. If you're using one of the deep learning programming frameworks, which we will talk more about later, hopefully you can just call someone else's implementation in the programming framework, which will make using Batch Norm much easier. Now, in case Batch Norm still seems a little bit mysterious, if you're still not quite sure why it speeds up training so dramatically, let's go to the next video and talk more about why Batch Norm really works and what it is really doing.
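As the video says, in practice you would usually call a framework's implementation rather than write this yourself. As one possible sketch (my example; the exact function referenced on the slide isn't specified, and this assumes a recent TensorFlow/Keras API), a batch norm layer added to a small model looks like this:

```python
import tensorflow as tf

# A small network where batch norm is applied to the pre-activations of a hidden layer.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, use_bias=False),   # b[l] can be dropped when batch norm follows
    tf.keras.layers.BatchNormalization(),         # learns gamma (scale) and beta (offset)
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```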
Why does Batch Norm work? - 11m
0:00
So, why does batch norm work? Here's one reason: you've seen how normalizing the input features, the X's, to mean zero and variance one can speed up learning. So rather than having some features that range from zero to one, and some from one to 1,000, by normalizing all the input features X to take on a similar range of values, you can speed up learning. So, one intuition behind why batch norm works is that it is doing a similar thing, but for the values in your hidden units and not just for your input features. Now, this is just a partial picture of what batch norm is doing. There are a couple of further intuitions that will help you gain a deeper understanding of what batch norm is doing. Let's take a look at those in this video. A second reason why batch norm works is that it makes weights later or deeper in your network, say the weights in layer 10, more robust to changes to weights in earlier layers of the neural network, say, in layer one. To explain what I mean, let's look at a vivid example. Let's say you're training a network, maybe a shallow network like logistic regression, or maybe a deep network, on our famous cat detection task. But let's say that you've trained your network on a data set of all images of black cats. If you now try to apply this network to data with colored cats, where the positive examples are not just black cats like on the left, but also colored cats like on the right, then your classifier might not do very well. So in pictures, if your training set looks like this, where you have positive examples here and negative examples here, but you were to try to generalize it to a data set where maybe the positive examples are here and the negative examples are here, then you might not expect a model trained on the data on the left to do very well on the data on the right. Even though there might be one function that actually works well for both, you wouldn't expect your learning algorithm to discover that green decision boundary just by looking at the data on the left. So, this idea of your data distribution changing goes by the somewhat fancy name covariate shift. And the idea is that, if you've learned some X to Y mapping, and the distribution of X changes, then you might need to retrain your learning algorithm. And this is true even if the ground truth function mapping from X to Y remains unchanged, which it does in this example, because the ground truth function is "is this picture a cat or not". And the need to retrain your learning algorithm becomes even more acute, or it becomes even worse, if the ground truth function shifts as well. So, how does this problem of covariate shift apply to a neural network? Consider a deep network like this, and let's look at the learning process from the perspective of a certain layer, the third hidden layer. So this network has learned the parameters W3 and B3. And from the perspective of the third hidden layer, it gets some set of values from the earlier layers, and then it has to do some stuff to hopefully make the output Y-hat close to the ground truth value Y. So let me cover up the nodes on the left for a second. So from the perspective of this third hidden layer, it gets some values, let's call them A_2_1, A_2_2, A_2_3, and A_2_4. But these values might as well be features X1, X2, X3, X4, and the job of the third hidden layer is to take these values and find a way to map them to Y-hat.
So you can imagine doing gradient descent, so that these parameters W_3, B_3, as well as maybe W_4, B_4, and even W_5, B_5, are learned so that the network does a good job mapping from the values I drew in black on the left to the output values Y-hat. But now let's uncover the left of the network again. The network is also adapting the parameters W_2, B_2 and W_1, B_1, and so as these parameters change, these values, A_2, will also change. So from the perspective of the third hidden layer, these hidden unit values are changing all the time, and so it's suffering from the problem of covariate shift that we talked about on the previous slide. So what batch norm does is it reduces the amount that the distribution of these hidden unit values shifts around. And if we were to plot the distribution of these hidden unit values, maybe this is technically the normalized Z, so this is actually Z_2_1 and Z_2_2, and I'll also plot two values instead of four values, so we can visualize in 2D. What batch norm is saying is that the values of Z_2_1 and Z_2_2 can change, and indeed they will change when the neural network updates the parameters in the earlier layers. But what batch norm ensures is that no matter how they change, the mean and variance of Z_2_1 and Z_2_2 will remain the same. So even if the exact values of Z_2_1 and Z_2_2 change, their mean and variance will at least stay the same, say mean zero and variance one. Or, not necessarily mean zero and variance one, but whatever values are governed by Beta 2 and Gamma 2. Which, if the neural network chooses, can force them to be mean zero and variance one, or really any other mean and variance. But what this does is it limits the amount to which updating the parameters in the earlier layers can affect the distribution of values that the third layer now sees and therefore has to learn on. And so, batch norm reduces the problem of the input values changing; it really causes these values to become more stable, so that the later layers of the neural network have firmer ground to stand on. And even though the input distribution changes a bit, it changes less, and what this does is, even as the earlier layers keep learning, the amount that this forces the later layers to adapt as the earlier layers change is reduced or, if you will, it weakens the coupling between what the early layers' parameters have to do and what the later layers' parameters have to do. And so it allows each layer of the network to learn by itself, a little bit more independently of other layers, and this has the effect of speeding up learning in the whole network. So I hope this gives some better intuition, but the takeaway is that batch norm means that, especially from the perspective of one of the later layers of the neural network, the earlier layers don't get to shift around as much, because they're constrained to have the same mean and variance. And so this makes the job of learning in the later layers easier. It turns out batch norm has a second effect: it has a slight regularization effect. So one non-intuitive thing about batch norm is that each mini-batch, let's say mini-batch X_t, has its values Z_l scaled by the mean and variance computed on just that one mini-batch. Now, because the mean and variance are computed on just that mini-batch, as opposed to on the entire data set, that mean and variance have a little bit of noise in them, because they're computed just on your mini-batch of, say, 64, or 128, or maybe 256 or more training examples.
So because the mean and variance are a little bit noisy, because they're estimated with just a relatively small sample of data, the scaling process, going from Z_l to Z tilde_l, is a little bit noisy as well, because it's computed using a slightly noisy mean and variance. So similar to dropout, it adds some noise to each hidden layer's activations. The way dropout adds noise is that it takes a hidden unit and multiplies it by zero with some probability, and multiplies it by one with some probability. So dropout has multiplicative noise because it multiplies by zero or one, whereas batch norm has multiplicative noise because of the scaling by the standard deviation, as well as additive noise because it's subtracting the mean. Here, the estimates of the mean and the standard deviation are noisy. And so, similar to dropout, batch norm therefore has a slight regularization effect, because by adding noise to the hidden units, it forces the downstream hidden units not to rely too much on any one hidden unit. And so similar to dropout, it adds noise to the hidden layers and therefore has a very slight regularization effect. Because the noise added is quite small, this is not a huge regularization effect, and you might choose to use batch norm together with dropout if you want the more powerful regularization effect of dropout. And maybe one other slightly non-intuitive effect is that, if you use a bigger mini-batch size, say a mini-batch size of 512 instead of 64, then by using the larger mini-batch size you're reducing this noise and therefore also reducing this regularization effect. So that's one strange property of batch norm, which is that by using a bigger mini-batch size, you reduce the regularization effect. Having said this, I wouldn't really use batch norm as a regularizer; that's really not the intent of batch norm, but sometimes it has this extra, unintended effect on your learning algorithm. But, really, don't turn to batch norm as a regularizer. Use it as a way to normalize your hidden unit activations and therefore speed up learning, and I think the regularization is an almost unintended side effect. So I hope that gives you better intuition about what batch norm is doing. Before we wrap up the discussion on batch norm, there's one more detail I want to make sure you know, which is that batch norm handles data one mini-batch at a time. It computes means and variances on mini-batches. So at test time, when you're trying to make predictions, trying to evaluate the neural network, you might not have a mini-batch of examples; you might be processing one single example at a time. So, at test time you need to do something slightly differently to make sure your predictions make sense. So in the next and final video on batch norm, let's talk over the details of what you need to do in order to take your neural network trained using batch norm and make predictions with it.
Batch Norm at test time - 5m
0:00
Batch norm processes your data one mini-batch at a time, but at test time you might need to process the examples one at a time. Let's see how you can adapt your network to do that. Recall that during training, here are the equations you'd use to implement batch norm. Within a single mini-batch, you'd sum over that mini-batch of z(i) values to compute the mean. So here, you're just summing over the examples in one mini-batch. I'm using M to denote the number of examples in the mini-batch, not in the whole training set. Then you compute the variance, and then you compute Z norm by scaling by the mean and standard deviation, with Epsilon added for numerical stability. And then Z tilde is taking Z norm and rescaling by gamma and beta. So, notice that mu and sigma squared, which you need for this scaling calculation, are computed on the entire mini-batch. But at test time you might not have a mini-batch of 64, 128, or 256 examples to process at the same time. So you need some different way of coming up with mu and sigma squared. And if you have just one example, taking the mean and variance of that one example doesn't make sense. So what's actually done, in order to apply your neural network at test time, is to come up with some separate estimate of mu and sigma squared. And in typical implementations of batch norm, what you do is estimate this using an exponentially weighted average, where the average is across the mini-batches. So, to be very concrete, here's what I mean. Let's pick some layer L, and let's say you're going through mini-batches X1, X2, together with the corresponding values of Y, and so on. So, when training on X1 for that layer L, you get some mu L. And in fact, I'm going to write this as mu for the first mini-batch and that layer. And then when you train on the second mini-batch, for that layer and that mini-batch, you end up with some second value of mu. And then when you train on the third mini-batch in this hidden layer, you end up with some third value for mu. So just as we saw how to use an exponentially weighted average to compute the mean of Theta one, Theta two, Theta three when you were computing an exponentially weighted average of the current temperature, you would do that here to keep track of the latest average value of this mean vector you've seen. So that exponentially weighted average becomes your estimate for what the mean of the Z's is for that hidden layer, and similarly, you use an exponentially weighted average to keep track of these values of sigma squared that you see on the first mini-batch in that layer, the sigma squared that you see on the second mini-batch, and so on. So you keep a running average of the mu and the sigma squared that you're seeing for each layer as you train the neural network across different mini-batches. Then finally, at test time, what you do is, in place of this equation, you would just compute Z norm using whatever value your Z has, using your exponentially weighted average of mu and sigma squared, whatever the latest values were, to do the scaling here. And then you would compute Z tilde on your one test example using that Z norm that we just computed on the left, and using the beta and gamma parameters that you learned during your neural network training process. So the takeaway from this is that during training time, mu and sigma squared are computed on an entire mini-batch of, say, 64 or 128 or some number of examples. But at test time, you might need to process a single example at a time.
So, the way to do that is to estimate mu and sigma squared from your training set, and there are many ways to do that. You could, in theory, run your whole training set through your final trained network to get mu and sigma squared. But in practice, what people usually do is implement an exponentially weighted average, where you just keep track of the mu and sigma squared values you're seeing during training, and use an exponentially weighted average, also sometimes called the running average, to just get a rough estimate of mu and sigma squared, and then you use those values of mu and sigma squared at test time to do the scaling you need of the hidden unit values Z. In practice, this process is pretty robust to the exact way you estimate mu and sigma squared. So I wouldn't worry too much about exactly how you do this, and if you're using a deep learning framework, it'll usually have some default way to estimate mu and sigma squared that should work reasonably well as well. But in practice, any reasonable way to estimate the mean and variance of your hidden unit values Z should work fine at test time. So, that's it for batch norm, and using it, I think you'll be able to train much deeper networks and get your learning algorithm to run much more quickly. Before we wrap up for this week, I want to share with you some thoughts on deep learning frameworks as well. Let's start to talk about that in the next video.
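As a rough sketch of what a typical implementation does, here are the two pieces in NumPy: updating the running statistics during training, and using them at test time. The momentum value 0.9 is just an illustrative choice, not something prescribed in the lecture:

```python
import numpy as np

def update_running_stats(running_mu, running_var, mu_batch, var_batch, momentum=0.9):
    """Exponentially weighted ('running') averages of the mini-batch statistics."""
    running_mu = momentum * running_mu + (1 - momentum) * mu_batch
    running_var = momentum * running_var + (1 - momentum) * var_batch
    return running_mu, running_var

def batch_norm_test_time(z, running_mu, running_var, gamma, beta, eps=1e-8):
    """At test time, normalize a single example using the stored running statistics."""
    z_norm = (z - running_mu) / np.sqrt(running_var + eps)
    return gamma * z_norm + beta

# During training, after computing mu_batch and var_batch on each mini-batch for a layer,
# you would call update_running_stats; at test time you call batch_norm_test_time instead.
```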
Softmax Regression - 11m
0:00
So far, the classification examples we've talked about have used binary classification, where you had two possible labels, 0 or 1. Is it a cat, is it not a cat? What if we have multiple possible classes? There's a generalization of logistic regression called Softmax regression that lets you make predictions where you're trying to recognize one of C, or one of multiple, classes, rather than just recognize two classes. Let's take a look. Let's say that instead of just recognizing cats, you want to recognize cats, dogs, and baby chicks. So I'm going to call cats class 1, dogs class 2, baby chicks class 3. And if none of the above, then there's an "other" or "none of the above" class, which I'm going to call class 0. So here's an example of the images and the classes they belong to. That's a picture of a baby chick, so the class is 3. Cat is class 1, dog is class 2, I guess that's a koala, so that's none of the above, so that is class 0, class 3, and so on. So the notation we're going to use is, I'm going to use capital C to denote the number of classes you're trying to categorize your inputs into. And in this case, you have four possible classes, including the "other" or the "none of the above" class. So when you have four classes, the numbers indexing your classes would be 0 through capital C minus one. So in other words, that would be zero, one, two, or three. In this case, we're going to build a neural network where the output layer has four, or in this case the variable capital C, output units.
1:43
So n[L], the number of units in the output layer, which is layer L, is going to equal 4, or in general this is going to equal C. And what we want is for the units in the output layer to tell us the probability of each of these four classes. So the first node here is supposed to output, or we want it to output, the probability of the "other" class, given the input x. The next will output the probability that it's a cat, given x. This one will output the probability that it's a dog, given x. And that one will output the probability that it's a baby chick (which I'm going to abbreviate to baby C), given the input x.
2:29
So here, the output label y hat is going to be a 4 by 1 dimensional vector, because it now has to output four numbers, giving you these four probabilities.
2:42
And because probabilities should sum to one, the four numbers in the output y hat, they should sum to one.
2:50
The standard model for getting your network to do this uses what's called a Softmax layer as the output layer in order to generate these outputs. Let me write down the math, then you can come back and get some intuition about what the Softmax layer is doing.
3:06
So in the final layer of the neural network, you are going to compute, as usual, the linear part of the layer. So z, capital L, that's the z variable for the final layer. So remember this is layer capital L. So as usual you compute that as wL times the activation of the previous layer plus the biases for that final layer. Now having computed z, you need to apply what's called the Softmax activation function.
3:38
So that activation function is a bit unusual for the Softmax layer, but this is what it does.
3:45
First, we're going to compute a temporary variable, which we're going to call t, which is e to the z L. So this is applied element-wise. So zL here, in our example, is going to be four by one; this is a four dimensional vector. So t itself, e to the zL, that's an element-wise exponentiation. t will also be a 4 by 1 dimensional vector. Then the output aL is going to be basically the vector t, normalized to sum to 1. So aL is going to be e to the zL divided by the sum from j equal 1 through 4 of the t_j's, because we have four classes. So in other words, we're saying that aL is also a four by one vector, and the i-th element of this four dimensional vector, let's write that, aL subscript i, is going to be equal to t_i over the sum of the t_j's, okay? In case this math isn't clear, we'll do an example in a minute that will make this clearer. So in case this math isn't clear, let's go through a specific example that will make this clearer. Let's say that you compute zL, and zL is a four dimensional vector; let's say it is 5, 2, -1, 3. What we're going to do is use this element-wise exponentiation to compute this vector t. So t is going to be e to the 5, e to the 2, e to the -1, e to the 3. And if you plug that into the calculator, these are the values you get. E to the 5 is 148.4, e squared is about 7.4, e to the -1 is 0.4, and e cubed is 20.1. And so, the way we go from the vector t to the vector aL is just to normalize these entries to sum to one. So if you sum up the elements of t, if you just add up those 4 numbers, you get 176.3. So finally, aL is just going to be this vector t, as a vector, divided by 176.3. So for example, this first node here will output e to the 5 divided by 176.3, and that turns out to be 0.842. So for this image, if this is the value of z you get, the chance of it being class zero is 84.2%. And then the next node outputs e squared over 176.3, which turns out to be 0.042, so this is a 4.2% chance. The next one is e to the -1 over that, which is 0.002. And the final one is e cubed over that, which is 0.114. So there is an 11.4% chance that this is class number three, which is the baby C class, right? So these are the chances of it being class zero, class one, class two, and class three. So the output of the neural network aL, this is also y hat, is a 4 by 1 vector where the elements of this 4 by 1 vector are these four numbers that we just computed. So this algorithm takes the vector zL and maps it to four probabilities that sum to 1. And if we summarize what we just did to map from zL to aL, this whole computation of exponentiating to get this temporary variable t and then normalizing, we can summarize this into a Softmax activation function and say aL equals the activation function g applied to the vector zL. The unusual thing about this particular activation function is that this activation function g takes as input a 4 by 1 vector and it outputs a 4 by 1 vector. So previously, our activation functions used to take in a single real number as input. So for example, the sigmoid and the ReLU activation functions take in a real number and output a real number. The unusual thing about the Softmax activation function is, because it needs to normalize across the different possible outputs, it takes a vector as input and outputs a vector. So to show some of the things that a Softmax classifier can represent, I'm going to show you some examples where you have inputs x1, x2.
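Here is a small NumPy sketch of that computation, reproducing the worked example above (the values printed match the numbers in the lecture to three decimal places):

```python
import numpy as np

def softmax(z):
    t = np.exp(z)              # element-wise exponentiation
    return t / np.sum(t)       # normalize so the entries sum to 1
    # (In practice you would often subtract max(z) before exponentiating for numerical stability.)

z_L = np.array([5.0, 2.0, -1.0, 3.0])
print(softmax(z_L))            # roughly [0.842, 0.042, 0.002, 0.114]
```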
And these feed directly to a Softmax layer that has three or four, or more, output nodes that then output y hat. So I'm going to show you a neural network with no hidden layer, and all it does is compute z1 equals w1 times the input x plus b. And then the output a1, or y hat, is just the Softmax activation function applied to z1. So this neural network with no hidden layers should give you a sense of the types of things a Softmax function can represent. So here's one example with just raw inputs x1 and x2. A Softmax layer with C equals 3 output classes can represent these types of decision boundaries. Notice these are several linear decision boundaries, but this allows it to separate out the data into three classes. And in this diagram, what we did was take the training set that's shown in this figure and train a Softmax classifier with the output labels on the data. And then the color on this plot shows a thresholding of the output of the Softmax classifier, coloring in the input space based on which one of the three outputs has the highest probability. So we can kind of see that this is like a generalization of logistic regression with sort of linear decision boundaries, but with more than two classes: instead of the class being 0 or 1, the class can be 0, 1, or 2. Here's another example of the decision boundaries that a Softmax classifier represents when trained on a data set with three classes. And here's another one, right? One intuition is that the decision boundary between any two classes will be linear. That's why you see, for example, that the decision boundary between the yellow and the red classes is a linear boundary, the boundary between the purple and red is linear, and the boundary between the purple and yellow is another linear decision boundary. But it's able to use these different linear functions in order to separate the space into three classes. Let's look at some examples with more classes. So here's an example with C equals 4, so the green class, and Softmax can continue to represent these types of linear decision boundaries between multiple classes. Here's one more example with C equals 5 classes, and here's one last example with C equals 6. So this shows the types of things the Softmax classifier can do when there is no hidden layer. Of course, a much deeper neural network, with x and then some hidden units, and then more hidden units, and so on, can learn even more complex non-linear decision boundaries to separate out multiple different classes.
11:35
So I hope this gives you a sense of what a Softmax layer or the Softmax activation function in the neural network can do. In the next video, let‘s take a look at how you can train a neural network that uses a Softmax layer.
Training a softmax classifier - 10m
0:00
In the last video, you learned about the softmax layer and the softmax activation function. In this video, you'll deepen your understanding of softmax classification, and also learn how to train a model that uses a softmax layer. Recall our earlier example where the output layer computes z[L] as follows. So we have four classes, C = 4; then z[L] is a (4,1) dimensional vector, and we said we compute t, which is this temporary variable that performs element-wise exponentiation. And then finally, if the activation function for your output layer, g[L], is the softmax activation function, then your outputs will be this. It's basically taking the temporary variable t and normalizing it to sum to 1. So this then becomes a[L]. So you notice that in the z vector, the biggest element was 5, and the biggest probability ends up being this first probability. The name softmax comes from contrasting it to what's called a hard max, which would have taken the vector Z and mapped it to this vector. So the hard max function will look at the elements of Z and just put a 1 in the position of the biggest element of Z and then 0s everywhere else. And so this is a very hard max, where the biggest element gets an output of 1 and everything else gets an output of 0. Whereas in contrast, a softmax is a more gentle mapping from Z to these probabilities. So, I'm not sure if this is a great name, but at least that was the intuition behind why we call it a softmax, all this in contrast to the hard max.
1:43
And one thing I didn't really show but had alluded to is that softmax regression, or the softmax activation function, generalizes the logistic activation function to C classes rather than just two classes. And it turns out that if C = 2, then softmax with C = 2 essentially reduces to logistic regression. And I'm not going to prove this in this video, but the rough outline for the proof is that if C = 2 and you apply softmax, then the output layer, a[L], will output two numbers, so maybe it outputs 0.842 and 0.158, right? And these two numbers always have to sum to 1. And because these two numbers always have to sum to 1, they're actually redundant. And maybe you don't need to bother to compute two of them, maybe you just need to compute one of them. And it turns out that the way you end up computing that number reduces to the way that logistic regression is computing its single output. So that wasn't much of a proof, but the takeaway from this is that softmax regression is a generalization of logistic regression to more than two classes. Now let's look at how you would actually train a neural network with a softmax output layer. So in particular, let's define the loss function you use to train your neural network. Let's take an example. Let's say we have an example in your training set where the target output, the ground truth label, is 0 1 0 0. So from the example in the previous video, this means that this is an image of a cat, because it falls into class 1. And now let's say that your neural network is currently outputting y hat equal to a vector of probabilities that sum to 1: 0.3, 0.2, 0.1, 0.4 (you can check that that sums to 1), and this is going to be a[L]. So the neural network's not doing very well in this example, because this is actually a cat and it assigned only a 20% chance that this is a cat. So it didn't do very well in this example.
3:52
So what's the loss function you would want to use to train this neural network? In softmax classification, the loss we typically use is the negative sum from j = 1 through 4, and it's really the sum from 1 to C in the general case, we're going to just use 4 here, of yj times log of y hat j. So let's look at our single example above to better understand what happens. Notice that in this example, y1 = y3 = y4 = 0, because those are 0s, and only y2 = 1. So if you look at this summation, all of the terms with 0 values of yj are equal to 0, and the only term you're left with is -y2 log y hat 2, because when we sum over the indices j, all the terms end up 0 except when j is equal to 2. And because y2 = 1, this is just -log y hat 2. So what this means is that, if your learning algorithm is trying to make this small, because you use gradient descent to try to reduce the loss on your training set, then the only way to make the loss small is to make -log y hat 2 small. And the only way to do that is to make y hat 2 as big as possible.
5:18
And these are probabilities, so they can never be bigger than 1. But this kind of makes sense, because if x for this example is the picture of a cat, then you want that output probability to be as big as possible. So more generally, what this loss function does is it looks at whatever is the ground truth class in your training set, and it tries to make the corresponding probability of that class as high as possible. If you're familiar with maximum likelihood estimation in statistics, this turns out to be a form of maximum likelihood estimation. But if you don't know what that means, don't worry about it; the intuition we just talked about will suffice.
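Here is a small sketch of that loss on the single example above (the y hat values are the ones used in this example):

```python
import numpy as np

y = np.array([0, 1, 0, 0])                 # ground truth: class 1 (a cat)
y_hat = np.array([0.3, 0.2, 0.1, 0.4])     # network's predicted probabilities

loss = -np.sum(y * np.log(y_hat))          # reduces to -log(y_hat[1])
print(loss)                                # about 1.61; shrinking it pushes y_hat[1] toward 1
```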
5:54
Now this is the loss on a single training example. How about the cost J on the entire training set, that is, the cost of a setting of the parameters, of all the weights and biases? You define that as pretty much what you'd guess: the sum over your entire training set of the losses of your learning algorithm's predictions on your training samples. And so, what you do is use gradient descent in order to try to minimize this cost. Finally, one more implementation detail. Notice that because C is equal to 4, y is a 4 by 1 vector, and y hat is also a 4 by 1 vector.
6:34
So if you're using a vectorized implementation, the matrix capital Y is going to be y(1), y(2), through y(m), stacked horizontally. And so for example, if this example up here is your first training example, then the first column of this matrix Y will be 0 1 0 0, and then maybe the second example is a dog, maybe the third example is a none of the above, and so on. And then this matrix Y will end up being a 4 by m dimensional matrix. And similarly, Y hat will be y hat(1) stacked up horizontally going through y hat(m), so this is actually y hat(1).
7:19
If that is the output on the first training example, then this first column of Y hat will be 0.3, 0.2, 0.1, and 0.4, and so on. And Y hat itself will also be a 4 by m dimensional matrix. Finally, let's take a look at how you'd implement gradient descent when you have a softmax output layer. So this output layer will compute z[L], which is C by 1, in our example 4 by 1, and then you apply the softmax activation function to get a[L], or y hat.
7:53
And then that in turn allows you to compute the loss. So we've talked about how to implement the forward propagation step of a neural network to get these outputs and to compute that loss. How about the back propagation step, or gradient descent? It turns out that the key step, or the key equation you need to initialize back prop, is this expression: the derivative with respect to z at the last layer turns out to be y hat, the 4 by 1 vector, minus y, the 4 by 1 vector. So you notice that all of these are going to be 4 by 1 vectors when you have 4 classes, and C by 1 in the more general case.
8:34
And so, going by our usual definition of what dz is, this is the partial derivative of the cost function with respect to z[L]. If you are an expert in calculus, you can try to derive this yourself, but using this formula will also just work fine if you have a need to implement this from scratch. With this, you can then compute dz[L] and then start off the back prop process to compute all the derivatives you need throughout your neural network. But it turns out that in this week's programming exercise, we'll start to use one of the deep learning programming frameworks, and for those programming frameworks, it usually turns out you just need to focus on getting the forward prop right. And so long as you specify the forward prop pass, the programming framework will figure out how to do back prop, how to do the backward pass, for you.
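A small sketch of that initialization of back prop, using a hypothetical mini-batch of 3 examples stacked as columns, as in the vectorized Y and Y hat described above:

```python
import numpy as np

# Hypothetical mini-batch of m = 3 examples with C = 4 classes, stacked as columns.
Y = np.array([[0, 0, 1],
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 0]])                  # one-hot ground truth labels
Y_hat = np.array([[0.3, 0.5, 0.6],
                  [0.2, 0.1, 0.2],
                  [0.1, 0.3, 0.1],
                  [0.4, 0.1, 0.1]])        # softmax outputs (each column sums to 1)

dZ_L = Y_hat - Y   # derivative of the cost with respect to z[L]; this kicks off back prop
```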
9:27
So this expression is worth keeping in mind in case you ever need to implement softmax regression, or softmax classification, from scratch. Although you won't actually need it in this week's programming exercise, because the programming framework you use will take care of this derivative computation for you. So that's it for softmax classification; with it you can now implement learning algorithms to categorize inputs into not just one of two classes, but one of C different classes. Next, I want to show you some of the deep learning programming frameworks, which can make you much more efficient in terms of implementing deep learning algorithms. Let's go on to the next video to discuss that.
Deep learning frameworks - 4m
0:00
You've learned to implement deep learning algorithms more or less from scratch using Python and NumPy. And I'm glad you did that, because I wanted you to understand what these deep learning algorithms are really doing. But you'll find that, once you implement more complex models, such as convolutional neural networks or recurrent neural networks, or as you start to implement very large models, it is increasingly not practical, at least for most people, to implement everything yourself from scratch. Fortunately, there are now many good deep learning software frameworks that can help you implement these models. To make an analogy, I think that hopefully you understand how to do a matrix multiplication, and you should be able to code up the multiplication of two matrices yourself. But as you build very large applications, you'll probably not want to implement your own matrix multiplication function, but instead you'll want to call a numerical linear algebra library that can do it more efficiently for you. But it still helps that you understand how multiplying two matrices works. So I think deep learning has now matured to the point where it's actually more practical, and you'll be more efficient, doing some things with some of the deep learning frameworks. So let's take a look at the frameworks out there. Today, there are many deep learning frameworks that make it easy for you to implement neural networks, and here are some of the leading ones. Each of these frameworks has a dedicated user and developer community, and I think each of these frameworks is a credible choice for some subset of applications. There are a lot of people writing articles comparing these deep learning frameworks and how well these deep learning frameworks compare. And because these frameworks are often evolving and getting better month to month, I'll leave you to do a few internet searches yourself if you want to see the arguments on the pros and cons of some of these frameworks. But I think many of these frameworks are evolving and getting better very rapidly. So rather than too strongly endorsing any of these frameworks, I want to share with you the criteria I would recommend you use to choose frameworks. One important criterion is the ease of programming, and that means both developing the neural network and iterating on it, as well as deploying it for production, for actual use, by thousands or millions or maybe hundreds of millions of users, depending on what you're trying to do. A second important criterion is running speed, especially training on large data sets; some frameworks will let you run and train your neural network more efficiently than others. And then, one criterion that people don't often talk about, but I think is important, is whether or not the framework is truly open. And for a framework to be truly open, it needs not only to be open source, but I think it needs good governance as well. Unfortunately, in the software industry some companies have a history of open sourcing software but maintaining single-corporation control of the software. And then over some number of years, as people start to use the software, some companies have a history of gradually closing off what was open source, or perhaps moving functionality into their own proprietary cloud services.
So one thing I pay a bit of attention to is how much you trust that the framework will remain open source for a long time, rather than just being under the control of a single company, which, for whatever reason, may choose to close it off in the future, even if the software is currently released under open source. But at least in the short term, depending on your preferences of language, whether you prefer Python or Java or C++ or something else, and depending on what application you're working on, whether it's computer vision or natural language processing or online advertising or something else, I think multiple of these frameworks could be a good choice. So that's it on programming frameworks: by providing a higher level of abstraction than just a numerical linear algebra library, any of these programming frameworks can make you more efficient as you develop machine learning applications.
TensorFlow - 16m
0:00
Welcome to the last video for this week. There are many great deep learning programming frameworks. One of them is TensorFlow. I'm excited to help you start to learn to use TensorFlow. What I want to do in this video is show you the basic structure of a TensorFlow program, and then leave you to learn more details, and practice them yourself, in this week's programming exercise. This week's programming exercise will take some time to do, so please be sure to leave some extra time to do it. As a motivating problem, let's say that you have some cost function J that you want to minimize. And for this example, I'm going to use this very simple cost function J(w) = w squared - 10w + 25. So that's the cost function. You might notice that this function is actually (w - 5) squared. If you expand out this quadratic, you get the expression above, and so the value of w that minimizes this is w = 5. But let's say we didn't know that, and you just have this function. Let's see how you can implement something in TensorFlow to minimize this. Because a very similar structure of program can be used to train neural networks, where you can have some complicated cost function J(w, b) depending on all the parameters of your neural network. And then, similarly, you'll be able to use TensorFlow to automatically try to find values of w and b that minimize this cost function. But let's start with the simpler example on the left. So, I'm running Python in my Jupyter notebook, and to start up TensorFlow, you import numpy as np, and it's idiomatic to use import tensorflow as tf. Next, let me define the parameter w. So in TensorFlow, you're going to use tf.Variable to define a parameter.
2:01
w = tf.Variable(0, dtype=tf.float32).
2:08
And then let's define the cost function. So remember, the cost function was w squared - 10w + 25. So let me use tf.add. So I'm going to have w squared plus tf.multiply of -10 and w. And then I'm going to add 25, so let me put another tf.add over there. So that defines the cost J that we had. And then I'm going to write train = tf.train.GradientDescentOptimizer. Let's use a learning rate of 0.01, and the goal is to minimize the cost. And finally, the following few lines are quite idiomatic: init = tf.global_variables_initializer(), and then session = tf.Session(). So that starts a TensorFlow session. session.run(init) to initialize the global variables. And then, for TensorFlow to evaluate a variable, we're going to use session.run(w). We haven't done anything yet. So with the lines above, we've initialized w to zero and defined a cost function. We defined train to be our learning algorithm, which uses a GradientDescentOptimizer to minimize the cost function. But we haven't actually run the learning algorithm yet, so with session.run we evaluate w, and let me print(session.run(w)). So if we run that, it evaluates w to be equal to 0, because we haven't run anything yet. Now, let's do session.run(train). So what this will do is run one step of gradient descent. And then let's evaluate the value of w after one step of gradient descent and print that. So after that one step of gradient descent, w is now 0.1. Let's now run 1000 iterations of gradient descent, so .run(train).
4:35
And let's then print(session.run(w)). So this would run 1,000 iterations of gradient descent, and at the end w ends up being 4.9999. Remember, we said that we're minimizing (w - 5) squared, so the optimal value of w is 5, and it got very close to this. So I hope this gives you a sense of the broad structure of a TensorFlow program. And as you do the following exercise and play with more TensorFlow code yourself, some of these functions that I'm using here will become more familiar. Some things to notice about this: w is the parameter we are optimizing, so we're going to declare it as a variable. And notice that all we had to do was define a cost function using these add and multiply and so on functions. And TensorFlow knows automatically how to take derivatives with respect to the add and multiply, as well as other functions. Which is why you only had to implement basically the forward prop, and it can figure out how to do the backward pass, or the gradient computation, because that's already built into the add and multiply as well as the squaring functions. By the way, in case this notation seems really ugly, TensorFlow has actually overloaded the computation for the usual plus, minus, and so on. So you could also just write this nicer format for the cost, comment that out, rerun this, and get the same result. So once w is declared to be a TensorFlow variable, the squaring, multiplication, adding, and subtraction operations are overloaded, so you don't need to use the uglier syntax that I had above. Now, there's just one more feature of TensorFlow that I want to show you, which is that this example minimized a fixed function of w. Often the function you want to minimize is a function of your training set. So when you have some training data x, and when you're training a neural network, the training data x can change. So how do you get training data into a TensorFlow program? So I'm going to define x, which you can think of as playing the role of the training data, or really the training data with both x and y, but we only have x in this example. So I'm just going to define x with tf.placeholder, and it's going to be of type float32, and let's make this a [3,1] array. And what I'm going to do is, whereas the cost here had fixed coefficients in front of the three terms in this quadratic, it was 1 times w squared - 10w + 25, we can turn these numbers 1, -10, and 25 into data. So what I'm going to do is replace the cost with cost = x[0][0]*w squared + x[1][0]*w + x[2][0]*1. So now x becomes sort of like data that controls the coefficients of this quadratic function. And this placeholder function tells TensorFlow that x is something that you will provide the values for later.
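Before moving on to the placeholder version, here is a minimal sketch pulling together the fixed-coefficient program dictated above; it assumes the TF 1.x-era API (tf.Session, tf.global_variables_initializer, tf.train.GradientDescentOptimizer) named in this lecture, and uses the overloaded-operator form of the cost:

```python
import numpy as np
import tensorflow as tf  # assumes TensorFlow 1.x, as in this lecture

w = tf.Variable(0, dtype=tf.float32)
cost = w**2 - 10*w + 25                  # operator-overloaded version of the cost
train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

init = tf.global_variables_initializer()
session = tf.Session()
session.run(init)
print(session.run(w))                    # 0.0: nothing has been trained yet

for _ in range(1000):
    session.run(train)                   # 1,000 steps of gradient descent
print(session.run(w))                    # roughly 4.9999, close to the optimum w = 5
```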
8:09
So let‘s define another array, coefficient = np.array,
8:19
[1.], [-10.], and the last value was [25.]. So that's going to be the data that we're going to plug into x.
8:32
So finally, we need a way to get this array of coefficients into the variable x, and the syntax to do that is, in the line doing the training step, to specify the values that will need to be provided for x. I'm going to set here feed_dict = {x: coefficients},
8:58
And I'm going to change this, I'm going to copy and paste that there as well. All right, hopefully I didn't have any syntax errors. Let's try re-running this, and we get the same results, hopefully, as before.
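Putting together the placeholder version just dictated, here is a minimal sketch (again assuming the TF 1.x-era API the lecture uses); the names coefficients and x mirror the ones in the lecture:

```python
import numpy as np
import tensorflow as tf  # assumes TensorFlow 1.x, as in this lecture

coefficients = np.array([[1.], [-10.], [25.]])   # data controlling the quadratic's coefficients

x = tf.placeholder(tf.float32, [3, 1])           # values supplied later via feed_dict
w = tf.Variable(0, dtype=tf.float32)
cost = x[0][0]*w**2 + x[1][0]*w + x[2][0]
train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

init = tf.global_variables_initializer()
with tf.Session() as session:
    session.run(init)
    for _ in range(1000):
        session.run(train, feed_dict={x: coefficients})
    print(session.run(w))   # about 5; with coefficients [[1.], [-20.], [100.]] it would be about 10
```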
9:14
And now, if you want to change the coefficients of this quadratic function, let's say you take this [-10.] and change it to [-20.], and let's change this to 100. So this is now the function (w - 10) squared. And if I re-run this, hopefully I find that the value that minimizes (w - 10) squared is w = 10. Let's see, cool, great, and we get w very close to 10 after running 1,000 iterations of gradient descent. So what you'll see more of when you do the programming exercise is that a placeholder in TensorFlow is a variable whose value you assign later. And this is a convenient way to get your training data into the cost function. And the way you get your data into the cost function is with this syntax: when you're running a training iteration, use the feed_dict to set x to be equal to the coefficients here. And if you are doing mini-batch gradient descent, where on each iteration you need to plug in a different mini-batch, then on different iterations you use the feed_dict to feed in different subsets of your training set, different mini-batches, into where your cost function is expecting to see data. So hopefully this gives you a sense of what TensorFlow can do. And the thing that makes this so powerful is that all you need to do is specify how to compute the cost function, and then it takes derivatives, and it can apply a gradient optimizer or an Adam optimizer or some other optimizer with just pretty much one or two lines of code. So here's the code again. I've cleaned this up just a little bit. And in case some of these functions or variables seem a little bit mysterious to use, they will become more familiar after you've practiced with them a couple of times by working through the programming exercise. Just one last thing I want to mention. These three lines of code are quite idiomatic in TensorFlow, and what some programmers will do is use this alternative format, which basically does the same thing: set session to tf.Session() to start the session, and then use the session to run init, and then use the session to evaluate, say, w, and then print the result. But this with construction is used in a number of TensorFlow programs as well. It more or less means the same thing as the thing on the left. But the with command in Python is a little bit better at cleaning up in case there's an error or exception while executing this inner loop. So you'll see this in the following exercise as well. So what is this code really doing? Let's focus on this equation. The heart of a TensorFlow program is something to compute a cost, and then TensorFlow automatically figures out the derivatives and how to minimize that cost. So what this equation, or what this line of code, is doing is allowing TensorFlow to construct a computation graph. And a computation graph does the following: it takes x[0][0], it takes w, and then w gets squared.
12:33
And then x[0][0] gets multiplied with w squared, so you have x[0][0]*w squared, and so on, right? And eventually this gets built up to compute x[0][0]*w squared + x[1][0]*w + and so on. And so eventually, you get the cost function. And so the last term to be added would be x[2][0], where it gets added to give the cost. I won't write out the other terms for the cost. And the nice thing about TensorFlow is that by implementing basically forward prop through this computation graph to compute the cost, TensorFlow already has built in all the necessary backward functions. So remember how training a deep neural network has a set of forward functions and a set of backward functions. Programming frameworks like TensorFlow have already built in the necessary backward functions, which is why, by using the built-in functions to compute the forward function, it can automatically do the backward functions as well, to implement back propagation through even very complicated functions and compute derivatives for you. So that's why you don't need to explicitly implement back prop. This is one of the things that makes the programming frameworks help you become really efficient. If you look at the TensorFlow documentation, I just want to point out that the TensorFlow documentation uses a slightly different notation than I did for drawing the computation graph. So it uses x[0][0] and w. And then, rather than writing the value, like w squared, the TensorFlow documentation tends to just write the operation. So this would be a squaring operation, and then these two get combined in the multiplication operation, and so on. And then, as a final node, I guess that would be an addition operation where you add x[2][0] to find the final value. So for the purposes of this class, I thought that this notation for the computation graph would be easier for you to understand. But if you look at the computation graphs in the TensorFlow documentation, you'll see this alternative convention, where the nodes are labeled with the operations rather than with the values. But both of these representations represent basically the same computation graph. And there are a lot of things that you can do with just one line of code in programming frameworks. For example, if you don't want to use gradient descent, but instead you want to use the Adam optimizer, by changing this line of code you can very quickly swap in a better optimization algorithm. So all the modern deep learning programming frameworks support things like this, and it makes it really easy for you to code up even pretty complex neural networks. So I hope this is helpful for giving you a sense of the typical structure of a TensorFlow program. To recap the material from this week, you saw how to systematically organize the hyperparameter search process. We also talked about batch normalization and how you can use that to speed up training of your neural networks. And finally, we talked about programming frameworks for deep learning. There are many great programming frameworks. And we had this last video focusing on TensorFlow. With that, I hope you enjoy this week's programming exercise, and that it helps you gain even more familiarity with these ideas.
Original article: https://www.cnblogs.com/keyshaw/p/10703262.html