实际梯度下降中的两个重要调节方面

Gradient Descent in Practice I - Feature Scaling（特征归一化）

调整处理X的范围，以提高梯度下降效果和减小迭代次数。

Note: [6:20 - The average size of a house is 1000 but 100 is accidentally written instead]

We can speed up gradient descent by having each of our input values in roughly the same range. This is because θ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.

The way to prevent this is to modify the ranges of our input variables so that they are all roughly the same. Ideally:

These aren‘t exact requirements; we are only trying to speed things up. The goal is to get all input variables into roughly one of these ranges, give or take a few.

Two techniques to help with this are feature scaling and mean normalization. Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1. Mean normalization involves subtracting the average value for an input variable from the values for that input variable resulting in a new average value for the input variable of just zero. To implement both of these techniques, adjust your input values as shown in this formula:

输入值减去输入值平均值除以输入值的范围

Where μi is the average of all the values for feature (i) and si is the range of values (max - min), or si is the standard deviation.

Note that dividing by the range, or dividing by the standard deviation, give different results. The quizzes in this course use range - the programming exercises use standard deviation.

For example, if xi represents housing prices with a range of 100 to 2000 and a mean value of 1000, then,

Gradient Descent in Practice II - Learning Rate

Note: [5:20 - the x -axis label in the right graph should be θ rather than No. of iterations ]

Debugging gradient descent. Make a plot with number of iterations on the x-axis. Now plot the cost function, J(θ) over the number of iterations of gradient descent. If J(θ) ever increases, then you probably need to decrease α.

Automatic convergence test. Declare convergence if J(θ) decreases by less than E in one iteration, where E is some small value such as 10?3. However in practice it‘s difficult to choose this threshold value.

It has been proven that if learning rate α is sufficiently small, then J(θ) will decrease on every iteration.

To summarize:

If α is too small: slow convergence.

If α is too large: ?may not decrease on every iteration and thus may not converge.

时间： 2024-12-11 16:28:39

实际梯度下降中的两个重要调节方面

Gradient Descent in Practice I - Feature Scaling（特征归一化）

Gradient Descent in Practice II - Learning Rate

实际梯度下降中的两个重要调节方面的相关文章

梯度下降中的学习率如何确定

Stanford大学机器学习公开课（二）：监督学习应用与梯度下降

《机器学习》第一周一元回归法 | 模型和代价函数，梯度下降

NN优化方法对比：梯度下降、随机梯度下降和批量梯度下降

机器学习笔记02：多元线性回归、梯度下降和Normal equation

第二集监督学习的应用梯度下降

机器学习-斯坦福：学习笔记2-监督学习应用与梯度下降

随机梯度下降（SGD）

ng机器学习视频笔记（十五） ——大数据机器学习(随机梯度下降与map reduce)