FITTING A MODEL VIA CLOSED-FORM EQUATIONS VS. GRADIENT DESCENT VS. STOCHASTIC GRADIENT DESCENT VS. MINI-BATCH LEARNING. WHAT IS THE DIFFERENCE?

In order to explain the differences between alternative approaches to estimating the parameters of a model, let's take a look at a concrete example: Ordinary Least Squares (OLS) Linear Regression. As a quick reminder, recall the different components of a simple linear regression model: the training samples, the target variable y, the fitted line, and the vertical offsets between them.

In Ordinary Least Squares (OLS) Linear Regression, our goal is to find the line (or hyperplane) that minimizes the vertical offsets. Or, in other words, we define the best-fitting line as the line that minimizes the sum of squared errors (SSE) or mean squared error (MSE) between our target variable (y) and our predicted output over all samples i in our dataset of size n.

Now, we can implement a linear regression model for performing ordinary least squares regression using one of the following approaches:

  • Solving for the model parameters analytically (closed-form equations)
  • Using an optimization algorithm (Gradient Descent, Stochastic Gradient Descent, Newton's Method, Simplex Method, etc.)

1) NORMAL EQUATIONS (CLOSED-FORM SOLUTION)

The closed-form solution may (should) be preferred for "smaller" datasets, where computing the (comparatively "costly") matrix inverse is not a concern. For very large datasets, or datasets where the inverse of XᵀX may not exist (the matrix is non-invertible or singular, e.g., in the case of perfect multicollinearity), the GD or SGD approaches are to be preferred. The linear function (linear regression model) is defined as:
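
    y = w_0 x_0 + w_1 x_1 + ... + w_m x_m = ∑_{j=0}^{m} w_j x_j = wᵀx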

where y is the response variable, x is an m-dimensional sample vector, and w is the weight vector (vector of coefficients). Note that w0 represents the y-axis intercept of the model and therefore x0=1. Using the closed-form solution (normal equation), we compute the weights of the model as follows:
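
    w = (XᵀX)⁻¹ Xᵀy

where X is the matrix of training samples (one row per sample, with the first column set to 1 for the intercept term) and y is the vector of the n target values.

As a minimal sketch of this closed-form solution (NumPy on synthetic data; the toy data, variable names, and coefficients are illustrative choices, not code from the original article):

    import numpy as np

    # Synthetic data: y = 2 + 3x + noise
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, size=100)
    y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=100)

    # Design matrix with a leading column of ones (x0 = 1) for the intercept
    X = np.column_stack([np.ones_like(x), x])

    # Normal equation: w = (X^T X)^(-1) X^T y
    # np.linalg.solve avoids forming the matrix inverse explicitly
    w = np.linalg.solve(X.T @ X, X.T @ y)
    print(w)  # approximately [2.0, 3.0]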

2) GRADIENT DESCENT (GD)

Using the Gradient Descent (GD) optimization algorithm, the weights are updated incrementally after each epoch (= one complete pass over the training dataset).

The cost function J(⋅), the sum of squared errors (SSE), can be written as:
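
    J(w) = (1/2) ∑_i (y^(i) − ŷ^(i))²

where ŷ^(i) = wᵀx^(i) is the predicted output for the i-th training sample (the factor 1/2 is a convention that merely simplifies the derivative).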

The magnitude and direction of the weight update are computed by taking a step in the opposite direction of the cost gradient:
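
    Δw = −η ∇J(w)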

where η is the learning rate. The weights are then updated after each epoch via the following update rule:
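
    w := w + Δw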

where Δw is a vector that contains the weight updates of each weight coefficient w, which are computed as follows:
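
    Δw_j = −η ∂J/∂w_j = η ∑_i (y^(i) − ŷ^(i)) x_j^(i)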

Essentially, we can picture GD optimization as a hiker (the weight coefficient) who wants to climb down a mountain (cost function) into a valley (cost minimum), where each step is determined by the steepness of the slope (gradient) and the leg length of the hiker (learning rate). For a cost function with only a single weight coefficient, this corresponds to walking down a one-dimensional, bowl-shaped cost curve toward its minimum.
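
To make one full GD epoch concrete, here is a minimal sketch of batch gradient descent for OLS regression (NumPy, synthetic data; the learning rate, number of epochs, and variable names are illustrative choices, not taken from the original article):

    import numpy as np

    # Synthetic data: y = 2 + 3x + noise
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, size=100)
    y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=100)
    X = np.column_stack([np.ones_like(x), x])  # x0 = 1 for the intercept

    eta = 0.01                 # learning rate
    w = np.zeros(X.shape[1])   # initialize the weight vector

    for epoch in range(500):
        y_hat = X @ w                 # predictions for ALL n training samples
        errors = y - y_hat
        grad = -X.T @ errors          # gradient of the SSE cost J(w)
        w = w - eta * grad            # one step in the opposite direction
    print(w)  # approaches the closed-form solution, roughly [2.0, 3.0]

Note that the weights are updated only once per pass over the full training set.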

3) STOCHASTIC GRADIENT DESCENT (SGD)

In GD optimization, we compute the cost gradient based on the complete training set; hence, we sometimes also call it batch GD. In the case of very large datasets, using GD can be quite costly since we are only taking a single step for one pass over the training set -- thus, the larger the training set, the slower our algorithm updates the weights and the longer it may take until it converges to the global cost minimum (note that the SSE cost function is convex).

In Stochastic Gradient Descent (SGD; sometimes also referred to as iterative or on-line GD), we don't accumulate the weight updates as we've seen above for GD:
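
    Δw_j = η ∑_i (y^(i) − ŷ^(i)) x_j^(i)    (summed over all n training samples)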

Instead, we update the weights after each training sample:
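
    Δw_j = η (y^(i) − ŷ^(i)) x_j^(i),    w := w + Δw    (applied immediately for each training sample i)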

Here, the term "stochastic" comes from the fact that the gradient based on a single training sample is a "stochastic approximation" of the "true" cost gradient. Due to its stochastic nature, the path towards the global cost minimum is not "direct" as in GD, but may go "zig-zag" if we are visualizing the cost surface in a 2D space. However, it has been shown that SGD almost surely converges to the global cost minimum if the cost function is convex (or pseudo-convex)[1]. Furthermore, there are different tricks to improve the GD-based learning, for example:

  • An adaptive learning rate η, e.g., choosing a decrease constant d that shrinks the learning rate over time (see the formulation sketched after this list)
  • Momentum learning by adding a factor of the previous gradient to the weight update for faster updates (also sketched below)
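
One common formulation of these two tricks (the exact variants differ across the literature) is a learning rate that shrinks with the time step t,

    η(t+1) := η(t) / (1 + t·d),

and a momentum term that adds a fraction α of the previous weight update to the current one,

    Δw(t) := −η ∇J(w(t)) + α Δw(t−1).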

A NOTE ABOUT SHUFFLING

There are several different flavors of SGD, all of which can be seen throughout the literature. Let's take a look at the three most common variants:

A)
  • randomly shuffle samples in the training set

    • for one or more epochs, or until approx. cost minimum is reached

      • for training sample i

        • compute gradients and perform weight updates
B)
  • for one or more epochs, or until approx. cost minimum is reached

    • randomly shuffle samples in the training set

      • for training sample i

        • compute gradients and perform weight updates
C)
  • for iterations t, or until approx. cost minimum is reached:

    • draw random sample from the training set

      • compute gradients and perform weight updates

In scenario A [3], we shuffle the training set only once, at the beginning; whereas in scenario B, we shuffle the training set after each epoch to prevent repeating update cycles. In both scenario A and scenario B, each training sample is only used once per epoch to update the model weights.

In scenario C, we draw the training samples randomly with replacement from the training set [2]. If the number of iterations t is equal to the number of training samples, we learn the model based on a bootstrap sample of the training set.
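
To illustrate, here is a minimal sketch of scenario B (the training set is reshuffled before every epoch) for the OLS toy problem used above; the synthetic data and hyperparameters are illustrative choices only, not taken from the article:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, size=100)
    y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=100)
    X = np.column_stack([np.ones_like(x), x])  # x0 = 1 for the intercept

    eta = 0.05        # fixed learning rate
    w = np.zeros(2)

    for epoch in range(50):
        # Scenario B: reshuffle the sample indices before every epoch
        for i in rng.permutation(len(y)):
            error = y[i] - X[i] @ w
            w += eta * error * X[i]    # weight update after every single sample
    print(w)  # approximately [2.0, 3.0]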

4) MINI-BATCH GRADIENT DESCENT (MB-GD)

Mini-Batch Gradient Descent (MB-GD) is a compromise between batch GD and SGD. In MB-GD, we update the model based on smaller groups of training samples; instead of computing the gradient from 1 sample (SGD) or all n training samples (GD), we compute the gradient from 1 < k < n training samples (a common mini-batch size is k=50).

MB-GD converges in fewer iterations than GD because we update the weights more frequently; at the same time, MB-GD lets us utilize vectorized operations, which typically results in a computational performance gain over SGD.
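
A rough sketch of the mini-batch variant for the same toy problem (the mini-batch size k = 20 and the other hyperparameters are chosen purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, size=100)
    y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=100)
    X = np.column_stack([np.ones_like(x), x])  # x0 = 1 for the intercept

    eta, k = 0.1, 20      # learning rate and mini-batch size
    w = np.zeros(2)

    for epoch in range(300):
        idx = rng.permutation(len(y))
        for start in range(0, len(y), k):
            batch = idx[start:start + k]
            errors = y[batch] - X[batch] @ w          # vectorized over the mini-batch
            w += eta * (X[batch].T @ errors) / k      # averaged gradient step
    print(w)  # approximately [2.0, 3.0]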

REFERENCES

  • [1] Bottou, Léon (1998). "Online Algorithms and Stochastic Approximations." Online Learning and Neural Networks. Cambridge University Press. ISBN 978-0-521-65263-6.
  • [2] Bottou, Léon (2010). "Large-Scale Machine Learning with Stochastic Gradient Descent." Proceedings of COMPSTAT'2010. Physica-Verlag HD. 177-186.
  • [3] Bottou, Léon (2012). "Stochastic Gradient Descent Tricks." Neural Networks: Tricks of the Trade. Springer Berlin Heidelberg. 421-436.