On the iteration speed of subgradient descent and proximal gradient descent

clc;clear;
D=500;N=10000;thre=10e-8;zeroRatio=0.6;
X = randn(N,D);
r=rand(1,D);
r=sign(1-2*r).*(2+2*r);                            % true coefficients with magnitude in [2,4] and random sign
perm=randperm(D);r(perm(1:floor(D*zeroRatio)))=0;  % zero out 60% of them for a sparse ground truth
Y = X*r' + randn(N,1)*.1; % small added noise
lamda=1;stepsize=10e-5;
%%% model: y = x*beta'
%%% Loss = 0.5*||Y - X*beta'||_2^2 + lamda*||beta||_1

%%%% subgradient descent
%%% subgradient: dLoss/dbeta = sum_i x_i*(x_i*beta'-y_i) + lamda*sign(beta)
beta=zeros(size(r));

pre_error=inf;new_error=0;
while abs(pre_error-new_error)>thre
    pre_error=new_error;
    tmp=0;
    for j=1:length(Y)
        tmp=tmp+X(j,:)*(X(j,:)*beta'-Y(j,:));
    end
    beta=beta-stepsize*(tmp+lamda*sign(beta));   % lamda*sign(beta) is the subgradient of the L1 term
    new_error=lamda*norm(beta,1);
    for j=1:length(Y)
        new_error=new_error+0.5*(Y(j,:)-X(j,:)*beta')^2;
    end
    disp(new_error)
end
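
As a side note, the inner for-loop only accumulates X'*(X*beta'-Y); the same subgradient step can be written in vectorized form (a minimal sketch, equivalent to the loop above and not part of the original post):

grad = (X' * (X*beta' - Y))';                        % 1 x D gradient of the smooth least-squares part
beta = beta - stepsize * (grad + lamda*sign(beta));  % subgradient step, lamda*sign(beta) for the L1 term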

% %%%% Proximal GD
% Loss = 0.5*||Y - X*beta'||_2^2 + lamda*||beta||_1 = g(beta) + h(beta)
% g is differentiable: gradient step  z = beta_t - stepsize*sum_i x_i*(x_i*beta_t'-y_i)
% beta_{t+1} = prox_{stepsize*lamda*||.||_1}(z) = sign(z).*max(abs(z)-stepsize*lamda,0)   (soft thresholding)

disp('pgd')
beta_pgd=zeros(size(r));
pre_error=inf;new_error=0;
while abs(pre_error-new_error)>thre
    pre_error=new_error;
    tmp=0;
    for j=1:length(Y)
        tmp=tmp+X(j,:)*(X(j,:)*beta_pgd'-Y(j,:));
    end
    newbeta=beta_pgd-stepsize*tmp; add=stepsize*lamda;            % gradient step on the smooth part g only
    pidx=newbeta>add;          beta_pgd(pidx)=newbeta(pidx)-add;  % soft thresholding: shrink large positive entries
    zeroidx=abs(newbeta)<=add; beta_pgd(zeroidx)=0;               % zero out small entries
    nidx=newbeta<-add;         beta_pgd(nidx)=newbeta(nidx)+add;  % shrink large negative entries

    new_error=lamda*norm(beta_pgd,1);
    for j=1:length(Y)
        new_error=new_error+0.5*(Y(j,:)-X(j,:)*beta_pgd')^2;
    end
    disp(new_error)
end
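
To check that the sparse ground truth is recovered, one can compare beta_pgd with r after convergence (a small sanity check using only the variables defined above, not part of the original post):

% count entries whose magnitude exceeds an arbitrary 1e-3 threshold as "nonzero"
fprintf('nonzeros: true %d, pgd %d, rel. error %.4f\n', nnz(r), nnz(abs(beta_pgd)>1e-3), norm(beta_pgd-r)/norm(r));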

The key point of PGD is that the proximal (projection) step has a closed-form solution, soft thresholding, so each iteration stays cheap and the objective drops quickly.
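
Written as a helper, the proximal operator of thr*||.||_1 is just elementwise soft thresholding; a minimal sketch (the function name soft_threshold is mine, not from the original code, and in MATLAB it would live in its own file or at the end of the script):

function z = soft_threshold(v, thr)
% prox of thr*||.||_1: shrink every entry of v toward zero by thr, zeroing the small ones
z = sign(v) .* max(abs(v) - thr, 0);
end

With this helper, the three masking lines in the PGD loop reduce to beta_pgd = soft_threshold(newbeta, add);.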

Subgradient descent only converges at rate O(1/sqrt(T)), while proximal gradient descent converges at rate O(1/T).

Time: 2024-10-19 10:48:44
