UFLDL Tutorial Notes and Exercise Answers (6): Sparse Coding and the Sparse Coding Autoencoder Interpretation

Sparse Coding

Sparse coding is another important branch of deep learning; like the methods covered earlier, it can extract good (sparse) features from a dataset. There is a reason for representing our input data with a sparse set of components: most sensory data, such as natural images, can be described as the superposition of a small number of basic elements, which in images are typically surfaces or edges.

The goal of the sparse coding algorithm is to find a set of basis vectors $\phi_j$ such that an input vector $x$ can be represented as a linear combination of them:

$$x = \sum_{j=1}^{k} a_j \phi_j$$

The basis constructed here is required to be over-complete, i.e. $k > n$, so the system above will in most cases have infinitely many solutions. We therefore add a sparsity constraint, and the final optimization problem takes the following form:

$$\min_{a,\phi}\; \sum_{i=1}^{m} \Big\lVert x^{(i)} - \sum_{j=1}^{k} a_j^{(i)} \phi_j \Big\rVert^2 + \lambda \sum_{i=1}^{m} \sum_{j=1}^{k} S\big(a_j^{(i)}\big)$$

Here $S(a_j)$ is the sparsity penalty, which can be the L0 or the L1 norm; both induce sparse codes. The L1 norm is the tightest convex approximation of the L0 norm, and because it has much better optimization properties than L0 it is the one widely used in practice.
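Since the exercise below needs a differentiable objective, it replaces $|a|$ with the smooth approximation $\sqrt{a^2 + \epsilon}$ (the epsilon parameter in the code). A minimal sketch with made-up values:

% Smooth L1 approximation used throughout the exercise: |a| ~= sqrt(a.^2 + epsilon).
a = linspace(-1, 1, 5);            % illustrative values only
epsilon = 1e-2;
disp(abs(a));                      % exact (non-differentiable) L1 penalty per element
disp(sqrt(a.^2 + epsilon));        % smooth, differentiable approximation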

Sparse coding: autoencoder interpretation

To use sparse coding in deep learning for extracting good sparse features from a dataset, let A be the over-complete basis and let s be the final sparse features of the input data (i.e., the sparse coefficients in sparse coding); the data can then be written as X = A*s.

Here A plays exactly the role of W2 in the sparse autoencoder, and s corresponds to the hidden-layer activations. (When there are many samples, s is a matrix whose columns each hold the sparse feature values of one sample.)
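A quick shape check (a toy sketch with made-up sizes, just to fix the dimensions; not part of the exercise):

n = 64; k = 121; m = 5;            % illustrative sizes: n-dim patches, k > n basis vectors, m samples
A = randn(n, k);                   % over-complete basis (weightMatrix, playing the role of W2)
s = randn(k, m);                   % sparse codes (featureMatrix, one column per sample)
X = A * s;                         % reconstruction has the same shape as the data: n x m
disp(size(X));                     % prints 64 5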

The final optimization objective becomes (written to match the exercise code below, with $V$ the grouping matrix, equal to the identity in the non-topographic case, and $m$ the number of patches):

$$J(A, s) = \frac{1}{m}\lVert A s - X\rVert_F^2 + \lambda \sum \sqrt{V (s \odot s) + \epsilon} + \gamma \lVert A \rVert_F^2$$

The optimization alternates between the two sets of variables: with A fixed, minimize the objective over s; then with s fixed, minimize it over A; repeat until convergence.

The following two tricks can be used to improve the iteration speed and the accuracy of the final result.

(1) Split the samples into "mini-batches". For example, with 10000 samples we can randomly select only 2000 mini-patches in each iteration; this not only makes each iteration faster but also speeds up convergence.

(2) Good initialization of s. Since our goal is X = As, in the alternating optimization, with A given we can initialize s as s = A^T * X; this, however, may destroy sparsity, so we add a normalization step: the number of rows of s equals the number of columns of A, and each element of s is divided by the 2-norm of the corresponding column of A.

That is, $s_r \leftarrow s_r / \lVert A_r \rVert$, where $A_r$ denotes the r-th column of A. Normalizing s in this way keeps the sparsity penalty small. <I personally think the subscript on A at this point in the UFLDL tutorial is wrong.> (A code sketch of both tricks, mirroring the exercise script, is given right below.)
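A sketch of both tricks, mirroring the mini-batch selection and feature initialization used in the exercise script further down (variable names follow the exercise; note that the starter code divides by the squared column norm of A rather than the norm itself):

indices       = randperm(numPatches);
indices       = indices(1:batchNumPatches);
batchPatches  = patches(:, indices);                       % trick (1): a random mini-batch
featureMatrix = weightMatrix' * batchPatches;              % trick (2): initialize s = A' * X
normWM        = sum(weightMatrix .^ 2)';                   % squared 2-norm of each column of A
featureMatrix = bsxfun(@rdivide, featureMatrix, normWM);   % normalize each row of s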

The final optimization algorithm therefore runs as follows: repeat { select a random mini-batch of patches; initialize s from A as above; with A fixed, minimize the objective over s with a gradient-based method (L-BFGS in the exercise); with s fixed, update A with its closed-form solution }.

Note: s and A above are optimized alternately. When differentiating with respect to s and A separately, we find that A has a closed-form (analytic) solution, so when optimizing A we can simply plug in that analytic solution; s, on the other hand, has no closed-form solution, so it must be treated as an unconstrained optimization problem solved with gradient-based iterations.
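A short derivation of the closed-form update for A (a sketch, consistent with the update in the exercise script below; S is the feature matrix, X the patch matrix, m the batch size):

$$\frac{\partial J}{\partial A} = \frac{2 A S S^{\top} - 2 X S^{\top}}{m} + 2\gamma A = 0
\;\Longrightarrow\; A\big(S S^{\top} + m\gamma I\big) = X S^{\top}
\;\Longrightarrow\; A = X S^{\top}\big(S S^{\top} + m\gamma I\big)^{-1}$$

This is exactly what the line weightMatrix = batchPatches*featureMatrix'/(gamma*batchNumPatches*eye(...) + featureMatrix*featureMatrix') in the script computes.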

Testing: given a new sample x, we need to optimize the cost function above using the A learned from the training set; the resulting s is the sparse feature of that sample. Thus, compared with the feed-forward networks used earlier, "encoding" each new data sample requires running the optimization again to obtain the required coefficients.
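A minimal sketch of encoding a new patch xNew (a hypothetical visibleSize-by-1 column vector, not part of the exercise data), assuming minFunc and the exercise variables/functions below are on the path:

sNew = weightMatrix' * xNew;                        % initialize with the same trick as in training
sNew = sNew ./ sum(weightMatrix .^ 2)';             % normalize by squared column norms of A
options.Method  = 'lbfgs';
options.maxIter = 20;
options.display = 'off';
sNew = minFunc(@(s) sparseCodingFeatureCost(weightMatrix, s, visibleSize, numFeatures, ...
               xNew, gamma, lambda, epsilon, eye(numFeatures)), sNew(:), options);
% sNew now holds the sparse features ("code") of the new sample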

Note: I have not yet figured out how the groupMatrix for topographic sparse coding in the tutorial is constructed, so I will not risk misleading you here.

Exercise answers:

For the derivation of grad below, see the results given in this blog post: http://www.cnblogs.com/tornadomeet/archive/2013/04/16/3024292.html
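For reference, a sketch of the gradient of the feature cost that the code implements (V is the grouping matrix, $\odot$ element-wise multiplication, m the number of patches; the gradient with respect to A appeared in the derivation above):

$$\frac{\partial J}{\partial s} = \frac{2 A^{\top} A s - 2 A^{\top} X}{m} + \lambda\, V^{\top}\big(V(s \odot s) + \epsilon\big)^{-\frac{1}{2}} \odot s$$

In the non-topographic case (V = I) the penalty term reduces to $\lambda\, s / \sqrt{s \odot s + \epsilon}$ (element-wise), which is the expression used in sparseCodingFeatureCost.m.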

sparseCodingExercise.m

%% CS294A/CS294W Sparse Coding Exercise

%  Instructions
%  ------------
%
%  This file contains code that helps you get started on the
%  sparse coding exercise. In this exercise, you will need to modify
%  sparseCodingFeatureCost.m and sparseCodingWeightCost.m. You will also
%  need to modify this file, sparseCodingExercise.m slightly.

% Add the paths to your earlier exercises if necessary
% addpath /path/to/solution

%%======================================================================
%% STEP 0: Initialization
%  Here we initialize some parameters used for the exercise.

numPatches = 20000;   % number of patches
numFeatures = 121;    % number of features to learn
patchDim = 8;         % patch dimension
visibleSize = patchDim * patchDim; 

% dimension of the grouping region (poolDim x poolDim) for topographic sparse coding
poolDim = 3;

% number of patches per batch
batchNumPatches = 2000; 

lambda = 5e-5;  % L1-regularisation parameter (on features)
epsilon = 1e-5; % L1-regularisation epsilon |x| ~ sqrt(x^2 + epsilon)
gamma = 1e-2;   % L2-regularisation parameter (on basis)

%%======================================================================
%% STEP 1: Sample patches

images = load('IMAGES.mat');
images = images.IMAGES;

patches = sampleIMAGES(images, patchDim, numPatches);
display_network(patches(:, 1:64));

%%======================================================================
%% STEP 2: Implement and check sparse coding cost functions
%  Implement the two sparse coding cost functions and check your gradients.
%  The two cost functions are
%  1) sparseCodingFeatureCost (in sparseCodingFeatureCost.m) for the features
%     (used when optimizing for s, which is called featureMatrix in this exercise)
%  2) sparseCodingWeightCost (in sparseCodingWeightCost.m) for the weights
%     (used when optimizing for A, which is called weightMatrix in this exercise)

% We reduce the number of features and number of patches for debugging
% numFeatures = 25;
% patches = patches(:, 1:5);
% numPatches = 5;

weightMatrix = randn(visibleSize, numFeatures) * 0.005;
featureMatrix = randn(numFeatures, numPatches) * 0.005;

%% STEP 2a: Implement and test weight cost
%  Implement sparseCodingWeightCost in sparseCodingWeightCost.m and check
%  the gradient

[cost, grad] = sparseCodingWeightCost(weightMatrix, featureMatrix, visibleSize, numFeatures, patches, gamma, lambda, epsilon);

numgrad = computeNumericalGradient( @(x) sparseCodingWeightCost(x, featureMatrix, visibleSize, numFeatures, patches, gamma, lambda, epsilon), weightMatrix(:) );
% Uncomment the line below to display the numerical and analytic gradients side by side
% disp([numgrad grad]);
diff = norm(numgrad-grad)/norm(numgrad+grad);
fprintf('Weight difference: %g\n', diff);
assert(diff < 1e-8, 'Weight difference too large. Check your weight cost function. ');

%% STEP 2b: Implement and test feature cost (non-topographic)
%  Implement sparseCodingFeatureCost in sparseCodingFeatureCost.m and check
%  the gradient. You may wish to implement the non-topographic version of
%  the cost function first, and extend it to the topographic version later.

% Set epsilon to a larger value so checking the gradient numerically makes sense
epsilon = 1e-2;

% We pass in the identity matrix as the grouping matrix, putting each
% feature in a group on its own.
groupMatrix = eye(numFeatures);

[cost, grad] = sparseCodingFeatureCost(weightMatrix, featureMatrix, visibleSize, numFeatures, patches, gamma, lambda, epsilon, groupMatrix);

numgrad = computeNumericalGradient( @(x) sparseCodingFeatureCost(weightMatrix, x, visibleSize, numFeatures, patches, gamma, lambda, epsilon, groupMatrix), featureMatrix(:) );
% Uncomment the line below to display the numerical and analytic gradients side by side
% disp([numgrad grad]);
diff = norm(numgrad-grad)/norm(numgrad+grad);
fprintf('Feature difference (non-topographic): %g\n', diff);
assert(diff < 1e-8, 'Feature difference too large. Check your feature cost function. ');

%% STEP 2c: Implement and test feature cost (topographic)
%  Implement sparseCodingFeatureCost in sparseCodingFeatureCost.m and check
%  the gradient. This time, we will pass a random grouping matrix in to
%  check if your costs and gradients are correct for the topographic
%  version.

% Set epsilon to a larger value so checking the gradient numerically makes sense
epsilon = 1e-2;

% This time we pass in a random grouping matrix to check if the grouping is
% correct.
groupMatrix = rand(100, numFeatures);

[cost, grad] = sparseCodingFeatureCost(weightMatrix, featureMatrix, visibleSize, numFeatures, patches, gamma, lambda, epsilon, groupMatrix);

numgrad = computeNumericalGradient( @(x) sparseCodingFeatureCost(weightMatrix, x, visibleSize, numFeatures, patches, gamma, lambda, epsilon, groupMatrix), featureMatrix(:) );
% Uncomment the line below to display the numerical and analytic gradients side by side
% disp([numgrad grad]);
diff = norm(numgrad-grad)/norm(numgrad+grad);
fprintf('Feature difference (topographic): %g\n', diff);
assert(diff < 1e-8, 'Feature difference too large. Check your feature cost function. ');

%%======================================================================
%% STEP 3: Iterative optimization
%  Once you have implemented the cost functions, you can now optimize for
%  the objective iteratively. The code to do the iterative optimization
%  using mini-batching and good initialization of the features has already
%  been included for you.
%
%  However, you will still need to derive and fill in the analytic solution
%  for optimizing the weight matrix given the features.
%  Derive the solution and implement it in the code below, verify the
%  gradient as described in the instructions below, and then run the
%  iterative optimization.

% Initialize options for minFunc
options.Method = 'lbfgs';
options.display = 'off';
options.verbose = 0;

% Initialize matrices
weightMatrix = rand(visibleSize, numFeatures);
featureMatrix = rand(numFeatures, batchNumPatches);

% Initialize grouping matrix
assert(floor(sqrt(numFeatures)) ^2 == numFeatures, 'numFeatures should be a perfect square');
donutDim = floor(sqrt(numFeatures));
assert(donutDim * donutDim == numFeatures,'donutDim^2 must be equal to numFeatures');

groupMatrix = zeros(numFeatures, donutDim, donutDim);

groupNum = 1;     %% build the grouping matrix for topographic sparse coding (I don't fully understand this block!!)
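% (Added note: my reading of the construction below, not from the tutorial text.)
% The numFeatures features are laid out on a donutDim x donutDim grid that wraps
% around at the edges like a torus ("donut"). Each pass of the loop stamps a
% poolDim x poolDim block of ones for one group and then circularly shifts the
% whole array, so after the reshape each row of groupMatrix marks the
% poolDim x poolDim neighborhood (with wraparound) of one feature on that grid.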
for row = 1:donutDim
    for col = 1:donutDim
        groupMatrix(groupNum, 1:poolDim, 1:poolDim) = 1;
        groupNum = groupNum + 1;
        groupMatrix = circshift(groupMatrix, [0 0 -1]);
    end
    groupMatrix = circshift(groupMatrix, [0 -1, 0]);
end

groupMatrix = reshape(groupMatrix, numFeatures, numFeatures);
if isequal(questdlg('Initialize grouping matrix for topographic or non-topographic sparse coding?', 'Topographic/non-topographic?', 'Non-topographic', 'Topographic', 'Non-topographic'), 'Non-topographic')
    groupMatrix = eye(numFeatures);
end

% Initial batch
indices = randperm(numPatches);
indices = indices(1:batchNumPatches);
batchPatches = patches(:, indices);                           

fprintf('%6s%12s%12s%12s%12s\n','Iter', 'fObj','fResidue','fSparsity','fWeight');

for iteration = 1:200                                 %% alternate the two optimizations until the cost function is minimized
    error = weightMatrix * featureMatrix - batchPatches;
    error = sum(error(:) .^ 2) / batchNumPatches;

    fResidue = error;

    R = groupMatrix * (featureMatrix .^ 2);
    R = sqrt(R + epsilon);
    fSparsity = lambda * sum(R(:));    

    fWeight = gamma * sum(weightMatrix(:) .^ 2);

    fprintf('  %4d  %10.4f  %10.4f  %10.4f  %10.4f\n', iteration, fResidue+fSparsity+fWeight, fResidue, fSparsity, fWeight)   %% this block is optional; it just displays the current objective values

    % Select a new batch
    indices = randperm(numPatches);   %% re-select 2000 random samples for training
    indices = indices(1:batchNumPatches);
    batchPatches = patches(:, indices);         %%% the newly selected samples

    % Reinitialize featureMatrix with respect to the new batch
    featureMatrix = weightMatrix' * batchPatches;           %% trick: initialize featureMatrix (s) as s = A' * x
    normWM = sum(weightMatrix .^ 2)';                     %%%%% i.e., the sum of squares of each column of weightMatrix
    featureMatrix = bsxfun(@rdivide, featureMatrix, normWM);   %% divide featureMatrix by the values above

    % Optimize for feature matrix
    options.maxIter = 20;   % run 20 iterations of unconstrained optimization over featureMatrix
    [featureMatrix, cost] = minFunc( @(x) sparseCodingFeatureCost(weightMatrix, x, visibleSize, numFeatures, batchPatches, gamma, lambda, epsilon, groupMatrix), ...
                                           featureMatrix(:), options);
    featureMatrix = reshape(featureMatrix, numFeatures, batchNumPatches);                                      

    % Optimize for weight matrix
    weightMatrix = zeros(visibleSize, numFeatures);      %%% weightMatrix is obtained analytically by setting the derivative to zero; no gradient descent or Newton iterations are needed here
    weightMatrix = batchPatches*featureMatrix'/(gamma*batchNumPatches* eye(size(featureMatrix, 1)) + featureMatrix*featureMatrix');

    % -------------------- YOUR CODE HERE --------------------
    % Instructions:
    %   Fill in the analytic solution for weightMatrix that minimizes
    %   the weight cost here.
    %   Once that is done, use the code provided below to check that your
    %   closed form solution is correct.
    %   Once you have verified that your closed form solution is correct,
    %   you should comment out the checking code before running the
    %   optimization.

    [cost, grad] = sparseCodingWeightCost(weightMatrix, featureMatrix, visibleSize, numFeatures, batchPatches, gamma, lambda, epsilon, groupMatrix);
    assert(norm(grad) < 1e-12, 'Weight gradient should be close to 0. Check your closed form solution for weightMatrix again.')
    error('Weight gradient is okay. Comment out checking code before running optimization.');
    % -------------------- YOUR CODE HERE --------------------  

    % Visualize learned basis
    figure(1);
    display_network(weightMatrix);
end

sparseCodingWeightCost.m

function [cost, grad] = sparseCodingWeightCost(weightMatrix, featureMatrix, visibleSize, numFeatures,  patches, gamma, lambda, epsilon, groupMatrix)
%sparseCodingWeightCost - given the features in featureMatrix,
%                         computes the cost and gradient with respect to
%                         the weights, given in weightMatrix
% parameters
%   weightMatrix  - the weight matrix. weightMatrix(:, c) is the cth basis
%                   vector.
%   featureMatrix - the feature matrix. featureMatrix(:, c) is the features
%                   for the cth example
%   visibleSize   - number of pixels in the patches
%   numFeatures   - number of features
%   patches       - patches
%   gamma         - weight decay parameter (on weightMatrix)
%   lambda        - L1 sparsity weight (on featureMatrix)
%   epsilon       - L1 sparsity epsilon
%   groupMatrix   - the grouping matrix. groupMatrix(r, :) indicates the
%                   features included in the rth group. groupMatrix(r, c)
%                   is 1 if the cth feature is in the rth group and 0
%                   otherwise.

    if exist('groupMatrix', 'var')
        assert(size(groupMatrix, 2) == numFeatures, 'groupMatrix has bad dimension');
    else
        groupMatrix = eye(numFeatures);
    end

    numExamples = size(patches, 2);

    weightMatrix = reshape(weightMatrix, visibleSize, numFeatures);
    featureMatrix = reshape(featureMatrix, numFeatures, numExamples);

    % -------------------- YOUR CODE HERE --------------------
    % Instructions:
    %   Write code to compute the cost and gradient with respect to the
    %   weights given in weightMatrix.
    % -------------------- YOUR CODE HERE --------------------   

    ave_square = sum(sum((weightMatrix * featureMatrix - patches).^2))./numExamples;   % reconstruction error
    weight_decay = gamma * sum(sum(weightMatrix.^2));            % weight decay term
    cost = ave_square + weight_decay;

    grad = (2*weightMatrix*featureMatrix*featureMatrix' - 2 * patches*featureMatrix')./numExamples + 2*gamma*weightMatrix;
    grad = grad(:);

end

sparseCodingFeatureCost.m

function [cost, grad] = sparseCodingFeatureCost(weightMatrix, featureMatrix, visibleSize, numFeatures, patches, gamma, lambda, epsilon, groupMatrix)
%sparseCodingFeatureCost - given the weights in weightMatrix,
%                          computes the cost and gradient with respect to
%                          the features, given in featureMatrix
% parameters
%   weightMatrix  - the weight matrix. weightMatrix(:, c) is the cth basis
%                   vector.
%   featureMatrix - the feature matrix. featureMatrix(:, c) is the features
%                   for the cth example
%   visibleSize   - number of pixels in the patches
%   numFeatures   - number of features
%   patches       - patches
%   gamma         - weight decay parameter (on weightMatrix)
%   lambda        - L1 sparsity weight (on featureMatrix)
%   epsilon       - L1 sparsity epsilon
%   groupMatrix   - the grouping matrix. groupMatrix(r, :) indicates the
%                   features included in the rth group. groupMatrix(r, c)
%                   is 1 if the cth feature is in the rth group and 0
%                   otherwise.
    isTopo = 1;
    if exist('groupMatrix', 'var')
        assert(size(groupMatrix, 2) == numFeatures, 'groupMatrix has bad dimension');
        if(isequal(groupMatrix, eye(numFeatures)))
            isTopo = 0;
        end
    else
        groupMatrix = eye(numFeatures);
        isTopo = 0;
    end

    numExamples = size(patches, 2);

    weightMatrix = reshape(weightMatrix, visibleSize, numFeatures);
    featureMatrix = reshape(featureMatrix, numFeatures, numExamples);

    % -------------------- YOUR CODE HERE --------------------
    % Instructions:
    %   Write code to compute the cost and gradient with respect to the
    %   features given in featureMatrix.
    %   You may wish to write the non-topographic version, ignoring
    %   the grouping matrix groupMatrix first, and extend the
    %   non-topographic version to the topographic version later.
    % -------------------- YOUR CODE HERE --------------------
     ave_square = sum(sum((weightMatrix * featureMatrix - patches).^2))./numExamples;    % reconstruction error
     sparsity = lambda .* sum(sum(sqrt( groupMatrix * (featureMatrix.^2) + epsilon)));      % sparsity penalty term
     cost = ave_square + sparsity;
     gradResidue = (2* weightMatrix'* weightMatrix*featureMatrix - 2*weightMatrix'*patches)./numExamples; %%+ lambda*featureMatrix./sqrt(featureMatrix.^2+epsilon);

     if ~isTopo
        gradSparsity = lambda*featureMatrix./sqrt(featureMatrix.^2+epsilon);   %%% non-topographic sparse coding
     else
        gradSparsity = lambda * (groupMatrix' * (groupMatrix*(featureMatrix .^ 2) + epsilon).^(-0.5)) .* featureMatrix;   %% topographic sparse coding (note: the exponent must be -0.5)
     end
     grad = gradResidue + gradSparsity;
     grad = grad(:);

end

References:

1. UFLDL Tutorial: http://ufldl.stanford.edu/wiki/index.php/UFLDL%E6%95%99%E7%A8%8B

2. Norm regularization in machine learning (1): the L0, L1 and L2 norms: http://blog.csdn.net/zouxy09/article/details/24971995/

3. Deep learning: 29 (Sparse coding exercise): http://www.cnblogs.com/tornadomeet/archive/2013/04/16/3024292.html

4. Deep learning: 26 (A simple understanding of sparse coding): http://www.cnblogs.com/tornadomeet/archive/2013/04/13/3018393.html
