Softmax Regression

This model generalizes logistic regression to classification problems where the class label y can take on more than two possible values.

Softmax regression is a supervised learning algorithm, but we will later be using it in conjunction with our deep learning/unsupervised feature learning methods.

With logistic regression, we were in the binary classification setting, so the labels were $y^{(i)} \in \{0, 1\}$. Our hypothesis took the form:

$$h_\theta(x) = \frac{1}{1 + \exp(-\theta^T x)},$$

and the model parameters θ were trained to minimize the cost function

$$J(\theta) = -\frac{1}{m}\left[\sum_{i=1}^{m} y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\bigl(1 - h_\theta(x^{(i)})\bigr)\right].$$
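For concreteness, here is a minimal numpy sketch of this binary hypothesis and cost (the function and variable names are illustrative, not part of the original exercises):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_cost(theta, X, y):
        # X: m x (n+1) design matrix (first column all ones), y: m-vector of 0/1 labels
        h = sigmoid(X @ theta)          # h_theta(x^(i)) for every example
        return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))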

 

In the softmax regression setting, we are interested in multi-class classification (as opposed to only binary classification), and so the label y can take on k different values, rather than only two.

Given a test input x, we want our hypothesis to estimate the probability p(y = j | x) for each value of j = 1, …, k. I.e., we want to estimate the probability of the class label taking on each of the k different possible values. Thus, our hypothesis will output a k-dimensional vector (whose elements sum to 1) giving us our k estimated probabilities. Concretely, our hypothesis hθ(x) takes the form:

$$h_\theta(x^{(i)}) = \begin{bmatrix} p(y^{(i)} = 1 \mid x^{(i)}; \theta) \\ p(y^{(i)} = 2 \mid x^{(i)}; \theta) \\ \vdots \\ p(y^{(i)} = k \mid x^{(i)}; \theta) \end{bmatrix} = \frac{1}{\sum_{j=1}^{k} e^{\theta_j^T x^{(i)}}} \begin{bmatrix} e^{\theta_1^T x^{(i)}} \\ e^{\theta_2^T x^{(i)}} \\ \vdots \\ e^{\theta_k^T x^{(i)}} \end{bmatrix}$$

Here $\theta_1, \theta_2, \ldots, \theta_k \in \mathbb{R}^{n+1}$ are the parameters of our model. Notice that the term $\frac{1}{\sum_{j=1}^{k} e^{\theta_j^T x^{(i)}}}$ normalizes the distribution, so that it sums to one.

For convenience, we will also write θ to denote all the parameters of our model. When you implement softmax regression, it is usually convenient to represent θ as a k-by-(n + 1) matrix obtained by stacking up $\theta_1, \theta_2, \ldots, \theta_k$ in rows, so that

$$\theta = \begin{bmatrix} \text{---}\; \theta_1^T \;\text{---} \\ \text{---}\; \theta_2^T \;\text{---} \\ \vdots \\ \text{---}\; \theta_k^T \;\text{---} \end{bmatrix}$$

(That is, θ is a k × (n + 1) matrix.)
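With θ stored this way, the hypothesis is a few lines of numpy. The sketch below is illustrative (not the tutorial's Matlab starter code); subtracting the maximum score before exponentiating is a standard numerical-stability trick and, as discussed under the parameterization properties below, does not change the resulting probabilities:

    def softmax_hypothesis(theta, x):
        # theta: k x (n+1) parameter matrix, x: (n+1)-vector including the intercept term
        scores = theta @ x                       # theta_j^T x for j = 1, ..., k
        scores -= np.max(scores)                 # shift scores; probabilities are unchanged
        exp_scores = np.exp(scores)
        return exp_scores / np.sum(exp_scores)   # k probabilities that sum to 1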

 

Cost Function

Our cost function will be:

$$J(\theta) = -\frac{1}{m}\left[\sum_{i=1}^{m} \sum_{j=1}^{k} 1\{y^{(i)} = j\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}}}\right]$$

Notice that this generalizes the logistic regression cost function, which could also have been written:

$$J(\theta) = -\frac{1}{m}\left[\sum_{i=1}^{m} \sum_{j=0}^{1} 1\{y^{(i)} = j\} \log p(y^{(i)} = j \mid x^{(i)}; \theta)\right].$$

In the above, $1\{\cdot\}$ is the indicator function, so that 1{a true statement} = 1, and 1{a false statement} = 0.

The softmax cost function is similar, except that we now sum over the k different possible values of the class label. Note also that in softmax regression, we have that

$$p(y^{(i)} = j \mid x^{(i)}; \theta) = \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}}}.$$
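As an illustration, this cost can be computed in a few vectorized lines. The sketch uses my own conventions (labels stored 0-based so they can index arrays directly) and reuses numpy as above:

    def softmax_cost(theta, X, y):
        # theta: k x (n+1), X: m x (n+1), y: m-vector of labels in {0, ..., k-1}
        m = X.shape[0]
        scores = X @ theta.T                            # m x k matrix of theta_j^T x^(i)
        scores -= scores.max(axis=1, keepdims=True)     # numerical stability
        probs = np.exp(scores)
        probs /= probs.sum(axis=1, keepdims=True)       # p(y^(i) = j | x^(i); theta)
        # the indicator 1{y^(i) = j} selects the log-probability of each true label
        return -np.mean(np.log(probs[np.arange(m), y]))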

Properties of softmax regression parameterization

Softmax regression has an unusual property that it has a "redundant" set of parameters. To explain what this means, suppose we take each of our parameter vectors θj, and subtract some fixed vector ψ from it, so that every θj is now replaced with θj − ψ (for every j = 1, …, k). Our hypothesis now estimates the class label probabilities as

$$p(y^{(i)} = j \mid x^{(i)}; \theta) = \frac{e^{(\theta_j - \psi)^T x^{(i)}}}{\sum_{l=1}^{k} e^{(\theta_l - \psi)^T x^{(i)}}} = \frac{e^{\theta_j^T x^{(i)}} e^{-\psi^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}} e^{-\psi^T x^{(i)}}} = \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}}}.$$

In other words, subtracting ψ from every θj does not affect our hypothesis' predictions at all! This shows that softmax regression's parameters are "redundant." More formally, we say that our softmax model is overparameterized, meaning that for any hypothesis we might fit to the data, there are multiple parameter settings that give rise to exactly the same hypothesis function hθ mapping from inputs x to the predictions.
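A quick numerical check of this invariance, assuming the softmax_hypothesis helper sketched earlier (the sizes below are arbitrary):

    rng = np.random.default_rng(0)
    theta = rng.normal(size=(4, 6))          # k = 4 classes, n + 1 = 6 inputs
    x = rng.normal(size=6)
    psi = rng.normal(size=6)

    p_original = softmax_hypothesis(theta, x)
    p_shifted = softmax_hypothesis(theta - psi, x)   # subtract psi from every theta_j
    print(np.allclose(p_original, p_shifted))        # True: predictions are unchanged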

 

Further, if the cost function J(θ) is minimized by some setting of the parameters $(\theta_1, \theta_2, \ldots, \theta_k)$, then it is also minimized by $(\theta_1 - \psi, \theta_2 - \psi, \ldots, \theta_k - \psi)$ for any value of ψ. Thus, the minimizer of J(θ) is not unique. (Interestingly, J(θ) is still convex, and thus gradient descent will not run into local optima problems. But the Hessian is singular/non-invertible, which causes a straightforward implementation of Newton's method to run into numerical problems.)

Notice also that by setting ψ = θ1, one can always replace θ1 with $\vec{0}$ (the vector of all 0's), without affecting the hypothesis. Thus, one could "eliminate" the vector of parameters θ1 (or any other θj, for any single value of j), without harming the representational power of our hypothesis. Indeed, rather than optimizing over the k(n + 1) parameters $(\theta_1, \theta_2, \ldots, \theta_k)$ (where $\theta_j \in \mathbb{R}^{n+1}$), one could instead set $\theta_1 = \vec{0}$ and optimize only with respect to the (k − 1)(n + 1) remaining parameters, and this would work fine. (In effect, this removes one redundant set of parameters.)
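Continuing the numerical check above, choosing ψ = θ1 zeroes out the first row of θ while leaving the predictions untouched:

    psi = theta[0].copy()                    # psi = theta_1
    theta_reduced = theta - psi              # first row of theta becomes all zeros
    print(np.allclose(theta_reduced[0], 0.0))                    # True
    print(np.allclose(softmax_hypothesis(theta, x),
                      softmax_hypothesis(theta_reduced, x)))     # True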

In practice, however, it is often cleaner and simpler to implement the version which keeps all the parameters $(\theta_1, \theta_2, \ldots, \theta_k)$, without arbitrarily setting one of them to zero. But we will make one change to the cost function: adding weight decay. This will take care of the numerical problems associated with softmax regression's overparameterized representation.

Weight Decay

We will modify the cost function by adding a weight decay term $\frac{\lambda}{2} \sum_{i=1}^{k} \sum_{j=0}^{n} \theta_{ij}^2$ which penalizes large values of the parameters. Our cost function is now

$$J(\theta) = -\frac{1}{m}\left[\sum_{i=1}^{m} \sum_{j=1}^{k} 1\{y^{(i)} = j\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}}}\right] + \frac{\lambda}{2} \sum_{i=1}^{k} \sum_{j=0}^{n} \theta_{ij}^2$$

With this weight decay term (for any λ > 0), the cost function J(θ) is now strictly convex, and is guaranteed to have a unique solution. The Hessian is now invertible, and because J(θ) is convex, algorithms such as gradient descent, L-BFGS, etc. are guaranteed to converge to the global minimum.

To apply an optimization algorithm, we also need the derivative of this new definition of J(θ). One can show that the derivative is:

$$\nabla_{\theta_j} J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ x^{(i)} \left( 1\{y^{(i)} = j\} - p(y^{(i)} = j \mid x^{(i)}; \theta) \right) \right] + \lambda \theta_j$$
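Putting the weight-decayed cost and its gradient together, a vectorized sketch (same illustrative conventions as before; lam plays the role of λ) might look like:

    def softmax_cost_and_grad(theta, X, y, lam):
        # theta: k x (n+1), X: m x (n+1), y: m-vector of 0-based labels, lam: weight decay
        m, k = X.shape[0], theta.shape[0]
        scores = X @ theta.T
        scores -= scores.max(axis=1, keepdims=True)
        probs = np.exp(scores)
        probs /= probs.sum(axis=1, keepdims=True)             # p(y^(i) = j | x^(i); theta)

        indicators = np.zeros((m, k))
        indicators[np.arange(m), y] = 1.0                     # 1{y^(i) = j}

        cost = -np.mean(np.log(probs[np.arange(m), y])) + 0.5 * lam * np.sum(theta ** 2)
        grad = -(indicators - probs).T @ X / m + lam * theta  # k x (n+1), same shape as theta
        return cost, grad

The (cost, grad) pair can then be handed to an off-the-shelf optimizer such as gradient descent or L-BFGS.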

Relationship to Logistic Regression

In the special case where k = 2, one can show that softmax regression reduces to logistic regression. This shows that softmax regression is a generalization of logistic regression. Concretely, when k = 2, the softmax regression hypothesis outputs

$$h_\theta(x) = \frac{1}{e^{\theta_1^T x} + e^{\theta_2^T x}} \begin{bmatrix} e^{\theta_1^T x} \\ e^{\theta_2^T x} \end{bmatrix}$$

Taking advantage of the fact that this hypothesis is overparameterized and setting ψ = θ1, we can subtract θ1 from each of the two parameters, giving us

$$h(x) = \frac{1}{e^{\vec{0}^T x} + e^{(\theta_2 - \theta_1)^T x}} \begin{bmatrix} e^{\vec{0}^T x} \\ e^{(\theta_2 - \theta_1)^T x} \end{bmatrix} = \begin{bmatrix} \frac{1}{1 + e^{(\theta_2 - \theta_1)^T x}} \\[4pt] 1 - \frac{1}{1 + e^{(\theta_2 - \theta_1)^T x}} \end{bmatrix}$$

Thus, replacing θ2 − θ1 with a single parameter vector θ′, we find that softmax regression predicts the probability of one of the classes as $\frac{1}{1 + e^{\theta'^T x}}$, and that of the other class as $1 - \frac{1}{1 + e^{\theta'^T x}}$, same as logistic regression.
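A small numerical confirmation of this reduction, reusing the helpers sketched above (sizes again arbitrary):

    rng = np.random.default_rng(1)
    theta2 = rng.normal(size=(2, 6))                  # two-class softmax parameters
    x = rng.normal(size=6)

    p_softmax = softmax_hypothesis(theta2, x)         # [p(y = 1 | x), p(y = 2 | x)]
    theta_prime = theta2[1] - theta2[0]               # theta' = theta_2 - theta_1
    print(np.allclose(p_softmax[1], sigmoid(theta_prime @ x)))   # True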

 

Softmax Regression vs. k Binary Classifiers

Suppose you are working on a music classification application, and there are k types of music that you are trying to recognize. Should you use a softmax classifier, or should you build k separate binary classifiers using logistic regression?

This will depend on whether your classes are mutually exclusive. For example, if your four classes are classical, country, rock, and jazz, then assuming each of your training examples is labeled with exactly one of these four class labels, you should build a softmax classifier with k = 4. (If there are also some examples that are none of the above four classes, then you can set k = 5 in softmax regression, and also have a fifth, "none of the above," class.) Mutually exclusive classes → softmax.

If however your categories are has_vocals, dance, soundtrack, pop, then the classes are not mutually exclusive; for example, there can be a piece of pop music that comes from a soundtrack and in addition has vocals. In this case, it would be more appropriate to build 4 binary logistic regression classifiers. This way, for each new musical piece, your algorithm can separately decide whether it falls into each of the four categories. Not mutually exclusive (the classes overlap) → k binary logistic regression classifiers.

Now, consider a computer vision example, where you're trying to classify images into three different classes. (i) Suppose that your classes are indoor_scene, outdoor_urban_scene, and outdoor_wilderness_scene. Would you use softmax regression or three logistic regression classifiers? (ii) Now suppose your classes are indoor_scene, black_and_white_image, and image_has_people. Would you use softmax regression or multiple logistic regression classifiers?

In the first case, the classes are mutually exclusive, so a softmax regression classifier would be appropriate. In the second case, it would be more appropriate to build three separate logistic regression classifiers.
