Support Vector Machines for classification

To whet your appetite for support vector machines, here’s a quote from machine learning researcher Andrew Ng:

“SVMs are among the best (and many believe are indeed the best) ‘off-the-shelf’ supervised learning algorithms.”

Professor Ng covers SVMs in his excellent Machine Learning MOOC, a gateway into data science for many, but he leaves out some of the details. That gap motivated us to put together these notes to answer the question:

“What are the support vectors in support vector machines?”

We also provide Python code using scikit-learn’s svm module to fit a binary classification problem using a custom kernel, along with code to generate the (awesome!) interactive plots in Part 3.

This post consists of three sections:

  • Part 1 sets up the problem from a geometric point of view and then shows how it can be framed as an optimization problem.
  • Part 2 transforms the optimization problem and uncovers the support vectors in the process.
  • Part 3 discusses how kernels can be used to separate non-linearly separable data.

Part 1: Defining the margin

Maximizing the margin

The figure below shows a binary classification problem (points labeled yi=±1) that is linearly separable.

There are many possible decision boundaries that would perfectly separate the two classes, but an SVM will choose the line in 2-d (or “hyperplane”, more generally) that maximizes the margin around the boundary.

Intuitively, we can be very confident about the labels of points that fall far from the boundary, but we’re less confident about points near the boundary.

Formulating the margin with geometry

Any point x lying on the separating hyperplane satisfies:
w⋅x+b=0
w is the vector normal to the plane, and b is a constant that describes how much the plane is shifted relative to the origin.  The distance of the plane from the origin is |b|/∥w∥.
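As a quick sanity check of these formulas, here is a minimal numeric sketch (the values of w, b, and the test point are made up for illustration):

```python
import numpy as np

# a toy hyperplane w.x + b = 0 (values chosen only for illustration)
w = np.array([3.0, 4.0])
b = -10.0

# distance of the plane from the origin: |b| / ||w||
print(abs(b) / np.linalg.norm(w))          # 2.0

# signed distance of an arbitrary point x0 from the plane: (w.x0 + b) / ||w||
x0 = np.array([1.0, 1.0])
print((w @ x0 + b) / np.linalg.norm(w))    # -0.6, i.e. on the negative side of the plane
```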

Now draw parallel planes on either side of the decision boundary, so we have what looks like a road, with the decision boundary as the median, and the additional planes as gutters.  The margin, i.e. the width of the road, is (d++d−) and is restricted by the data points closest to the boundary, which lie on the gutters.

The half-spaces bounded by the planes on the gutters are:

w⋅xi+b ≥ +a, for yi = +1

w⋅xi+b ≤ −a, for yi = −1

These two conditions can be put more succinctly:

yi(w⋅xi+b) ≥ a,  ∀i

Each gutter plane w⋅x+b = ±a lies a distance a/∥w∥ from the decision boundary, so d+ = d− = a/∥w∥ and the margin is:

d+ + d− = 2a/∥w∥

Without loss of generality, we can set a=1, since it only sets the scale (units) of b and w.  So to maximize the margin, we have to maximize 1/∥w∥.  However, 1/∥w∥ is an unpleasant (non-convex) objective function, so we instead minimize ∥w∥², which is convex.

The optimization problem

Maximizing the margin boils down to a constrained optimization problem: minimize some quantity f(w), subject to constraints g(w,b).  This optimization problem is particularly nice because it is convex; the objective ∥w∥² is convex, as are the constraints, which are linear.

In other words, we are faced with a quadratic programming problem.  The standard format of the optimization problem for the separable case is

minimize     f(w) ≡ (1/2)∥w∥²
subject to   gi(w,b) ≡ −yi(w⋅xi+b) + 1 ≤ 0,   i = 1…m        (1)

Before we address how to solve this optimization problem in Part 2, let’s first consider the case when data is non-separable.

Soft margin SVM: the non-separable problem and regularization

For non-separable data, we relax the constraints in (1) while penalizing misclassified points via a cost parameter C and slack variables ξi that define the amount by which data points are on the wrong side of the margin.

minimize     (1/2)∥w∥² + C∑i ξi
subject to   yi(w⋅xi+b) ≥ 1 − ξi,   i = 1…m
             ξi ≥ 0,   i = 1…m                                (2)

A large penalty for misclassification (large C) leads to a lower-bias, higher-variance SVM; a small C does the opposite.

The soft margin is used in practice; even in the separable case, it can be desirable to allow tradeoffs between the size of the margin and number of misclassifications. Outliers can skew the decision boundary learned by (1) towards a model with small margins + perfect classification, in contrast to a possibly more robust model learned by (2) with large margins + some misclassified points.
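To see the tradeoff controlled by C in code, here is a minimal sketch using scikit-learn’s SVC on synthetic, partially overlapping blobs (the data and the particular C values are made up for illustration):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# two overlapping Gaussian blobs, so the data is not perfectly separable
X = np.vstack([rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2)),
               rng.normal(loc=[3.0, 3.0], scale=1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    margin = 2.0 / np.linalg.norm(w)   # margin width from Part 1: 2/||w||
    print(f"C={C:>6}: margin width={margin:.2f}, "
          f"number of support vectors={clf.n_support_.sum()}")
```

Smaller C tolerates more margin violations and typically yields a wider margin with more support vectors; larger C does the opposite.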


Part 2: Solving the optimization problem

In Part 1, we showed how to set up SVMs as an optimization problem. In this section, we’ll see how the eponymous support vectors emerge when we rephrase the minimization problem as an equivalent maximization problem.

To recap: Given m training points that are labeled yi=±1, our goal is to maximize the margin of the hyperplane defined by w⋅x+b=0.

We’ll use the separable case (1) as our starting point, but the steps in the procedure and final result are similar for the non-separable case (also worked out in ref [3]).

The Lagrangian formulation

How do we solve this optimization problem? Minimizing a function without constraints is probably familiar: set the derivative of the function (the objective) to zero and solve.

With constraints, the procedure is similar to setting the derivative of the objective equal to zero. Instead of taking the derivative of the objective itself, however, we’ll operate on the Lagrangian L, which combines the objective and inequality constraints into one function:

L(w,b,α) = f(w) + ∑i αi gi(w,b)        (3)

We’ve just introduced additional variables αi, the Lagrange multipliers, which make it easier to work with the constraints (see the Wikipedia article on the method of Lagrange multipliers). Note that a more general form of the Lagrangian would include another summation term in (3) to handle equality constraints. Since there are only inequality constraints here, we’ll omit the extra term.

Constructing the dual problem

Much of the following discussion is based on ref [2], which has a nice introduction to duality in the context of SVMs.

First, let’s make the following observation:

maxα L(w,b,α) = { f(w),  if gi(w) ≤ 0 ∀i  (constraints satisfied)
                { ∞,     if gi(w) > 0 for some i  (constraint violated)        (Obs. 1)

Basically, if any constraint j is violated, i.e. gj(w)>0, then the Lagrange multiplier αj that is multiplying gj(w) can be made arbitrarily large (→∞) in order to maximize L.

On the other hand, if all the constraints are satisfied, gi(w)≤0 ∀i, then L is maximized by setting the αi that multiply strictly negative quantities equal to zero. However, Lagrange multipliers multiplying gi(w) that satisfy the constraints with equality, gi(w)=0, can be non-zero without diminishing L.

The last statement amounts to the property of “complementary slackness” in the Karush-Kuhn-Tucker conditions for the solution:

αigi(w)=0(4)

Recall from the original geometric picture: only a few points lie exactly on the margins, and those points are described by gi(w)=0 (and thus have non-zero Lagrange multipliers). The points on the margin are the support vectors.

Next, we make use of the Max-Min inequality:

maxα minw,b L(w,b,α) ≤ minw,b maxα L(w,b,α)

This inequality is an equality under certain conditions, which our problem satisfies (convex f and g). The left side of the inequality is called the dual problem, and the right side is the primal problem.

Now we can put it all together: Observation 1 tells us that solving the right side (primal problem) of the Max-Min inequality is the same as solving the original problem. Because our problem is convex, the Max-Min inequality holds with equality, so solving the left side (the dual problem) is equivalent to solving the primal problem.

Thus we’re set to approach the solution via the dual problem, which is useful for dealing with nonlinear decision boundaries.

Solving the dual problem

The dual problem to solve is maxαminw,bL(w,b,α), subject to constraints* on the Lagrange multipliers: αi≥0.

Let’s work out the inner part of the expression explicitly. We obtain minw,bL(w,b,α) by setting:

∇w L = 0;   ∂L/∂b = 0

These equations for the partial derivatives give us, respectively:

w = ∑i αi yi xi ;   ∑i αi yi = 0

w is a linear combination of the coordinates of the training data. Only the support vectors, which have non-zero αi, contribute to the sum. To predict the label for a new test point xt, simply evaluate the sign of

w⋅xt + b = ∑i αi yi xi⋅xt + b        (5)

where b can be computed from the KKT complementarity condition (4) by plugging in the values for any support vector. The equation for the separating hyperplane is entirely determined by the support vectors.
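In scikit-learn the quantities in (5) are exposed directly: dual_coef_ stores αi yi for the support vectors, support_vectors_ stores their coordinates, and intercept_ stores b. A minimal sketch (the toy data is made up; a very large C approximates the separable problem (1)):

```python
import numpy as np
from sklearn.svm import SVC

# toy, linearly separable data
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.5]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ hard margin

alpha_y = clf.dual_coef_[0]                   # alpha_i * y_i, support vectors only
w = alpha_y @ clf.support_vectors_            # w = sum_i alpha_i y_i x_i
b = clf.intercept_[0]

x_t = np.array([2.5, 1.5])
print(np.sign(w @ x_t + b))                   # prediction via equation (5)
print(clf.predict([x_t])[0])                  # the same label from scikit-learn
```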

Plugging the last two equations into L leads to the dual formulation of the problem maxαLD:

maxα         ∑i αi − (1/2)∑i,j αi αj yi yj xi⋅xj
subject to   αi ≥ 0,   i = 1…m
             ∑i αi yi = 0                                      (6)

The dual for the non-separable primal Lagrangian (2) — derived using the same procedure we just followed — looks just like the dual for the separable case (6), except that the Lagrange multipliers are bounded from above by the regularization constant: 0≤αi≤C. Notably, the slack variables ξi do not appear in the dual of the soft margin SVM.
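To make the dual concrete, here is a minimal sketch that feeds (6) to a generic quadratic programming solver. It assumes the cvxopt package (not used elsewhere in this post) and made-up, separable toy data; the support vectors are exactly the points with non-zero αi:

```python
import numpy as np
from cvxopt import matrix, solvers

# toy, linearly separable data
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.0],
              [0.0, 0.0], [1.0, 0.5], [-1.0, 1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
m = len(y)

# dual (6) in QP form: minimize (1/2) a^T Q a - 1^T a,  s.t.  a_i >= 0,  sum_i a_i y_i = 0
Q = (y[:, None] * X) @ (y[:, None] * X).T       # Q_ij = y_i y_j x_i . x_j
P, q = matrix(Q), matrix(-np.ones(m))
G, h = matrix(-np.eye(m)), matrix(np.zeros(m))  # encodes -a_i <= 0
A, b_eq = matrix(y.reshape(1, -1)), matrix(0.0)

solvers.options['show_progress'] = False
alpha = np.ravel(solvers.qp(P, q, G, h, A, b_eq)['x'])

sv = alpha > 1e-6                               # support vectors: non-zero multipliers
w = ((alpha * y)[:, None] * X).sum(axis=0)      # w = sum_i alpha_i y_i x_i
b = y[sv][0] - w @ X[sv][0]                     # from y_i (w . x_i + b) = 1 on a support vector
print("support vectors:\n", X[sv])
print("w =", w, " b =", b)
```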

The dual (called the Wolfe dual) is easier to solve because of the simpler form of its inequality constraints, and it is the form used in algorithms such as the Sequential Minimal Optimization (SMO) algorithm, which is implemented in the popular SVM solver LIBSVM. The key feature of the dual is that training vectors only appear as dot products xi⋅xj. This property allows us to generalize to the nonlinear case via the “kernel trick” discussed in Part 3 of this post.

* Some of you may be familiar with using Lagrange multipliers to optimize some function f(x) subject to equality constraints g(x)=0, in which case the Lagrange multipliers are unconstrained. The Karush-Kuhn-Tucker conditions generalize the method to include inequality constraints g(x)≤0, which results in additional constraints on the associated Lagrange multipliers (as we have here).


Part 3: Kernels

Data that is not linearly separable in the original input space may be separable if mapped to a different space. Consider the following example of nonlinearly separable, two-dimensional data:

However, if we map the 2-d input data x=(x,y) to a 3-d feature space by the function Φ(x)=(x, y, x²+y²), the blue and red points can be separated with a plane in the new (3-d) space. See the plot below of the decision boundary, the mapped points, and the original data points in the x-y plane. Drag the figure to rotate it, or zoom in and out with your mouse wheel!

Code to generate and fit the data in this example with scikit-learn’s SVM module, as well as code to create the plot.ly interactive plots above, is available in IPython notebooks on github.
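For readers who want a self-contained snippet rather than the full notebook, here is a condensed sketch in the same spirit: synthetic ring-shaped data and an SVC with a custom kernel implementing the mapping above (the data-generation settings are made up):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_circles

# nonlinearly separable 2-d data: an inner and an outer ring
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

def mapped_kernel(A, B):
    """Gram matrix for the map Phi(x, y) = (x, y, x^2 + y^2)."""
    ra = (A ** 2).sum(axis=1)            # x^2 + y^2 for each row of A
    rb = (B ** 2).sum(axis=1)
    return A @ B.T + np.outer(ra, rb)    # x_i x_j + y_i y_j + (x_i^2+y_i^2)(x_j^2+y_j^2)

clf = SVC(kernel=mapped_kernel, C=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```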

From maps to kernels

So how do we incorporate mapping the data into the formulation of the problem?

Recall that the data appears as a dot product in the dual Lagrangian (6). If we decide to train an SVM on the mapped data, then the dot product of the input data in (6) is replaced by the dot product of the mapped data: xi⋅xj→Φ(xi)⋅Φ(xj)

The kernel is simply the dot product of the mapped points. In the example above, the inner product under this mapping is an instance of a polynomial kernel:

K(xi,xj) = Φ(xi)⋅Φ(xj) = xixj + yiyj + (xi² + yi²)(xj² + yj²)

In practice, we work directly with the kernel K(xi,xj) rather than explicitly computing the map of the data points**. Computing the kernel directly allows us to sidestep the computationally expensive operation of mapping data to a high dimensional space and then taking a dot product (see ref [2] for examples comparing computational times of the two methods).
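A quick numeric check of this equivalence with two made-up points: computing the kernel directly in 2-d gives the same number as mapping both points to 3-d and taking the dot product.

```python
import numpy as np

def phi(p):
    """Explicit map to 3-d feature space: (x, y) -> (x, y, x^2 + y^2)."""
    x, y = p
    return np.array([x, y, x ** 2 + y ** 2])

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
K = xi @ xj + (xi @ xi) * (xj @ xj)         # kernel evaluated directly in 2-d
print(K, phi(xi) @ phi(xj))                 # both are 4.75
print(np.isclose(K, phi(xi) @ phi(xj)))     # True
```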

Using a kernel, the second term in the objective of the dual problem (6) becomes

∑i,jαiαjyiyjK(xi,xj)

The kernel also appears in the evaluation of (5) to predict the classification of a test point xt:

sgn(∑i αi yi K(xi,xt) + b)        (8)
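As a sketch of (8) in practice, the sum over support vectors can be evaluated by hand from a fitted scikit-learn model and compared with the model’s own prediction. The data and the gamma value below are made up; note that scikit-learn parameterizes the Gaussian kernel as exp(−γ∥xi−xj∥²), i.e. γ = 1/(2σ²).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # ring-shaped labels

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

x_t = np.array([0.3, -0.2])
K = np.exp(-gamma * ((clf.support_vectors_ - x_t) ** 2).sum(axis=1))
decision = clf.dual_coef_[0] @ K + clf.intercept_[0]      # the argument of sgn(...) in (8)
print(int(np.sign(decision)), clf.predict([x_t])[0])      # the same label twice
```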

Which functions are valid kernels to use in the kernel trick? That is, given K(xi,xj), does some feature map Φ exist such that K(xi,xj)=Φ(xi)⋅Φ(xj) for all i, j? Mercer’s condition states that a necessary and sufficient condition for K to be a valid kernel is that it is symmetric and positive semi-definite.†

Some popular kernels are:

polynomial:                          (xi⋅xj + c)^p
Gaussian radial basis function:      exp(−∥xi − xj∥² / 2σ²)

The optimal parameters for the degree of the polynomial p and spread of the Gaussian σ (as well as the regularization parameter) are determined by cross-validation. Computing the above kernels takes O(d) time, where d is the dimension of the input space, since we have to evaluate xi⋅xj in the polynomial kernel and xi–xj in the Gaussian kernel.
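A minimal cross-validation sketch, assuming scikit-learn (whose gamma plays the role of 1/(2σ²)) and synthetic ring data; the C and gamma grids below are arbitrary choices:

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=300, factor=0.3, noise=0.1, random_state=0)

# cross-validate over the regularization parameter C and the RBF spread gamma
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```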

Comparing runtimes of linear and nonlinear kernels

The computational complexity for classification/prediction, i.e. at test time, can be obtained by eyeballing (5) and (8). Let d be the dimension of the input space and n be the size of the training set, and assume the number of support vectors nS is some fraction of n, nS∼O(n).

In the case of working with the linear kernel/original input space, w can be explicitly evaluated to obtain the separating hyperplane parameters, so that classification in (5) takes O(d) time. On the other hand, with the kernel trick, the hyperplane parameters are not explicitly evaluated. Assume calculating a kernel takes O(d) time, cf. the polynomial and Gaussian kernels; then test time for a nonlinear K in (8) takes O(nd) time.

Estimating the computational complexity for training is complicated, so we defer the discussion to refs [4a, 4b] and simply state the result: training for linear kernels is O(nd), while training for nonlinear kernels using the Sequential Minimal Optimization algorithm is O(n²) to O(n³) (depending on the regularization parameter C), making nonlinear kernel SVMs impractical for larger datasets (more than a couple of tens of thousands of samples, according to scikit-learn).
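A rough way to see this difference on your own machine is to time the two solvers on the same data (a sketch only; absolute timings depend on hardware and library versions, and the data here is synthetic):

```python
import time
from sklearn.svm import SVC, LinearSVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

for name, clf in [("LinearSVC (LIBLINEAR, linear)", LinearSVC()),
                  ("SVC (LIBSVM/SMO, RBF kernel)", SVC(kernel="rbf"))]:
    t0 = time.perf_counter()
    clf.fit(X, y)
    print(f"{name}: {time.perf_counter() - t0:.2f} s")
```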

** More than one mapping and feature space (dimension) may exist for a particular kernel. See section 4 of ref [1] for examples.

† See ref [2] for a simple proof in terms of the Kernel (Gram) matrix, i.e. the kernel function evaluated on a finite set of points.


Discussion

We’ve glimpsed the elegant theory behind the construction of SVMs and seen how support vectors pop out of the mathematical machinery. Geometrically, the support vectors are the points lying on the margins of the decision boundary.

How about using SVMs in practice?

In his Coursera course, Professor Ng recommends linear and Gaussian kernels for most use cases. He also provides some rules of thumb (based on the current state of SVM algorithms) for different sample sizes n and input data dimension/number of features d, restated here:

case                                   | method                                                          | n         | d
n ≪ d, e.g. genomics, bioinformatics   | linear kernel SVM or logistic regression                        | 10–1,000  | 10,000
n intermediate, d small                | Gaussian kernel SVM                                             | 10–10,000 | 1–1,000
n ≫ d                                  | create features, then linear kernel SVM or logistic regression | 50,000+   | 1–1,000

The creators of the LIBSVM and LIBLINEAR packages also provide a user’s guide for novices, which includes a study of when to use linear instead of radial basis function kernels. They recommend linear SVMs when d and n are both large, as often encountered in document classification problems where bag-of-words approaches can generate huge numbers of features (in their example, n = 20,000 and d = 47,000).

The idea is that if the input data is already high-dimensional, then it shouldn’t be necessary to apply nonlinear transformations to it in order to obtain a separating hyperplane.

Tip: LIBLINEAR is specifically optimized for linear kernels and should be used instead of LIBSVM in the linear case.
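In scikit-learn terms, this tip maps onto two different estimators (a sketch of the choice, not a benchmark):

```python
from sklearn.svm import SVC, LinearSVC

# LinearSVC is backed by LIBLINEAR: linear kernel only, scales well with n
fast_linear = LinearSVC(C=1.0)

# SVC(kernel="linear") is backed by LIBSVM: general kernels, slower training for large n
libsvm_linear = SVC(kernel="linear", C=1.0)
```

Note that LinearSVC is not an exact drop-in replacement: by default it uses the squared hinge loss and also penalizes the intercept, so its solutions can differ slightly from SVC(kernel="linear").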

Further reading

In addition to the many excellent written tutorials on SVMs online, we highly recommend viewing lectures 14 and 15 of Yaser Abu-Mostafa’s MOOC, Learning from Data, which cover SVMs at about the same level as this post, with the considerable added benefit of Professor Abu-Mostafa’s explanations. He also discusses the generalization performance of SVMs as a function of the number of support vectors using VC theory (also see [1]).

There is a lot more theory on SVMs that we haven’t touched upon. For example, SVMs can be framed as a penalization method [3] or “regularization network”, cf. ridge regression, but with a hinge loss rather than squared error. Insights about the choice of a kernel have also been developed in that framework.



[1] Burges, C. J. C. (1998). A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 2(2), 121-167.

[2] Ng, A. Support Vector Machines [PDF document]. Retrieved from lecture notes online: http://cs229.stanford.edu/notes/cs229-notes3.pdf
Lecture notes by Andrew Ng for a more advanced class (but still in his signature intuitive style).

[3] Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning.
See Section 12.2.1, page 420, for the derivation of the dual Lagrangian for the non-separable case.

[4a] Bottou, L. and Lin, C.-J. (2006). Support Vector Machine Solvers.
[4b] Chang, C.-C. and Lin, C.-J. (2013). LIBSVM: A Library for Support Vector Machines.

[5] Ben-Hur, A. and Weston, J. (2009). A User’s Guide to Support Vector Machines. In Carugo, O. and Eisenhaber, F. (Eds.), Methods in Molecular Biology 609, 223-229.
