Reference: http://scikit-learn.org/stable/modules/svm.html
In real projects we rarely end up using the simpler models such as LR, kNN or NB; they are classics, but not that practical in engineering work.
Today we focus on SVM, which gets used relatively often in practice.
SVM covers a lot of ground: Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outliers detection.
The advantages are plenty: effective in high-dimensional spaces; still effective when the number of dimensions is greater than the number of samples; uses only a subset of the training points (called support vectors), so it is memory efficient; and there are different kernel functions to choose from.
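As a quick illustration of the kernel choice, here is a minimal sketch (the toy data and the class labels are mine, not from the scikit-learn docs); the kernel function is selected with the kernel parameter, and 'rbf' is the default:

# a minimal sketch: trying a few built-in kernels on tiny toy data
from sklearn import svm

X = [[0, 0], [1, 1], [2, 0], [0, 2]]
y = [0, 1, 1, 0]

for kernel in ('linear', 'poly', 'rbf'):
    clf = svm.SVC(kernel=kernel)   # same estimator, different kernel function
    clf.fit(X, y)
    print(kernel, clf.predict([[1, 0]]))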
There are also drawbacks: although it can handle more dimensions than samples, if the number of features is much greater than the number of samples the results can be very poor; and SVMs do not directly provide probability estimates, these are calculated using an expensive five-fold cross-validation (see Scores and probabilities, below).
(SVMs support both dense and sparse sample vectors, but if you will predict on sparse data, you must train on sparse data as well. For optimal performance, use a C-ordered numpy.ndarray (dense) or a scipy.sparse.csr_matrix (sparse) with dtype=float64.)
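A minimal sketch of those recommended input formats (the data is illustrative); note that the model is both fitted and queried with CSR matrices:

import numpy as np
from scipy import sparse
from sklearn import svm

# dense input: a C-ordered float64 ndarray
X_dense = np.array([[0., 0.], [1., 1.]], dtype=np.float64, order='C')
y = [0, 1]

# sparse input: the same data as a CSR matrix
X_sparse = sparse.csr_matrix(X_dense)

clf = svm.SVC()
clf.fit(X_sparse, y)                                # train on sparse data ...
print(clf.predict(sparse.csr_matrix([[2., 2.]])))   # ... and predict on sparse data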
1. Classification
SVC, NuSVC and LinearSVC are three models capable of multi-class classification. The essential difference between the three is that they have different mathematical formulations; see the formulas at the end of this post.
Like any other classifier, SVC, NuSVC and LinearSVC are used through the fit and predict methods:
>>> from sklearn import svm
>>> X = [[0, 0], [1, 1]]
>>> y = [0, 1]
>>> clf = svm.SVC()
>>> clf.fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)
After being fitted, the model can then be used to predict new values:
>>> clf.predict([[2., 2.]])
array([1])
The properties of the support vectors can be retrieved through the attributes support_vectors_, support_ and n_support_:
>>> # get support vectors
>>> clf.support_vectors_
array([[ 0.,  0.],
       [ 1.,  1.]])
>>> # get indices of support vectors
>>> clf.support_
array([0, 1]...)
>>> # get number of support vectors for each class
>>> clf.n_support_
array([1, 1]...)
For multi-class classification:
SVC and NuSVC use the "one-against-one" scheme (training n_class * (n_class - 1) / 2 models), while LinearSVC uses the "one-vs-the-rest" strategy (training n_class models). In practice one-vs-rest is the common and preferred choice: the results are about the same, but it saves a lot of time.
>>> X = [[0], [1], [2], [3]]
>>> Y = [0, 1, 2, 3]
>>> clf = svm.SVC()
>>> clf.fit(X, Y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
    gamma=0.0, kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)
>>> dec = clf.decision_function([[1]])
>>> dec.shape[1] # 4 classes: 4*3/2 = 6
6
>>> lin_clf = svm.LinearSVC()
>>> lin_clf.fit(X, Y)
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)
>>> dec = lin_clf.decision_function([[1]])
>>> dec.shape[1]
4
Regarding the confidence that a sample belongs to a class: the SVC method decision_function gives per-class scores for each sample. There is also the probability option, but if confidence scores are required and they do not have to be probabilities, it is advisable to set probability=False and use decision_function instead of predict_proba (mainly because the probability estimates rest on shaky theoretical ground).
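To make the trade-off concrete, here is a minimal sketch (toy data of my own) comparing decision_function with predict_proba; probability=True is what triggers the expensive internal five-fold cross-validation mentioned above:

from sklearn import svm

X = [[0, 0], [0, 1], [1, 0], [1, 1], [3, 3], [3, 4], [4, 3], [4, 4]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

clf = svm.SVC(probability=True)   # enables the costly Platt-scaling fit
clf.fit(X, y)

print(clf.decision_function([[2., 2.]]))  # confidence score (a single value in the binary case)
print(clf.predict_proba([[2., 2.]]))      # probability estimates; may disagree with predict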
When individual classes or samples should carry different weights, you can use the keywords class_weight and sample_weight:
Class weights: SVC (but not NuSVC) implements a class_weight keyword, set when the estimator is constructed. It's a dictionary of the form {class_label : value}, where value is a floating point number > 0 that sets the parameter C of class class_label to C * value.
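A minimal sketch (my own toy data): class 1 gets ten times the weight, so its effective penalty becomes C * 10:

from sklearn import svm

X = [[0, 0], [0, 1], [2, 2], [2, 3]]
y = [0, 0, 1, 1]

# class_weight={1: 10} multiplies C by 10 for class 1 only
wclf = svm.SVC(class_weight={1: 10})
wclf.fit(X, y)
print(wclf.predict([[1, 1]]))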
Sample weights: SVC, NuSVC, SVR, NuSVR and OneClassSVM also implement weights for individual samples, via the sample_weight keyword of the fit method. Similar to class_weight, this sets the parameter C for the i-th example to C * sample_weight[i].
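And the per-sample version, again as a minimal sketch with made-up data:

from sklearn import svm

X = [[0, 0], [0, 1], [2, 2], [2, 3]]
y = [0, 0, 1, 1]
sample_weight = [1, 1, 10, 1]          # the third sample effectively gets C * 10

clf = svm.SVC()
clf.fit(X, y, sample_weight=sample_weight)
print(clf.predict([[1, 1]]))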
Finally, a few examples:
- Plot different SVM classifiers in the iris dataset
- SVM: Maximum margin separating hyperplane
- SVM: Separating hyperplane for unbalanced classes
- SVM-Anova: SVM with univariate feature selection
- Non-linear SVM
- SVM: Weighted samples
2. Regression
Support Vector Regression.
See whether this sentence makes sense to you: Analogously (to SVC), the model produced by Support Vector Regression depends only on a subset of the training data, because the cost function for building the model ignores any training data close to the model prediction.
Again there are three models: SVR, NuSVR and LinearSVR.
>>> from sklearn import svm
>>> X = [[0, 0], [2, 2]]
>>> y = [0.5, 2.5]
>>> clf = svm.SVR()
>>> clf.fit(X, y)
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma=0.0,
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
>>> clf.predict([[1, 1]])
array([ 1.5])
Here is an example:
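This is a minimal sketch with my own synthetic data (not the example figure from the scikit-learn gallery); it shows the role of epsilon: training points that fall inside the epsilon tube around the prediction are ignored by the cost function, so a larger epsilon usually leaves fewer support vectors:

import numpy as np
from sklearn import svm

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 40))[:, None]      # 40 points on a line
y = np.sin(X).ravel() + 0.1 * rng.randn(40)      # noisy sine curve

for eps in (0.01, 0.1, 0.5):
    reg = svm.SVR(kernel='rbf', epsilon=eps).fit(X, y)
    print('epsilon=%.2f -> %d support vectors' % (eps, len(reg.support_vectors_)))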
To be continued...