I wanted to learn SVMs, so I found LIBSVM--A Library for Support Vector Machines, and started by reading the site's "A Practical Guide to Support Vector Classification".
Below I write down what I consider the key points.
SVM is: a technique for data classification.
Goal is: to produce a model (based on the training data) which predicts the target values of the test data given only the test data attributes.
Kernels: four basic kernels (linear, polynomial, RBF, sigmoid).
Proposed Procedure:
1. Transform data to the format of an SVM package
First, convert categorical attributes into numeric data. The guide recommends using m numbers to represent an m-category attribute, where exactly one of the m numbers is one and the others are zero. For example, {red, green, blue} can be represented as (0,0,1), (0,1,0), and (1,0,0).
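A minimal sketch of this "m numbers for an m-category attribute" encoding, written in plain Python (the category set and the ordering of positions are illustrative):

```python
categories = ["red", "green", "blue"]

def one_hot(value, categories=categories):
    # exactly one position is 1, all others are 0; which position is 1 is arbitrary
    return [1 if c == value else 0 for c in categories]

print(one_hot("red"))    # [1, 0, 0]
print(one_hot("green"))  # [0, 1, 0]
print(one_hot("blue"))   # [0, 0, 1]
```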
2. Conduct simple scaling on the data
Note: it is important to use the same scaling factors for the training and testing sets.
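One way to picture "same scaling factors": fit the scaler on the training set only, then reuse it on the test set. The sketch below uses scikit-learn's MinMaxScaler (the guide suggests scaling each attribute to [-1, +1] or [0, 1]); the array contents are made up:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test  = np.array([[1.5, 250.0]])

scaler = MinMaxScaler(feature_range=(-1, 1))
X_train_scaled = scaler.fit_transform(X_train)  # scaling factors learned from training data
X_test_scaled  = scaler.transform(X_test)       # the SAME factors applied to the test data
```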
3. Consider the RBF kernel K(x, y) = exp(-γ‖x-y‖²), γ > 0
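A tiny sketch of the kernel value itself, computed with NumPy (gamma and the vectors are arbitrary):

```python
import numpy as np

def rbf_kernel(x, y, gamma):
    # K(x, y) = exp(-gamma * ||x - y||^2)
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 2.0])
y = np.array([2.0, 0.0])
print(rbf_kernel(x, y, gamma=0.5))  # ||x - y||^2 = 5, so exp(-2.5) ≈ 0.0821
```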
4. Use cross-validation to find the best parameters C and γ
The cross-validation procedure can prevent the overfitting problem. The guide recommends a "grid search" on C and γ using cross-validation. Various pairs of (C, γ) values are tried, and the one with the best cross-validation accuracy is picked. Use a coarse grid first; after identifying a better region on the grid, a finer grid search on that region can be conducted.
For very large data sets, a feasible approach is to randomly choose a subset of the data set, conduct a grid search on it, and then do a better-region-only grid search on the complete data set.
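A sketch of the coarse grid search over (C, γ) with cross-validation, using scikit-learn's GridSearchCV and SVC (SVC is a LIBSVM wrapper). The exponential grids follow the guide's suggestion of trying powers of 2; the data set is just a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# coarse grid, e.g. C = 2^-5 ... 2^15 and gamma = 2^-15 ... 2^3 (subsampled here for speed)
coarse_grid = {
    "C":     [2.0 ** k for k in range(-5, 16, 4)],
    "gamma": [2.0 ** k for k in range(-15, 4, 4)],
}
search = GridSearchCV(SVC(kernel="rbf"), coarse_grid, cv=5)
search.fit(X, y)
print("coarse best:", search.best_params_, search.best_score_)

# a finer grid would then be centered around search.best_params_
```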
5. Use the best parameters C and γ to train the whole training set
6. Test
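Putting steps 5 and 6 together, a sketch using scikit-learn's SVC (a LIBSVM wrapper); the train/test split and the "best" parameter values are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

best_C, best_gamma = 8.0, 0.125                        # pretend these came from the grid search
model = SVC(kernel="rbf", C=best_C, gamma=best_gamma)
model.fit(X_train, y_train)                            # step 5: train on the whole training set
y_pred = model.predict(X_test)                         # step 6: test
print("test accuracy:", accuracy_score(y_test, y_pred))
```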
When to use the Linear but not the RBF Kernel?
If the number of features is large, one may not need to map data to a higher dimensional space. That is, the nonlinear mapping does not improve the performance. Using the linear kernel is good enough, and one only searches for the parameter C.
C.1 Number of instances ≪ number of features
When the number of features is very large, one may not need to map the data.
C.2 Both numbers of instances and features are large
Such data often occur in document classification. LIBLINEAR is much faster than LIBSVM at obtaining a model with comparable accuracy, and it is efficient for large-scale document classification.
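A sketch of this linear-kernel case: for high-dimensional sparse data such as bag-of-words documents, scikit-learn's LinearSVC (built on LIBLINEAR) is the usual choice, and only C needs to be tuned. The tiny corpus and labels below are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs   = ["free money now", "meeting at noon", "win money fast", "project meeting notes"]
labels = [1, 0, 1, 0]                  # 1 = spam, 0 = not spam

vec = TfidfVectorizer()
X = vec.fit_transform(docs)            # large, sparse feature matrix
clf = LinearSVC(C=1.0)                 # linear model: search only over C
clf.fit(X, labels)
print(clf.predict(vec.transform(["free fast money"])))
```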
C.3 Number of instances ≫ number of features
As the number of features is small, one often maps data to higher dimensional spaces (i.e., using nonlinear kernels).