A question and answer from Quora; I think the answer is excellent!
What are the advantages of different classification algorithms?
For instance, if we have a large training data set with more than 10,000 instances and more than 100,000 features, which classifier would be the best choice for classifying the test data set?
--------------------------
Here are some general guidelines I've found over the years.
How large is your training set?
If your training set is small, high bias/low variance classifiers (e.g., Naive Bayes) have an advantage over low bias/high variance classifiers (e.g., kNN or logistic regression), since the latter will overfit. But low bias/high variance classifiers start to win out as your training set grows (they have lower asymptotic error), since high bias classifiers aren't powerful enough to provide accurate models.
(My understanding: NB is a high bias/low variance model, so it can do very well on a small training set. But because the model itself has limited classification power (high bias), its accuracy does not keep improving as the dataset grows.)
You can also think of this as a generative model vs. discriminative model distinction.
Advantages of some particular algorithms
Advantages of Naive Bayes: Super simple, you're just doing a bunch of counts. If the NB conditional independence assumption actually holds, a Naive Bayes classifier will converge more quickly than discriminative models like logistic regression, so you need less training data. And even if the NB assumption doesn't hold, a NB classifier still often performs surprisingly well in practice. A good bet if you want to do some kind of semi-supervised learning, or want something embarrassingly simple that performs pretty well.
(Advantages: simple model; converges quickly when the independence assumption holds; works well on small datasets; suitable for multi-class problems.)
(Disadvantages: relies on the independence assumption; high bias, so accuracy improves only a little as the dataset grows.)
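To make the "bunch of counts" remark concrete, here is a minimal sketch of a categorical Naive Bayes classifier in plain Python. It is not from the original answer: the function names, the toy weather-style data, and the Laplace smoothing constant alpha are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Collect the counts a categorical Naive Bayes model needs."""
    class_counts = Counter(labels)
    # feature_counts[(feature_index, label)][value] -> number of occurrences
    feature_counts = defaultdict(Counter)
    values_seen = defaultdict(set)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, feature_counts, values_seen

def predict_nb(model, row, alpha=1.0):
    """Return the label with the highest log-posterior (Laplace smoothing alpha)."""
    class_counts, feature_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in class_counts.items():
        score = math.log(n_label / total)  # log prior
        for i, value in enumerate(row):
            count = feature_counts[(i, label)][value]
            # Smoothed per-feature likelihood, multiplied in under the
            # conditional independence assumption (added in log space).
            score += math.log((count + alpha) / (n_label + alpha * len(values_seen[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data, made up for illustration: (outlook, windy) -> play?
rows = [("sunny", "no"), ("sunny", "yes"), ("rainy", "yes"), ("rainy", "no"), ("sunny", "no")]
labels = ["yes", "yes", "no", "no", "yes"]
model = train_nb(rows, labels)
prediction = predict_nb(model, ("sunny", "yes"))
```

Training really is just counting; all the arithmetic happens at prediction time, which is part of why the model is so cheap to fit.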
Advantages of Logistic Regression: Lots of ways to regularize your model, and you don't have to worry as much about your features being correlated, like you do in Naive Bayes. You also have a nice probabilistic interpretation, unlike decision trees or SVMs, and you can easily update your model to take in new data (using an online gradient descent method), again unlike decision trees or SVMs. Use it if you want a probabilistic framework (e.g., to easily adjust classification thresholds, to say when you're unsure, or to get confidence intervals) or if you expect to receive more training data in the future that you want to be able to quickly incorporate into your model.
(Advantages: has a probabilistic interpretation; new data can easily be folded into the model (online update).)
(Disadvantage: the decision boundary is linear in the features, so it only suits problems that are close to linearly separable.)
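The online-update point can be sketched as follows. This is a minimal, assumed implementation of logistic regression trained by stochastic gradient descent, one example at a time; the class name, learning rate, optional L2 penalty, and the tiny one-feature stream are hypothetical choices, not something the original answer specifies.

```python
import math

def sigmoid(z):
    # Clamp z so math.exp cannot overflow on extreme inputs.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

class OnlineLogisticRegression:
    def __init__(self, n_features, learning_rate=0.1, l2=0.0):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate
        self.l2 = l2  # strength of the L2 regularization penalty

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return sigmoid(z)

    def update(self, x, y):
        """One SGD step on a single example; y is 0 or 1."""
        error = self.predict_proba(x) - y  # gradient of log loss w.r.t. z
        for i, xi in enumerate(x):
            grad = error * xi + self.l2 * self.weights[i]
            self.weights[i] -= self.learning_rate * grad
        self.bias -= self.learning_rate * error

# Toy stream, made up for illustration: small x -> class 0, large x -> class 1.
model = OnlineLogisticRegression(n_features=1)
stream = [([0.0], 0), ([1.0], 0), ([3.0], 1), ([4.0], 1)]
for _ in range(200):
    for x, y in stream:
        model.update(x, y)

p_low = model.predict_proba([0.5])
p_high = model.predict_proba([3.5])
```

Because each call to update touches only one example, a new observation can be absorbed later with one more call, without refitting on the old data.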
Advantages of Decision Trees: Easy to interpret and explain (for some people -- I'm not sure I fall into this camp). Non-parametric, so you don't have to worry about outliers or whether the data is linearly separable (e.g., decision trees easily take care of cases where you have class A at the low end of some feature x, class B in the mid-range of feature x, and A again at the high end). Their main disadvantage is that they easily overfit, but that's where ensemble methods like random forests (or boosted trees) come in. Plus, random forests are often the winner for lots of problems in classification (usually slightly ahead of SVMs, I believe), they're fast and scalable, and you don't have to worry about tuning a bunch of parameters like you do with SVMs, so they seem to be quite popular these days.
(Advantages: easy to interpret; non-parametric; no need to consider whether the data is linearly separable.)
(Disadvantage: prone to overfitting.)
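The A / B / A example on a single feature x can be reproduced with a very small tree. The sketch below is an assumed, simplified implementation (Gini impurity, midpoint thresholds, one feature only), with made-up data; real libraries are far more complete.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(points, depth=0, max_depth=3):
    """points is a list of (x, label) pairs on a single feature x."""
    labels = [label for _, label in points]
    if depth == max_depth or len(set(labels)) == 1:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    xs = sorted({x for x, _ in points})
    best = None
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [p for p in points if p[0] <= threshold]
        right = [p for p in points if p[0] > threshold]
        # Size-weighted Gini impurity of the two children.
        score = (len(left) * gini([l for _, l in left])
                 + len(right) * gini([l for _, l in right])) / len(points)
        if best is None or score < best[0]:
            best = (score, threshold, left, right)
    if best is None:  # all x values identical, so no split is possible
        return Counter(labels).most_common(1)[0][0]
    _, threshold, left, right = best
    return (threshold,
            build_tree(left, depth + 1, max_depth),
            build_tree(right, depth + 1, max_depth))

def predict(tree, x):
    # Internal nodes are (threshold, left, right) tuples; leaves are labels.
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree

# Class A at the low end, B in the middle, A again at the high end.
data = [(1, "A"), (2, "A"), (4, "B"), (5, "B"), (6, "B"), (8, "A"), (9, "A")]
tree = build_tree(data)
```

Two threshold splits are enough to carve out the middle band, something no single linear boundary on x can do.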
Advantages of SVMs: High accuracy, nice theoretical guarantees regarding overfitting, and with an appropriate kernel they can work well even if your data isn't linearly separable in the base feature space. Especially popular in text classification problems where very high-dimensional spaces are the norm. Memory-intensive and kind of annoying to run and tune, though, so I think random forests are starting to steal the crown.
(Advantages: high accuracy; less prone to overfitting; can handle problems that are not linearly separable.)
(Disadvantages: large memory requirements and tedious parameter tuning.)
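The idea behind "an appropriate kernel" can be illustrated without an SVM at all. The sketch below is not an SVM: it swaps in a plain perceptron and an explicit feature map (appending the product x1 * x2), which is roughly what a degree-2 polynomial kernel does implicitly. The XOR data and the feature map are assumptions chosen for illustration.

```python
def train_perceptron(samples, epochs=100):
    """samples is a list of (features, label) with label in {-1, +1}."""
    n = len(samples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in samples:
            score = b + sum(wi * xi for wi, xi in zip(w, x))
            if y * score <= 0:  # misclassified or on the boundary: nudge the plane
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: converged
            break
    return w, b

def accuracy(model, samples):
    w, b = model
    hits = sum(1 for x, y in samples
               if y * (b + sum(wi * xi for wi, xi in zip(w, x))) > 0)
    return hits / len(samples)

# XOR with inputs in {-1, +1}: the label is +1 when the two inputs differ.
xor = [((1, 1), -1), ((1, -1), 1), ((-1, 1), 1), ((-1, -1), -1)]
# Hypothetical feature map for illustration: append the product x1 * x2.
mapped = [((x1, x2, x1 * x2), y) for (x1, x2), y in xor]

raw_accuracy = accuracy(train_perceptron(xor), xor)
mapped_accuracy = accuracy(train_perceptron(mapped), mapped)
```

In the base feature space no line separates the four points, so the linear model can never get them all right; in the mapped space a single plane does. A kernel SVM gets the same effect without ever building the mapped features explicitly.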
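KNN:
(Advantages: simple idea; can handle data that is not linearly separable; insensitive to outliers.)
(Disadvantages: classifying a single point is expensive in both time and space.)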
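A brute-force sketch shows where the kNN cost comes from: the "model" is the whole training set, and every prediction scans it. The function name, the choice of k = 3, and the ring-shaped toy data are assumptions for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train is a list of (point, label); every query scans all stored points."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy data, made up for illustration: class "in" near the origin,
# class "out" on a ring around it (not linearly separable).
train = [
    ((0.0, 0.0), "in"), ((0.5, 0.0), "in"), ((0.0, 0.5), "in"),
    ((3.0, 0.0), "out"), ((2.1, 2.1), "out"), ((0.0, 3.0), "out"),
    ((-2.1, 2.1), "out"), ((-3.0, 0.0), "out"), ((-2.1, -2.1), "out"),
    ((0.0, -3.0), "out"), ((2.1, -2.1), "out"),
]

inner = knn_predict(train, (0.2, 0.2))
outer = knn_predict(train, (3.0, 1.0))
```

Nothing is learned up front, so memory grows with the training set and each query costs a full pass (plus a sort here) over it.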
To go back to the particular question of logistic regression vs. decision trees (which I'll assume to be a question of logistic regression vs. random forests) and summarize a bit: both are fast and scalable, random forests tend to beat out logistic regression in terms of accuracy, but logistic regression can be updated online and gives you useful probabilities. And since you're at Square (not quite sure what an inference scientist is, other than the embodiment of fun) and possibly working on fraud detection: having probabilities associated with each classification might be useful if you want to quickly adjust thresholds to change false positive/false negative rates. Regardless of the algorithm you choose, if your classes are heavily imbalanced (as often happens with fraud), you should probably resample the classes or adjust your error metrics to make the classes more equal.
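As a rough illustration of the threshold point, the sketch below counts false positives and false negatives at a few cutoffs. The scores and labels are made up, and the function name is hypothetical; in practice the probabilities would come from whatever classifier you trained.

```python
def confusion_at(threshold, scored):
    """scored is a list of (probability_of_fraud, is_fraud) pairs."""
    # Flagged as fraud but actually legitimate.
    fp = sum(1 for p, y in scored if p >= threshold and y == 0)
    # Let through but actually fraud.
    fn = sum(1 for p, y in scored if p < threshold and y == 1)
    return fp, fn

# Hypothetical model outputs: mostly legitimate (0), a few fraud cases (1).
scored = [(0.05, 0), (0.10, 0), (0.20, 0), (0.35, 0), (0.40, 1),
          (0.55, 0), (0.70, 1), (0.90, 1)]

for threshold in (0.3, 0.5, 0.8):
    fp, fn = confusion_at(threshold, scored)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

Lowering the cutoff catches more fraud at the price of more false alarms, and raising it does the opposite. That trade-off is only adjustable after the fact if the classifier gives you a score rather than a hard label.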
But...
Recall, though, that better data often beats better algorithms, and designing good features goes a long way. And if you have a huge dataset, your choice of classification algorithm might not really matter so much in terms of classification performance (so choose your algorithm based on speed or ease of use instead).
And if you really care about accuracy, you should definitely try a bunch of different classifiers and select the best one by cross-validation. Or, to take a lesson from the Netflix Prize and Middle Earth, just use an ensemble method to choose them all!
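Selecting a classifier by cross-validation can be sketched in a few lines. Everything here is an assumption for illustration: the two stand-in candidates (a majority-class baseline and a 1-nearest-neighbor rule), the interleaved fold split, and the toy A / B / A data on one feature.

```python
from collections import Counter

def k_fold_accuracy(fit, data, k=5):
    """Average held-out accuracy; fit(train) must return a predict(x) function."""
    folds = [data[i::k] for i in range(k)]  # interleaved split into k folds
    scores = []
    for i, held_out in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        predict = fit(train)
        correct = sum(1 for x, y in held_out if predict(x) == y)
        scores.append(correct / len(held_out))
    return sum(scores) / k

def fit_majority(train):
    # Baseline: always predict the most common training label.
    majority = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: majority

def fit_nearest_neighbor(train):
    # 1-NN on a single numeric feature.
    return lambda x: min(train, key=lambda row: abs(row[0] - x))[1]

# Toy data: class B in the middle of the range, class A at both ends.
data = [(x, "B" if 10 <= x < 20 else "A") for x in range(30)]

candidates = {"majority": fit_majority, "1-nn": fit_nearest_neighbor}
scores = {name: k_fold_accuracy(fit, data) for name, fit in candidates.items()}
best = max(scores, key=scores.get)
```

The same loop works for any set of candidates that can be wrapped as fit(train) returning a predict function, which is all "try a bunch and pick the best by cross-validation" amounts to.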