A Comparison of Several Classic Machine Learning Algorithms

This is a Q&A from Quora; I think the answer is excellent!

What are the advantages of different classification algorithms?

For instance, if we have a large training data set with more than about 10,000 instances and more than 100,000 features, which classifier would be the best choice for classifying the test data set?

--------------------------

Here are some general guidelines I've found over the years.

How large is your training set?

If your training set is small, high bias/low variance classifiers (e.g., Naive Bayes) have an advantage over low bias/high variance classifiers (e.g., kNN or logistic regression), since the latter will overfit. But low bias/high variance classifiers start to win out as your training set grows (they have lower asymptotic error), since high bias classifiers aren't powerful enough to provide accurate models.
(My understanding: NB is a high-bias/low-variance model, so it performs well on small training sets; but because the model itself has limited capacity (high bias), its performance does not keep improving as the data set grows.)

You can also think of this as a generative model vs. discriminative model distinction.
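
To make the trade-off concrete, here is a minimal sketch (assuming scikit-learn; the synthetic dataset and sample sizes are illustrative, not from the answer) comparing a high-bias Naive Bayes model against logistic regression as the training set grows:

```python
# Compare a high-bias model (Naive Bayes) with a lower-bias one (logistic
# regression) at several training-set sizes on the same held-out test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=20000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

for n in [50, 500, 5000, 10000]:
    nb = GaussianNB().fit(X_train[:n], y_train[:n])
    lr = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    print(f"n={n:>6}  NB={nb.score(X_test, y_test):.3f}  "
          f"LR={lr.score(X_test, y_test):.3f}")
```

On runs like this, NB tends to be competitive at the smallest sizes, while logistic regression typically pulls ahead as n grows, which is exactly the asymptotic-error argument above.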

Advantages of some particular algorithms

Advantages of Naive Bayes: Super simple, you're just doing a bunch of counts. If the NB conditional independence assumption actually holds, a Naive Bayes classifier will converge quicker than discriminative models like logistic regression, so you need less training data. And even if the NB assumption doesn't hold, a NB classifier still often performs surprisingly well in practice. A good bet if you want to do some kind of semi-supervised learning, or want something embarrassingly simple that performs pretty well.
(Pros: very simple model; converges quickly when the independence assumption holds; works well on small data sets; well suited to multi-class problems)

(Cons: relies on the conditional independence assumption; high bias, so classification performance improves only marginally as the data set grows)
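
As a minimal sketch of "just doing a bunch of counts" (the toy documents and labels below are made up for illustration), MultinomialNB on word counts is about as simple as a text classifier gets:

```python
# Training MultinomialNB amounts to per-class token counts plus smoothing.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["cheap pills buy now", "meeting at noon", "buy cheap meds", "lunch at noon?"]
labels = ["spam", "ham", "spam", "ham"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(docs, labels)
print(clf.predict(["buy now"]))  # counts of "buy"/"now" favour the spam class
```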

Advantages of Logistic Regression: Lots of ways to regularize your model, and you don't have to worry as much about your features being correlated, like you do in Naive Bayes. You also have a nice probabilistic interpretation, unlike decision trees or SVMs, and you can easily update your model to take in new data (using an online gradient descent method), again unlike decision trees or SVMs. Use it if you want a probabilistic framework (e.g., to easily adjust classification thresholds, to say when you're unsure, or to get confidence intervals) or if you expect to receive more training data in the future that you want to be able to quickly incorporate into your model.

(Pros: comes with a probabilistic interpretation; new data can easily be folded into the model (online updates))

(Cons: limited to linear decision boundaries)
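
A minimal sketch of the online-update property (assuming scikit-learn >= 1.1 for the "log_loss" name; the batch sizes are arbitrary): SGDClassifier with logistic loss can absorb new batches via partial_fit without retraining from scratch.

```python
# Logistic regression trained by online gradient descent: fit a first batch,
# then keep calling partial_fit as new data arrives.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, random_state=0)
clf = SGDClassifier(loss="log_loss", random_state=0)

# First batch: partial_fit needs the full label set up front.
clf.partial_fit(X[:500], y[:500], classes=np.unique(y))
# Later, new data arrives -- just keep calling partial_fit.
clf.partial_fit(X[500:], y[500:])
print(clf.predict_proba(X[:3]))  # probabilistic outputs come for free
```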

Advantages of Decision Trees: Easy to interpret and explain (for some people -- I'm not sure I fall into this camp). Non-parametric, so you don't have to worry about outliers or whether the data is linearly separable (e.g., decision trees easily take care of cases where you have class A at the low end of some feature x, class B in the mid-range of feature x, and A again at the high end). Their main disadvantage is that they easily overfit, but that's where ensemble methods like random forests (or boosted trees) come in. Plus, random forests are often the winner for lots of problems in classification (usually slightly ahead of SVMs, I believe), they're fast and scalable, and you don't have to worry about tuning a bunch of parameters like you do with SVMs, so they seem to be quite popular these days.

(Pros: easy to interpret; non-parametric; no need to worry about linear separability)

(Cons: prone to overfitting)
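
Here is a minimal sketch (toy one-dimensional data, invented for illustration) of the "class A, then B, then A again" pattern described above: a single tree carves the feature axis into intervals, and a random forest averages many such trees to curb the overfitting.

```python
# Class 1 only in the mid-range of x: trivial for trees, impossible for a
# single linear boundary.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

x = np.linspace(0, 3, 300).reshape(-1, 1)
y = np.where((x.ravel() > 1) & (x.ravel() < 2), 1, 0)

tree = DecisionTreeClassifier().fit(x, y)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(x, y)
print(tree.predict([[0.5], [1.5], [2.5]]))    # [0 1 0] -- handled easily
print(forest.predict([[0.5], [1.5], [2.5]]))  # same splits, less variance
```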

Advantages of SVMs: High accuracy, nice theoretical guarantees regarding overfitting, and with an appropriate kernel they can work well even if your data isn't linearly separable in the base feature space. Especially popular in text classification problems where very high-dimensional spaces are the norm. Memory-intensive and kind of annoying to run and tune, though, so I think random forests are starting to steal the crown.

(Pros: high accuracy; not prone to overfitting; can handle linearly non-separable problems)

(Cons: high memory requirements and tedious parameter tuning)
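
A minimal sketch of the kernel point (assuming scikit-learn; the concentric-circles data is a standard synthetic example, not from the answer): a linear SVM fails on data that is not linearly separable in the base feature space, while an RBF kernel handles it.

```python
# Concentric circles: no linear separator exists in the raw 2-D space.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)      # kernel and C are the knobs to tune
print("linear:", linear.score(X, y))   # roughly chance level (~0.5)
print("rbf:   ", rbf.score(X, y))      # near-perfect on this toy set
```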

KNN:

(Pros: conceptually simple; handles linearly non-separable data; insensitive to outliers)

(Cons: both the time and space complexity of classifying a single instance are high)
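
A minimal sketch of that cost profile (assuming scikit-learn; the sizes are arbitrary): kNN's "training" just stores and indexes the data, so fitting is cheap, but every prediction must search the stored set.

```python
# Lazy learning: fast fit, slow predict -- the complexity noted above.
import time
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=50000, n_features=20, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5)

t0 = time.time(); knn.fit(X, y)
print(f"fit:     {time.time() - t0:.3f}s")   # cheap: just stores/indexes data
t0 = time.time(); knn.predict(X[:1000])
print(f"predict: {time.time() - t0:.3f}s")   # neighbor search per query
```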

To go back to the particular question of logistic regression vs. decision trees (which I'll assume to be a question of logistic regression vs. random forests) and summarize a bit: both are fast and scalable, random forests tend to beat out logistic regression in terms of accuracy, but logistic regression can be updated online and gives you useful probabilities. And since you're at Square (not quite sure what an inference scientist is, other than the embodiment of fun) and possibly working on fraud detection: having probabilities associated to each classification might be useful if you want to quickly adjust thresholds to change false positive/false negative rates, and regardless of the algorithm you choose, if your classes are heavily imbalanced (as often happens with fraud), you should probably resample the classes or adjust your error metrics to make the classes more equal.
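
A minimal sketch of both suggestions (assuming scikit-learn; the imbalance ratio and thresholds are illustrative): use class weights to counter the imbalance, and move the probability threshold to trade false positives against false negatives.

```python
# Imbalanced classes: weight the rare class, then tune the decision threshold.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=10000, weights=[0.98], random_state=0)
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

proba = clf.predict_proba(X)[:, 1]
for threshold in (0.5, 0.2):            # lower threshold catches more positives
    pred = (proba >= threshold).astype(int)
    print(threshold, confusion_matrix(y, pred).ravel())  # tn, fp, fn, tp
```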

But...

Recall, though, that better data often beats better algorithms, and designing good features goes a long way. And if you have a huge dataset, your choice of classification algorithm might not really matter so much in terms of classification performance (so choose your algorithm based on speed or ease of use instead).
 
And if you really care about accuracy, you should definitely try a bunch of different classifiers and select the best one by cross-validation. Or, to take a lesson from the Netflix Prize and Middle Earth, just use an ensemble method to choose them all!
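
A minimal sketch of that closing advice (assuming scikit-learn; the model lineup is arbitrary): compare a few classifiers by cross-validation, then "choose them all" with a simple voting ensemble.

```python
# Score several classifiers with 5-fold CV, then combine them by voting.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

X, y = make_classification(n_samples=2000, random_state=0)
models = {
    "nb": GaussianNB(),
    "lr": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())

# ...or take the Netflix Prize lesson and ensemble them all.
ensemble = VotingClassifier(list(models.items()))
print("vote", cross_val_score(ensemble, X, y, cv=5).mean())
```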
