A Machine Learning Algorithm a Day: AdaBoost

I found a good article online and pasted it here, adding some supplementary notes and my own understanding to make this post.

My education in the fundamentals of machine learning has mainly come from Andrew Ng’s excellent Coursera course on the topic. One thing that wasn’t covered in that course, though, was the topic of “boosting” which I’ve come across in a number of different contexts now. Fortunately, it’s a relatively straightforward topic if you’re already familiar with machine learning classification.

Whenever I’ve read about something that uses boosting, it’s always been with the “AdaBoost” algorithm, so that’s what this post covers.

AdaBoost is a popular boosting technique which helps you combine multiple “weak classifiers” into a single “strong classifier”. A weak classifier is simply a classifier that performs poorly, but performs better than random guessing. A simple example might be classifying a person as male or female based on their height. You could say anyone over 5′ 9″ is a male and anyone under that is a female. You’ll misclassify a lot of people that way, but your accuracy will still be greater than 50%.
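To make the example concrete, here is a minimal sketch of that height rule as a decision stump in Python. The 69-inch threshold matches the 5′ 9″ cutoff above, and the toy data points are purely illustrative, not real measurements.

```python
def height_stump(height_inches, threshold=69.0):
    """Weak classifier: predict +1 (male) above the threshold, -1 (female) otherwise."""
    return 1 if height_inches > threshold else -1

# Toy data: (height in inches, true label), purely illustrative numbers.
samples = [(72, 1), (66, -1), (67, 1), (71, -1), (64, -1)]
correct = sum(1 for h, y in samples if height_stump(h) == y)
print(f"accuracy = {correct / len(samples):.0%}")  # 60%: better than chance, far from perfect
```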

AdaBoost can be applied to any classification algorithm, so it’s really a technique that builds on top of other classifiers as opposed to being a classifier itself.

You could just train a bunch of weak classifiers on your own and combine the results, so what does AdaBoost do for you? There are really two things it figures out for you:
1. It helps you choose the training set for each new classifier that you train based on the results of the previous classifier.
2. It determines how much weight should be given to each classifier’s proposed answer when combining the results.

Training Set Selection

Each weak classifier should be trained on a random subset of the total training set. The subsets can overlap–it’s not the same as, for example, dividing the training set into ten portions. AdaBoost assigns a “weight” to each training example, which determines the probability that each example should appear in the training set. Examples with higher weights are more likely to be included in the training set, and vice versa. After training a classifier, AdaBoost increases the weight on the misclassified examples so that these examples will make up a larger part of the next classifier’s training set, and hopefully the next classifier trained will perform better on them.
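Here is a minimal sketch of that weighted selection step, assuming NumPy; the weight values are hypothetical stand-ins for what AdaBoost's update step would produce.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical example weights after a round of boosting (they sum to 1);
# examples 2 and 4 were misclassified, so their weights were increased.
weights = np.array([0.05, 0.05, 0.40, 0.10, 0.40])

# Draw the training subset for the next weak classifier: higher-weight examples
# are more likely to be chosen, and the same example can be drawn more than once.
subset_indices = rng.choice(len(weights), size=5, replace=True, p=weights)
print(subset_indices)
```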

The equation for this weight update step is detailed later on.

Classifier Output Weights

After each classifier is trained, the classifier’s weight is calculated based on its accuracy. More accurate classifiers are given more weight. A classifier with 50% accuracy is given a weight of zero, and a classifier with less than 50% accuracy (kind of a funny concept) is given negative weight.

Formal Definition

To learn about AdaBoost, I read through a tutorial written by one of the original authors of the algorithm, Robert Schapire. The tutorial is available here.

Below, I’ve tried to offer some intuition into the relevant equations.

Let’s look first at the equation for the final classifier.
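The equation image from the original post is not reproduced here; in Schapire's notation, the final (strong) classifier is

H(x) = \operatorname{sign}\!\left( \sum_{t=1}^{T} \alpha_t \, h_t(x) \right)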

The final classifier consists of ‘T’ weak classifiers. h_t(x) is the output of weak classifier ‘t’ (in this paper, the outputs are limited to -1 or +1). Alpha_t is the weight applied to classifier ‘t’ as determined by AdaBoost. So the final output is just a linear combination of all of the weak classifiers, and then we make our final decision simply by looking at the sign of this sum.

The classifiers are trained one at a time. After each classifier is trained, we update the probabilities of each of the training examples appearing in the training set for the next classifier.

The first classifier (t = 1) is trained with equal probability given to all training examples. After it’s trained, we compute the output weight (alpha) for that classifier.

The output weight, alpha_t, is fairly straightforward. It’s based on the classifier’s error rate, ‘e_t’. e_t is just the number of misclassifications over the training set divided by the training set size.
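The formula image is not reproduced here; the standard definition from Schapire's tutorial is

\alpha_t = \frac{1}{2} \ln\!\left( \frac{1 - e_t}{e_t} \right)

In the weighted formulation of the paper, e_t is the sum of the weights D_t(i) over the misclassified examples; when each classifier is trained on a resampled set as described above, the plain misclassification rate plays the same role.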

Here’s a plot of what alpha_t will look like for classifiers with different error rates.

There are three bits of intuition to take from this graph (a quick numeric check follows the list below):

  1. The classifier weight grows exponentially as the error approaches 0. Better classifiers are given exponentially more weight.
  2. The classifier weight is zero if the error rate is 0.5. A classifier with 50% accuracy is no better than random guessing, so we ignore it.
  3. The classifier weight grows exponentially negative as the error approaches 1. We give a negative weight to classifiers with worse than 50% accuracy. “Whatever that classifier says, do the opposite!”.
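Since the plot itself is not included, a few sample values make the same three points; this is a minimal check using the alpha formula given above.

```python
import math

def classifier_weight(error_rate):
    """AdaBoost output weight: alpha = 0.5 * ln((1 - e) / e)."""
    return 0.5 * math.log((1.0 - error_rate) / error_rate)

for e in (0.01, 0.25, 0.5, 0.75, 0.99):
    print(f"error = {e:.2f}  ->  alpha = {classifier_weight(e):+.3f}")
```

The output is symmetric around an error rate of 0.5, where alpha is exactly zero, and it grows without bound in magnitude as the error approaches 0 or 1.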

After computing the alpha for the first classifier, we update the training example weights using the following formula.
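The formula image is not reproduced here; from Schapire's tutorial, the update for each example i is

D_{t+1}(i) = \frac{ D_t(i) \, \exp\!\big( -\alpha_t \, y_i \, h_t(x_i) \big) }{ Z_t }

where Z_t is the normalization factor discussed below.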

The variable D_t is a vector of weights, with one weight for each training example in the training set. ‘i’ is the training example number. This equation shows you how to update the weight for the ith training example.

The paper describes D_t as a distribution. This just means that each weight D(i) represents the probability that training example i will be selected as part of the training set.

To make it a distribution, all of these probabilities should add up to 1. To ensure this, we normalize the weights by dividing each of them by the sum of all the weights, Z_t. So, for example, if all of the calculated weights added up to 12.2, then we would divide each of the weights by 12.2 so that they sum up to 1.0 instead.

This vector is updated for each new weak classifier that’s trained. D_t refers to the weight vector used when training classifier ‘t’.

This equation needs to be evaluated for each of the training samples ‘i’ (x_i, y_i). Each weight from the previous training round is going to be scaled up or down by this exponential term.

To understand how this exponential term behaves, let’s look first at how exp(x) behaves.

The function exp(x) will return a fraction for negative values of x, and a value greater than one for positive values of x. So the weight for training sample i will be either increased or decreased depending on the final sign of the term “-alpha * y * h(x)”. For binary classifiers whose output is constrained to either -1 or +1, the terms y and h(x) only contribute to the sign and not the magnitude.

y_i is the correct output for training example ‘i’, and h_t(x_i) is the predicted output by classifier t on this training example. If the predicted and actual output agree, y * h(x) will always be +1 (either 1 * 1 or -1 * -1). If they disagree, y * h(x) will be negative.

Ultimately, misclassifications by a classifier with a positive alpha will cause this training example to be given a larger weight. And vice versa.

Note that by including alpha in this term, we are also incorporating the classifier’s effectiveness into consideration when updating the weights. If a weak classifier misclassifies an input, we don’t take that as seriously as a strong classifier’s mistake.

To summarize, the product -h*y determines the direct cost of the classification result; this cost shows up in the sign of the exponent, which has the dominant effect, so think of it as the "primary incentive". Alpha then corresponds to the magnitude of the penalty or reward in the misclassified or correctly classified case, a "secondary incentive".

Practical Application

One of the biggest applications of AdaBoost that I’ve encountered is the Viola-Jones face detector, which seems to be the standard algorithm for detecting faces in an image. The Viola-Jones face detector uses a “rejection cascade” consisting of many layers of classifiers. If at any layer the detection window is not recognized as a face, it’s rejected and we move on to the next window. The first classifier in the cascade is designed to discard as many negative windows as possible with minimal computational cost.

In this context, AdaBoost actually has two roles. Each layer of the cascade is a strong classifier built out of a combination of weaker classifiers, as discussed here. However, the principles of AdaBoost are also used to find the best features to use in each layer of the cascade.

The rejection cascade concept seems to be an important one; in addition to the Viola-Jones face detector, I’ve seen it used in a couple of highly-cited person detector algorithms (here and here). If you’re interested in learning more about the rejection cascade technique, I recommend reading the original paper, which I think is very clear and well written. (Note that the topics of Haar wavelet features and integral images are not essential to the concept of rejection cascades).

The official AdaBoost algorithm flowchart (not reproduced here):
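As a substitute for the flowchart, the sketch below implements the loop described in this post with NumPy and brute-force decision stumps. It follows Schapire's formulation, using the weights D directly in the weighted error rather than resampling a new training set each round; a resampling variant would instead draw a bootstrap sample according to D before training each stump. The function names and the stump search are illustrative, not the original author's code.

```python
import numpy as np

def train_stump(X, y, D):
    """Exhaustively pick the threshold stump (feature, threshold, polarity) with lowest weighted error."""
    n_features = X.shape[1]
    best = {"error": np.inf}
    for f in range(n_features):
        for thresh in np.unique(X[:, f]):
            for polarity in (1, -1):
                pred = np.where(polarity * (X[:, f] - thresh) > 0, 1, -1)
                err = np.sum(D[pred != y])   # error measured under the current weights
                if err < best["error"]:
                    best = {"error": err, "feature": f, "thresh": thresh, "polarity": polarity}
    return best

def stump_predict(stump, X):
    f, t, p = stump["feature"], stump["thresh"], stump["polarity"]
    return np.where(p * (X[:, f] - t) > 0, 1, -1)

def adaboost_train(X, y, T=10):
    """Train T weak classifiers; X is an (n, d) array, y an (n,) array with values in {-1, +1}."""
    n = len(y)
    D = np.full(n, 1.0 / n)                  # round 1: equal weight on every example
    ensemble = []
    for _ in range(T):
        stump = train_stump(X, y, D)
        err = min(max(stump["error"], 1e-10), 1 - 1e-10)   # keep the log finite
        alpha = 0.5 * np.log((1.0 - err) / err)
        pred = stump_predict(stump, X)
        D = D * np.exp(-alpha * y * pred)    # raise weights of misclassified examples
        D = D / D.sum()                      # divide by Z_t so D stays a distribution
        ensemble.append((alpha, stump))
    return ensemble

def adaboost_predict(ensemble, X):
    """Final classifier: sign of the alpha-weighted sum of weak classifier outputs."""
    score = sum(alpha * stump_predict(stump, X) for alpha, stump in ensemble)
    return np.sign(score)                    # exact ties come out as 0 in this sketch
```

adaboost_train returns the list of (alpha, stump) pairs, and adaboost_predict reproduces the sign-of-weighted-sum rule from the formal definition above.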
