Gradient Boosted Regression Trees 2

Regularization

GBRT provides three knobs to control overfitting: tree structure, shrinkage, and randomization.

Tree Structure

The depth of the individual trees is one aspect of model complexity. It controls the degree of feature interactions that your model can fit: for example, if you want to capture the interaction between a feature latitude and a feature longitude, your trees need a depth of at least two. Unfortunately, the degree of feature interactions is not known in advance, but it is usually fine to assume that it is fairly low -- in practice, a depth of 4-6 usually gives the best results. In scikit-learn you can constrain the depth of the trees using the max_depth argument.
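
The snippet below is a minimal, self-contained sketch (not part of the original example) illustrating why a two-way interaction needs a depth of at least two: depth-1 stumps can only fit an additive model, so they cannot recover a pure product of two features.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(1000, 2))   # think of the columns as latitude and longitude
y = X[:, 0] * X[:, 1]                    # a pure two-way interaction

for depth in (1, 2):
    est = GradientBoostingRegressor(n_estimators=200, max_depth=depth).fit(X, y)
    print('max_depth=%d  training R^2: %.3f' % (depth, est.score(X, y)))
# expect a much lower training R^2 for the depth-1 stumps than for max_depth=2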

Another way to control the depth of the trees is by enforcing a lower bound on the number of samples in a leaf: this avoids unbalanced splits where a leaf is formed for just one extreme data point. In scikit-learn you can do this using the min_samples_leaf argument. This is effectively a means to introduce bias into your model, in the hope of also reducing its variance, as shown in the example below:

def fmt_params(params):
    return ", ".join("{0}={1}".format(key, val) for key, val in params.items())

# n_estimators, X_train, y_train, X_test, y_test and deviance_plot() are defined
# in the first part of this post
fig = plt.figure(figsize=(8, 5))
ax = plt.gca()
for params, (test_color, train_color) in [({}, ('#d7191c', '#2c7bb6')),
                                          ({'min_samples_leaf': 3},
                                           ('#fdae61', '#abd9e9'))]:
    est = GradientBoostingRegressor(n_estimators=n_estimators, max_depth=1, learning_rate=1.0)
    est.set_params(**params)
    est.fit(X_train, y_train)

    test_dev, ax = deviance_plot(est, X_test, y_test, ax=ax, label=fmt_params(params),
                                 train_color=train_color, test_color=test_color)

ax.annotate('Higher bias', xy=(900, est.train_score_[899]), xycoords='data',
            xytext=(600, 0.3), textcoords='data',
            arrowprops=dict(arrowstyle="->", connectionstyle="arc"))
ax.annotate('Lower variance', xy=(900, test_dev[899]), xycoords='data',
            xytext=(600, 0.4), textcoords='data',
            arrowprops=dict(arrowstyle="->", connectionstyle="arc"))
plt.legend(loc='upper right')

Shrinkage

The most important regularization technique for GBRT is shrinkage: the idea is to learn slowly by shrinking the predictions of each individual tree by a small scalar, the learning_rate. By doing so the model has to reinforce concepts across many trees. A lower learning_rate requires a higher n_estimators to reach the same level of training error -- so it trades runtime against accuracy.

fig = plt.figure(figsize=(8, 5))
ax = plt.gca()
for params, (test_color, train_color) in [({}, ('#d7191c', '#2c7bb6')),
                                          ({'learning_rate': 0.1},
                                           ('#fdae61', '#abd9e9'))]:
    est = GradientBoostingRegressor(n_estimators=n_estimators, max_depth=1, learning_rate=1.0)
    est.set_params(**params)
    est.fit(X_train, y_train)

    test_dev, ax = deviance_plot(est, X_test, y_test, ax=ax, label=fmt_params(params),
                                 train_color=train_color, test_color=test_color)

ax.annotate('Requires more trees', xy=(200, est.train_score_[199]), xycoords='data',
            xytext=(300, 1.0), textcoords='data',
            arrowprops=dict(arrowstyle="->", connectionstyle="arc"))
ax.annotate('Lower test error', xy=(900, test_dev[899]), xycoords='data',
            xytext=(600, 0.5), textcoords='data',
            arrowprops=dict(arrowstyle="->", connectionstyle="arc"))
plt.legend(loc='upper right')

Stochastic Gradient Boosting

Similar to RandomForest, introducing randomization into the tree-building process can lead to higher accuracy. Scikit-learn provides two ways to do this: a) subsampling the training set before growing each tree (subsample) and b) subsampling the features before finding the best split for each node (max_features). Experience shows that the latter works better if there is a sufficiently large number of features (>30). One thing worth noting is that both options also reduce runtime.

Below we show the effect of using subsample=0.5, i.e. growing each tree on 50% of the training data, on our toy example:

fig = plt.figure(figsize=(8, 5))
ax = plt.gca()
for params, (test_color, train_color) in [({}, ('#d7191c', '#2c7bb6')),
                                          ({'learning_rate': 0.1, 'subsample': 0.5},
                                           ('#fdae61', '#abd9e9'))]:
    est = GradientBoostingRegressor(n_estimators=n_estimators, max_depth=1, learning_rate=1.0,
                                    random_state=1)
    est.set_params(**params)
    est.fit(X_train, y_train)

    test_dev, ax = deviance_plot(est, X_test, y_test, ax=ax, label=fmt_params(params),
                                 train_color=train_color, test_color=test_color)

ax.annotate('Even lower test error', xy=(400, test_dev[399]), xycoords='data',
            xytext=(500, 0.5), textcoords='data',
            arrowprops=dict(arrowstyle="->", connectionstyle="arc"))

# subsample without shrinkage (learning_rate=1.0) for comparison
est = GradientBoostingRegressor(n_estimators=n_estimators, max_depth=1, learning_rate=1.0,
                                subsample=0.5)
est.fit(X_train, y_train)
test_dev, ax = deviance_plot(est, X_test, y_test, ax=ax, label=fmt_params({'subsample': 0.5}),
                             train_color='#abd9e9', test_color='#fdae61', alpha=0.5)
ax.annotate('Subsample alone does poorly', xy=(300, test_dev[299]), xycoords='data',
            xytext=(250, 1.0), textcoords='data',
            arrowprops=dict(arrowstyle="->", connectionstyle="arc"))
plt.legend(loc='upper right', fontsize='small')

Hyperparameter tuning

We have now introduced a number of hyperparameters -- and, as usual in machine learning, optimizing them is quite tedious, especially since they interact with each other (learning_rate and n_estimators, learning_rate and subsample, max_depth and max_features).

We usually follow this recipe to tune the hyperparameters for a gradient boosting model:

  1. Choose loss based on your problem at hand (i.e. your target metric).
  2. Pick n_estimators as large as (computationally) possible (e.g. 3000).
  3. Tune max_depth, learning_rate, min_samples_leaf, and max_features via grid search.
  4. Increase n_estimators even more and tune learning_rate again holding the other parameters fixed.

Scikit-learn provides a convenient API for hyperparameter tuning and grid search:

from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer scikit-learn versions

param_grid = {'learning_rate': [0.1, 0.05, 0.02, 0.01],
              'max_depth': [4, 6],
              'min_samples_leaf': [3, 5, 9, 17],
              # 'max_features': [1.0, 0.3, 0.1]  # not possible in our example (only 1 feature)
              }

est = GradientBoostingRegressor(n_estimators=3000)
# this may take some minutes
gs_cv = GridSearchCV(est, param_grid, n_jobs=4).fit(X_train, y_train)

# best hyperparameter setting
gs_cv.best_params_
Out: {'learning_rate': 0.05, 'max_depth': 6, 'min_samples_leaf': 5}
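
Step 4 of the recipe above is not covered by this grid search; a minimal sketch, assuming the gs_cv search above has already been run, might look like this (the learning_rate grid is illustrative, not prescriptive):

# hold the tuned tree parameters fixed, raise n_estimators further,
# and re-tune learning_rate on its own (step 4 of the recipe)
est = GradientBoostingRegressor(n_estimators=6000,
                                max_depth=gs_cv.best_params_['max_depth'],
                                min_samples_leaf=gs_cv.best_params_['min_samples_leaf'])
gs_lr = GridSearchCV(est, {'learning_rate': [0.05, 0.02, 0.01, 0.005]},
                     n_jobs=4).fit(X_train, y_train)
gs_lr.best_params_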

Use-case: California Housing

This use-case study shows how to apply GBRT to a real-world dataset. The task is to predict the log median house value for census block groups in California. The dataset is based on the 1990 census and comprises roughly 20,000 groups. There are eight features for each group, including median income, average house age, latitude, and longitude. To be consistent with [Hastie et al., The Elements of Statistical Learning, Ed2] we use Mean Absolute Error as our target metric and evaluate the results on an 80-20 train-test split.

import numpy as np
import pandas as pd
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
from sklearn.datasets.california_housing import fetch_california_housing

cal_housing = fetch_california_housing()

# split 80/20 train-test
X_train, X_test, y_train, y_test = train_test_split(cal_housing.data,
                                                    np.log(cal_housing.target),
                                                    test_size=0.2,
                                                    random_state=1)
names = cal_housing.feature_names

Some of the aspects that make this dataset challenging are: a) heterogeneous features (different scales and distributions) and b) non-linear feature interactions (specifically latitude and longitude). Furthermore, the data contains some extreme values of the response (log median house value) -- such a dataset strongly benefits from robust regression techniques such as the Huber loss.

Below you can see histograms for some of the features and the response. They are quite different: median income is left skewed, latitude and longitude are bi-modal, and log median house value is right skewed.

import pandas as pd

X_df = pd.DataFrame(data=X_train, columns=names)
X_df['LogMedHouseVal'] = y_train
_ = X_df.hist(column=['Latitude', 'Longitude', 'MedInc', 'LogMedHouseVal'])

est = GradientBoostingRegressor(n_estimators=3000, max_depth=6, learning_rate=0.04,
                                loss='huber', random_state=0)
est.fit(X_train, y_train)
GradientBoostingRegressor(alpha=0.9, init=None, learning_rate=0.04,
             loss='huber', max_depth=6, max_features=None,
             max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2,
             n_estimators=3000, random_state=0, subsample=1.0, verbose=0,
             warm_start=False)
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_test, est.predict(X_test))
print('MAE: %.4f' % mae)

Feature importance

Often features do not contribute equally to predicting the target response. When interpreting a model, the first question usually is: which features are important and how do they contribute to predicting the target response?

A GBRT model derives this information from the fitted regression trees which intrinsically perform feature selection by choosing appropriate split points. You can access this information via the instance attribute est.feature_importances_.


# sort importances
indices = np.argsort(est.feature_importances_)
# plot as bar chart
plt.barh(np.arange(len(names)), est.feature_importances_[indices])
plt.yticks(np.arange(len(names)) + 0.25, np.array(names)[indices])
_ = plt.xlabel('Relative importance')

Partial dependence

Partial dependence plots show the dependence between the response and a set of features, marginalizing over the values of all other features. Intuitively, we can interpret the partial dependence as the expected response as a function of the features we conditioned on.

The plot below contains four one-way partial dependence plots (PDPs), each showing the effect of an individual feature on the response. We can see that median income (MedInc) has a linear relationship with the log median house value. The contour plot shows a two-way PDP. Here we can see an interesting feature interaction: house age by itself has hardly any effect on the response, but when AveOccup is small it does (the older the house, the higher the price).

from sklearn.ensemble.partial_dependence import plot_partial_dependence

features = ['MedInc', 'AveOccup', 'HouseAge', 'AveRooms',
            ('AveOccup', 'HouseAge')]
fig, axs = plot_partial_dependence(est, X_train, features,
                                   feature_names=names, figsize=(8, 6))

Scikit-learn provides a convenience function to create such plots, sklearn.ensemble.partial_dependence.plot_partial_dependence, as well as a low-level function, partial_dependence, that you can use to create custom partial dependence plots (e.g. map overlays or 3d plots).