[Spark MLlib Quick Guide] Models, Part 07: Gradient-Boosted Trees (Python Version)

# -*-coding=utf-8 -*-
from pyspark import SparkConf, SparkContext
sc = SparkContext('local')

from pyspark.mllib.tree import GradientBoostedTrees, GradientBoostedTreesModel
from pyspark.mllib.util import MLUtils

# Load and parse the data file.
data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
'''
Each line represents one labeled sparse feature vector in the following format
(indices in the file are 1-based; the parsed vectors below are 0-based):
label index1:value1 index2:value2 ...

>>> tempFile.write(b"+1 1:1.0 3:2.0 5:3.0\\n-1\\n-1 2:4.0 4:5.0 6:6.0")
>>> tempFile.flush()
>>> examples = MLUtils.loadLibSVMFile(sc, tempFile.name).collect()
>>> tempFile.close()
>>> examples[0]
LabeledPoint(1.0, (6,[0,2,4],[1.0,2.0,3.0]))
>>> examples[1]
LabeledPoint(-1.0, (6,[],[]))
>>> examples[2]
LabeledPoint(-1.0, (6,[1,3,5],[4.0,5.0,6.0]))
'''
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a GradientBoostedTrees model.
#  Notes: (a) Empty categoricalFeaturesInfo indicates all features are continuous.
#         (b) Use more iterations in practice.
model = GradientBoostedTrees.trainClassifier(trainingData,
                                             categoricalFeaturesInfo={}, numIterations=30)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(
    lambda lp: lp[0] != lp[1]).count() / float(testData.count())
print('Test Error = ' + str(testErr))  # Test Error = 0.0
print('Learned classification GBT model:')
print(model.toDebugString())
'''
TreeEnsembleModel classifier with 30 trees

  Tree 0:
    If (feature 434 <= 0.0)
     If (feature 100 <= 165.0)
      Predict: -1.0
     Else (feature 100 > 165.0)
      Predict: 1.0
    Else (feature 434 > 0.0)
     Predict: 1.0
  Tree 1:
    If (feature 490 <= 0.0)
     If (feature 549 <= 253.0)
      If (feature 184 <= 0.0)
       Predict: -0.4768116880884702
      Else (feature 184 > 0.0)
       Predict: -0.47681168808847024
     Else (feature 549 > 253.0)
      Predict: 0.4768116880884694
    Else (feature 490 > 0.0)
     If (feature 215 <= 251.0)
      Predict: 0.4768116880884701
     Else (feature 215 > 251.0)
      Predict: 0.4768116880884712
  ...
  Tree 29:
    If (feature 434 <= 0.0)
     If (feature 209 <= 4.0)
      Predict: 0.1335953290513215
     Else (feature 209 > 4.0)
      If (feature 372 <= 84.0)
       Predict: -0.13359532905132146
      Else (feature 372 > 84.0)
       Predict: -0.1335953290513215
    Else (feature 434 > 0.0)
     Predict: 0.13359532905132146
'''
# Save and load model
model.save(sc, "myGradientBoostingClassificationModel")
sameModel = GradientBoostedTreesModel.load(sc, "myGradientBoostingClassificationModel")
# Predict on the first record; first() avoids collecting the whole RDD to the driver.
print(sameModel.predict(data.first().features))  # 0.0
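
The docstring near the top of the script shows that loadLibSVMFile turns the 1-based indices in the file into 0-based vector indices. As a minimal sketch of that convention in plain Python (no Spark required), the helper below reproduces the index shift on the three sample lines from the docstring. Note that parse_libsvm_line is a hypothetical name introduced only for illustration; it is not part of MLlib and does not build real LabeledPoint objects.

```python
def parse_libsvm_line(line):
    """Parse one 'label index:value ...' line into (label, indices, values).

    Hypothetical helper for illustration only (not an MLlib function).
    File indices are 1-based; the returned indices are 0-based.
    """
    tokens = line.split()
    label = float(tokens[0])
    indices, values = [], []
    for token in tokens[1:]:
        index, value = token.split(":")
        indices.append(int(index) - 1)  # shift 1-based to 0-based
        values.append(float(value))
    return label, indices, values


# The three records written to tempFile in the docstring above.
lines = ["+1 1:1.0 3:2.0 5:3.0", "-1", "-1 2:4.0 4:5.0 6:6.0"]
parsed = [parse_libsvm_line(line) for line in lines]
for label, indices, values in parsed:
    print(label, indices, values)
```

A line that carries only a label, like the second one, yields empty index and value lists, which corresponds to the all-zero sparse vector in the docstring.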

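
The debug string lists trees whose leaves hold real values such as 0.1336, yet predict returns 0.0 or 1.0. As far as I understand MLlib's GBT classifier, it sums the weighted outputs of all trees and thresholds the sum at zero. The plain-Python sketch below illustrates that idea only. The functions tree_0 and tree_29 are hand-written stand-ins transcribed from the printed output (leaf values rounded), features are passed as a dict of nonzero entries, and the weights 1.0 and 0.1 are assumptions (first tree at full weight, later trees scaled by a learning rate of 0.1), not values read from the trained model.

```python
def tree_0(x):
    # Stand-in for "Tree 0" in the debug string; missing features count as 0.0.
    if x.get(434, 0.0) <= 0.0:
        return -1.0 if x.get(100, 0.0) <= 165.0 else 1.0
    return 1.0


def tree_29(x):
    # Stand-in for "Tree 29"; leaf values rounded from the debug string.
    if x.get(434, 0.0) <= 0.0:
        if x.get(209, 0.0) <= 4.0:
            return 0.1336
        return -0.1336  # both leaves under feature 372 round to this value
    return 0.1336


def gbt_predict(features, trees, tree_weights):
    """Sum the weighted tree outputs, then threshold the margin at zero."""
    margin = sum(w * tree(features) for tree, w in zip(trees, tree_weights))
    return 1.0 if margin > 0.0 else 0.0


trees = [tree_0, tree_29]
tree_weights = [1.0, 0.1]  # assumed weights, see the note above

print(gbt_predict({434: 5.0}, trees, tree_weights))  # both trees vote positive
print(gbt_predict({}, trees, tree_weights))          # Tree 0 dominates with -1.0
```

With only two of the thirty trees this does not reproduce the trained model's predictions; it only shows why real-valued leaves still lead to a 0.0 or 1.0 class label.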

Time: 2024-10-04 01:10:52

