Decision Tree and Its Implementation

Tags: decision tree, entropy, information gain, classification, supervised learning

2014-03-17 12:12

Categories: Data Mining, Python, Machine Learning

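This post implements a Decision Tree step by step in Python, in the following stages:

Loading the dataset
Computing entropy
Splitting the data on the best split feature
Choosing the best split feature by maximum information gain
Building the decision tree recursively
Classifying samples

This post covers almost none of the theory behind decision trees; for details, search for the keywords "decision tree information gain entropy". Each of these concepts shows up in the code.

Only one .py file is created for this post, and all the code lives in it.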

1. Loading the dataset

We use the classic UCI Iris dataset as the example.

Brief of IRIS:

Data Set Characteristics: Multivariate
Number of Instances: 150
Area: Life
Attribute Characteristics: Real
Number of Attributes: 4
Date Donated: 1988-07-01
Associated Tasks: Classification
Missing Values: No
Number of Web Hits: 533125

Code:

from numpy import loadtxt, mean, append
# load "iris.data" into the workspace (raw string, so the backslashes are not read as escapes)
datapath = r"D:\ZJU_Projects\machine learning\ML_Action\Dataset\Iris.data"
traindata = loadtxt(datapath, delimiter=',', usecols=(0, 1, 2, 3), dtype=float)
trainlabel = loadtxt(datapath, delimiter=',', usecols=(4,), dtype=str)
feaname = ["#0", "#1", "#2", "#3"]  # feature names of the 4 attributes (features)
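To try the loading step without the file on disk, here is a minimal sketch that parses a few rows in the same comma-separated layout from an in-memory string. The helper name load_iris_text and the variable sample_text are made up for this illustration; only numpy's loadtxt is taken from the code above.

```python
import io
import numpy as np

# Three rows in the iris.data layout: four numeric columns, then the class name.
sample_text = """5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

def load_iris_text(text):
    """Parse Iris-style CSV text into a float feature matrix and a string label vector."""
    # loadtxt consumes the stream, so a fresh StringIO is created for each pass.
    features = np.loadtxt(io.StringIO(text), delimiter=",", usecols=(0, 1, 2, 3), dtype=float)
    labels = np.loadtxt(io.StringIO(text), delimiter=",", usecols=(4,), dtype=str)
    return features, labels

features, labels = load_iris_text(sample_text)
print(features.shape)
print(labels.tolist())
```

Reading the numeric columns and the label column in two passes mirrors the two loadtxt calls above, since one call cannot mix float and string columns with a single dtype.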

The dataset has four real-valued features and one label giving the class (there are three classes: Iris-setosa, Iris-versicolor and Iris-virginica).

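2. Computing entropy

Entropy was introduced by Shannon (a giant of information theory); see the wiki for the definition.

Note that the entropy computed here is H(C|X=xi), not H(C|X). H(C|X) is computed later, in step 4, by weighting each H(C|X=xi) by its probability and summing.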

Code:

from math import log

def calentropy(label):
    n = label.size  # the number of samples
    count = {}  # maps each class label to its number of occurrences
    for curlabel in label:
        if curlabel not in count:
            count[curlabel] = 0
        count[curlabel] += 1
    entropy = 0
    for key in count:
        pxi = float(count[key]) / n  # convert to float first to avoid integer division
        entropy -= pxi * log(pxi, 2)
    return entropy

# testcode:
# x = calentropy(trainlabel)
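As a quick sanity check on the entropy computation, here is a small standalone sketch. It uses collections.Counter instead of the hand-written dictionary, and the function name entropy is just for this example. A set with a single class should give zero, and three equally frequent classes should give log2(3) bits.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

pure = ["a"] * 6                  # only one class present
balanced = ["a", "b", "c"] * 50   # three classes, 50 samples each

print(entropy(pure))      # a pure set carries no uncertainty
print(entropy(balanced))  # equals log2(3), roughly 1.585
```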


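3. Splitting the data on the best split feature

Assume the best split feature has already been found (its index is splitfea_idx); the first function, splitdata, performs the split.

The second function, idx2data, takes the two index sets produced by splitdata and returns datal (samples less than the pivot), datag (samples greater than or equal to the pivot), labell and labelg. The pivot is the mean value of the selected feature.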

# split the dataset on feature "splitfea_idx"
def splitdata(oridata, splitfea_idx):
    arg = args[splitfea_idx]  # pivot: the mean of this feature (see args)
    idx_less = []     # indices of samples whose feature value is below the pivot
    idx_greater = []  # indices of samples whose feature value is at or above the pivot
    n = len(oridata)
    for idx in range(n):
        d = oridata[idx]
        if d[splitfea_idx] < arg:
            idx_less.append(idx)
        else:
            idx_greater.append(idx)
    return idx_less, idx_greater

# testcode:
# idx_less, idx_greater = splitdata(traindata, 2)

# return the data and labels selected by the two index sets
def idx2data(oridata, label, splitidx, fea_idx):
    idxl = splitidx[0]  # split_less_indices
    idxg = splitidx[1]  # split_greater_indices
    datal = []
    datag = []
    for i in idxl:
        # drop column fea_idx, because this feature has now been used
        datal.append(append(oridata[i][:fea_idx], oridata[i][fea_idx+1:]))
    for i in idxg:
        datag.append(append(oridata[i][:fea_idx], oridata[i][fea_idx+1:]))
    labell = label[idxl]
    labelg = label[idxg]
    return datal, datag, labell, labelg

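Here args holds the parameters that set the split thresholds: one per feature, with values below the threshold going to the "<" branch and all other values to the ">" branch. It can be defined as follows:

args = mean(traindata, axis=0)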

Test: splitting on feature 2, that is, partitioning the samples by args[2], puts 57 samples in the "<" branch and 93 in the ">" branch.
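The same mean-threshold split can also be written with numpy boolean masks instead of an explicit loop. The sketch below is only an illustration: split_by_mean and the toy array are invented here, and the pivot is the mean of the array passed in, not the global args.

```python
import numpy as np

def split_by_mean(data, feature_idx):
    """Return (indices below the column mean, indices at or above it) for one feature."""
    column = np.asarray(data, dtype=float)[:, feature_idx]
    pivot = column.mean()
    less = np.flatnonzero(column < pivot)
    greater = np.flatnonzero(column >= pivot)
    return less, greater

# Toy data made up for illustration: 5 samples, 2 features.
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0],
                [8.0, 40.0],
                [6.0, 50.0]])

less, greater = split_by_mean(toy, 0)   # the mean of column 0 is 4.0
print(less.tolist(), greater.tolist())  # [0, 1, 2] [3, 4]
```

Every sample lands in exactly one of the two index sets, so the set sizes always add up to the number of samples, as with 57 + 93 = 150 above.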

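4. Choosing the best split feature by maximum information gain

The information gain is info_gain in the code; the comment gives the formula for the conditional entropy.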

# select the best feature to split on
def choosebest_splitnode(oridata, label):
    n_fea = len(oridata[0])
    n = len(label)
    base_entropy = calentropy(label)
    best_gain = -1
    best_idx = -1
    for fea_i in range(n_fea):  # calculate the entropy under each splitting feature
        cur_entropy = 0
        idxset_less, idxset_greater = splitdata(oridata, fea_i)
        prob_less = float(len(idxset_less)) / n
        prob_greater = float(len(idxset_greater)) / n
        # entropy(value|X) = \sum{p(xi)*entropy(value|X=xi)}
        cur_entropy += prob_less * calentropy(label[idxset_less])
        cur_entropy += prob_greater * calentropy(label[idxset_greater])
        info_gain = base_entropy - cur_entropy  # gain = entropy before the split minus entropy after
        if info_gain > best_gain:
            best_gain = info_gain
            best_idx = fea_i
    return best_idx

# testcode:
# x = choosebest_splitnode(traindata, trainlabel)

The test here runs on the full dataset and answers the question: which feature is chosen for the first split?
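To see what the gain measures, here is a tiny worked example under simplified assumptions: plain Python lists, one feature at a time and a hand-picked pivot. The names entropy and info_gain are hypothetical helpers for this sketch only. A feature whose pivot separates the classes perfectly recovers the full base entropy, while a feature that leaves both sides equally mixed gains nothing.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels, pivot):
    """Information gain of splitting labels by whether values fall below pivot."""
    less = [l for v, l in zip(values, labels) if v < pivot]
    greater = [l for v, l in zip(values, labels) if v >= pivot]
    n = len(labels)
    # H(C|X) = sum over branches of p(branch) * H(C | branch)
    conditional = (len(less) / n) * entropy(less) + (len(greater) / n) * entropy(greater)
    return entropy(labels) - conditional

labels = ["a", "a", "b", "b"]
separating = [1.0, 2.0, 8.0, 9.0]     # pivot 5.0 puts all "a" on one side
uninformative = [1.0, 8.0, 2.0, 9.0]  # each side still has one "a" and one "b"

print(info_gain(separating, labels, 5.0))     # 1.0
print(info_gain(uninformative, labels, 5.0))  # 0.0
```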

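5. Building the decision tree recursively

See the code comments for details; buildtree builds the tree recursively.

The recursion stops when:

① the branch contains no samples (the subset is empty), or

② all samples in the subset belong to the same class, or

③ no features are left. Each split consumes one feature, so once the features run out the recursion stops and returns the label held by the majority of samples in the current subset.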
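The fallback in condition ③, returning the majority label, can be sketched in a few lines with collections.Counter. The helper name majority_label is an assumption for this illustration; the post's own code counts the labels by hand.

```python
from collections import Counter

def majority_label(labels):
    """Return the most frequent label in a non-empty sequence."""
    return Counter(labels).most_common(1)[0][0]

subset = ["Iris-setosa", "Iris-virginica", "Iris-virginica"]
print(majority_label(subset))  # Iris-virginica
```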

# create the decision tree based on information gain
def buildtree(oridata, label, feanames=None):
    if feanames is None:  # top-level call: start from all feature names
        feanames = list(feaname)
    if label.size == 0:  # no samples belong to this branch
        return "NULL"
    listlabel = label.tolist()
    # stop when all samples in this subset belong to one class
    if listlabel.count(label[0]) == label.size:
        return label[0]
    # if no features are left, return the majority label of this subset
    if len(feanames) == 0:
        cnt = {}
        for cur_l in label:
            if cur_l not in cnt:
                cnt[cur_l] = 0
            cnt[cur_l] += 1
        maxx = -1
        for key in cnt:
            if maxx < cnt[key]:
                maxx = cnt[key]
                maxkey = key
        return maxkey
    bestsplit_fea = choosebest_splitnode(oridata, label)  # get the best splitting feature
    print(bestsplit_fea, len(oridata[0]))
    cur_feaname = feanames[bestsplit_fea]  # name of the chosen feature
    print(cur_feaname)
    nodedict = {cur_feaname: {}}
    # feature names still available below this node; a fresh list per node,
    # so the "<" and ">" branches do not interfere with each other
    remaining = feanames[:bestsplit_fea] + feanames[bestsplit_fea+1:]
    # caveat: splitdata reads args[column], but idx2data drops the used column,
    # so below the root a column index may not match the original feature index in args
    split_idx = splitdata(oridata, bestsplit_fea)  # index sets for the "<" and ">" branches
    data_less, data_greater, label_less, label_greater = idx2data(oridata, label, split_idx, bestsplit_fea)
    # build the tree recursively; the left and right subtrees are the "<" and ">" branches
    nodedict[cur_feaname]["<"] = buildtree(data_less, label_less, remaining)
    nodedict[cur_feaname][">"] = buildtree(data_greater, label_greater, remaining)
    return nodedict

# testcode:
# mytree = buildtree(traindata, trainlabel)
# print(mytree)

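mytree is the resulting tree. A key such as #1 means that the node splits on feature #1, and the "<" and ">" keys hold the subtrees for the less and greater data, respectively.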
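A nested dict is hard to read once it has more than a couple of levels, so here is a small sketch that flattens such a tree into indented lines. The function format_tree and the example tree are hand-written for illustration; the tree is not actual output of buildtree.

```python
def format_tree(tree, depth=0):
    """Return a nested-dict tree as a list of text lines, indented by depth."""
    pad = "  " * depth
    if not isinstance(tree, dict):  # a leaf: just the class label (or "NULL")
        return [pad + "-> " + str(tree)]
    lines = []
    for fea_name, branches in tree.items():
        for branch in ("<", ">"):
            lines.append(pad + fea_name + " " + branch + " pivot")
            lines.extend(format_tree(branches[branch], depth + 1))
    return lines

# Hand-written tree in the same shape as mytree (illustrative only).
example_tree = {"#2": {"<": "Iris-setosa",
                       ">": {"#3": {"<": "Iris-versicolor",
                                    ">": "Iris-virginica"}}}}

for line in format_tree(example_tree):
    print(line)
```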

6. Classifying samples

Classification uses the tree built above (mytree) and walks down its branches recursively.

# classify a new sample
def classify(mytree, testdata):
    if not isinstance(mytree, dict):  # reached a leaf: return its label
        return mytree
    fea_name = list(mytree.keys())[0]  # name of the feature tested at this node
    fea_idx = feaname.index(fea_name)  # index of feature 'fea_name' in the original data
    val = testdata[fea_idx]
    nextbranch = mytree[fea_name]
    # compare the value with the pivot (average), using the same rule as splitdata
    if val < args[fea_idx]:
        nextbranch = nextbranch["<"]
    else:
        nextbranch = nextbranch[">"]
    return classify(nextbranch, testdata)

# testcode (assumes mytree = buildtree(traindata, trainlabel) has been run)
tt = traindata[0]
x = classify(mytree, tt)
print(x)
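The same walk can be done with a loop instead of recursion, and wrapping it in a small check gives a rough accuracy figure. Everything in this sketch is hypothetical and written by hand for illustration: the function classify_sample, the feature names, the pivots, the tree and the samples.

```python
def classify_sample(tree, sample, feature_names, pivots):
    """Walk a nested-dict tree iteratively until a leaf label is reached."""
    node = tree
    while isinstance(node, dict):
        fea_name = next(iter(node))  # the single key of this node
        fea_idx = feature_names.index(fea_name)
        branch = "<" if sample[fea_idx] < pivots[fea_idx] else ">"
        node = node[fea_name][branch]
    return node

# Hypothetical names, pivots and tree (not produced by buildtree).
feature_names = ["#0", "#1"]
pivots = [5.0, 2.0]
tree = {"#0": {"<": "small",
               ">": {"#1": {"<": "thin", ">": "wide"}}}}

samples = [[1.0, 9.0], [7.0, 1.0], [7.0, 3.0]]
expected = ["small", "thin", "wide"]

predicted = [classify_sample(tree, s, feature_names, pivots) for s in samples]
accuracy = sum(p == e for p, e in zip(predicted, expected)) / len(expected)
print(predicted, accuracy)
```

Passing the feature names and pivots as arguments, instead of reading globals, makes it easy to test the walk against a different set of thresholds.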


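To check that the code behaves correctly, we change the args parameters and set them all to 0 (a very small value):

args = [0,0,0,0]

After rebuilding the tree and classifying, no sample falls below the pivot (0), so every "<" key in the dict maps to an empty branch.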

Download for all the code in this post: Decision Tree Python implementation

Reference: Machine Learning in Action

from: http://blog.csdn.net/abcjennifer/article/details/20905311
