ID3 Decision Tree Algorithm Implementation (Python Version)

# -*- coding: utf-8 -*-

from numpy import *  # wildcard import kept from the original listing
import numpy as np
import pandas as pd
from math import log  # imported after numpy, so log() below is math.log
import operator

# Compute the Shannon entropy of a data set
def calcShannonEnt(dataSet):
    numEntries=len(dataSet)
    labelCounts={}
    # Count the samples in each class (the class label is the last column)
    for featVec in dataSet:
        currentLabel=featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel]=0
        labelCounts[currentLabel]+=1
    shannonEnt=0.0
    # Accumulate the entropy using base-2 logarithms
    for key in labelCounts:
        prob=float(labelCounts[key])/numEntries
        shannonEnt-=prob*log(prob,2)
    return shannonEnt


# Split the data set on a discrete feature: return every sample whose
# feature `axis` equals `value`, with that feature column removed
def splitDataSet(dataSet,axis,value):
    retDataSet=[]
    for featVec in dataSet:
        if featVec[axis]==value:
            reducedFeatVec=featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

# Split the data set on a continuous feature. `direction` selects the side:
# 0 returns the samples greater than `value`, anything else returns the
# samples less than or equal to `value`
def splitContinuousDataSet(dataSet,axis,value,direction):
    retDataSet=[]
    for featVec in dataSet:
        if direction==0:
            if featVec[axis]>value:
                reducedFeatVec=featVec[:axis]
                reducedFeatVec.extend(featVec[axis+1:])
                retDataSet.append(reducedFeatVec)
        else:
            if featVec[axis]<=value:
                reducedFeatVec=featVec[:axis]
                reducedFeatVec.extend(featVec[axis+1:])
                retDataSet.append(reducedFeatVec)
    return retDataSet

# Choose the best feature to split the data set on
def chooseBestFeatureToSplit(dataSet,labels):
    numFeatures=len(dataSet[0])-1
    baseEntropy=calcShannonEnt(dataSet)
    bestInfoGain=0.0
    bestFeature=-1
    bestSplitDict={}
    for i in range(numFeatures):
        featList=[example[i] for example in dataSet]
        # Continuous (numeric) feature
        if isinstance(featList[0],(int,float)):
            # Generate n-1 candidate split points: midpoints of adjacent sorted values
            sortfeatList=sorted(featList)
            splitList=[]
            for j in range(len(sortfeatList)-1):
                splitList.append((sortfeatList[j]+sortfeatList[j+1])/2.0)

            bestSplitEntropy=10000
            slen=len(splitList)
            # Compute the entropy after splitting at the j-th candidate point
            # and keep track of the best split point
            for j in range(slen):
                value=splitList[j]
                newEntropy=0.0
                subDataSet0=splitContinuousDataSet(dataSet,i,value,0)
                subDataSet1=splitContinuousDataSet(dataSet,i,value,1)
                prob0=len(subDataSet0)/float(len(dataSet))
                newEntropy+=prob0*calcShannonEnt(subDataSet0)
                prob1=len(subDataSet1)/float(len(dataSet))
                newEntropy+=prob1*calcShannonEnt(subDataSet1)
                if newEntropy<bestSplitEntropy:
                    bestSplitEntropy=newEntropy
                    bestSplit=j
            # Record the best split point of this feature in a dictionary
            bestSplitDict[labels[i]]=splitList[bestSplit]
            infoGain=baseEntropy-bestSplitEntropy
        # Discrete feature
        else:
            uniqueVals=set(featList)
            newEntropy=0.0
            # Sum the weighted entropy of the subset for each value of the feature
            for value in uniqueVals:
                subDataSet=splitDataSet(dataSet,i,value)
                prob=len(subDataSet)/float(len(dataSet))
                newEntropy+=prob*calcShannonEnt(subDataSet)
            infoGain=baseEntropy-newEntropy
        if infoGain>bestInfoGain:
            bestInfoGain=infoGain
            bestFeature=i
    # If the best feature at this node is continuous, binarize it in place
    # around the recorded split point: 1 if value<=bestSplitValue, else 0
    if isinstance(dataSet[0][bestFeature],(int,float)):
        bestSplitValue=bestSplitDict[labels[bestFeature]]
        labels[bestFeature]=labels[bestFeature]+'<='+str(bestSplitValue)
        for i in range(len(dataSet)):
            if dataSet[i][bestFeature]<=bestSplitValue:
                dataSet[i][bestFeature]=1
            else:
                dataSet[i][bestFeature]=0
    return bestFeature

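# If every feature has been used but the samples at a node still belong to
# different classes, decide the class by majority vote
def majorityCnt(classList):
    classCount={}
    for vote in classList:
        if vote not in classCount:
            classCount[vote]=0
        classCount[vote]+=1
    # Return the class with the highest vote count
    return max(classCount,key=classCount.get)
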
# Main routine: build the decision tree recursively
def createTree(dataSet,labels,data_full,labels_full):
    classList=[example[-1] for example in dataSet]
    # Stop when every sample belongs to the same class
    if classList.count(classList[0])==len(classList):
        return classList[0]
    # Stop when no features are left and fall back to a majority vote
    if len(dataSet[0])==1:
        return majorityCnt(classList)
    bestFeat=chooseBestFeatureToSplit(dataSet,labels)
    bestFeatLabel=labels[bestFeat]
    myTree={bestFeatLabel:{}}
    featValues=[example[bestFeat] for example in dataSet]
    uniqueVals=set(featValues)
    # For a discrete feature, collect every value it takes in the full data set
    if isinstance(dataSet[0][bestFeat],str):
        currentlabel=labels_full.index(labels[bestFeat])
        featValuesFull=[example[currentlabel] for example in data_full]
        uniqueValsFull=set(featValuesFull)
    del(labels[bestFeat])
    # Build one subtree for each value of bestFeat
    for value in uniqueVals:
        subLabels=labels[:]
        if isinstance(dataSet[0][bestFeat],str):
            uniqueValsFull.remove(value)
        myTree[bestFeatLabel][value]=createTree(
            splitDataSet(dataSet,bestFeat,value),subLabels,data_full,labels_full)
    # Values that do not occur at this node become leaves labelled by majority vote
    if isinstance(dataSet[0][bestFeat],str):
        for value in uniqueValsFull:
            myTree[bestFeatLabel][value]=majorityCnt(classList)
    return myTree

import matplotlib.pyplot as plt
decisionNode=dict(boxstyle="sawtooth",fc="0.8")
leafNode=dict(boxstyle="round4",fc="0.8")
arrow_args=dict(arrowstyle="<-")


# Count the leaf nodes of the tree
def getNumLeafs(myTree):
    numLeafs=0
    firstSides=list(myTree.keys())
    firstStr=firstSides[0]
    secondDict=myTree[firstStr]
    for key in secondDict.keys():
        if isinstance(secondDict[key],dict):
            numLeafs+=getNumLeafs(secondDict[key])
        else:
            numLeafs+=1
    return numLeafs

# Compute the maximum depth of the tree
def getTreeDepth(myTree):
    maxDepth=0
    firstSides=list(myTree.keys())
    firstStr=firstSides[0]
    secondDict=myTree[firstStr]
    for key in secondDict.keys():
        if isinstance(secondDict[key],dict):
            thisDepth=1+getTreeDepth(secondDict[key])
        else:
            thisDepth=1
        if thisDepth>maxDepth:
            maxDepth=thisDepth
    return maxDepth

# Draw a node with an arrow pointing to it from its parent
def plotNode(nodeTxt,centerPt,parentPt,nodeType):
    createPlot.ax1.annotate(str(nodeTxt),xy=parentPt,xycoords='axes fraction',
                            xytext=centerPt,textcoords='axes fraction',va="center",ha="center",
                            bbox=nodeType,arrowprops=arrow_args)

# Draw the text on an arrow (the branch value)
def plotMidText(cntrPt,parentPt,txtString):
    lens=len(txtString)
    xMid=(parentPt[0]+cntrPt[0])/2.0-lens*0.002
    yMid=(parentPt[1]+cntrPt[1])/2.0
    createPlot.ax1.text(xMid,yMid,txtString)

# Draw the tree recursively; plotTree.x0ff and plotTree.y0ff hold the current drawing position
def plotTree(myTree,parentPt,nodeTxt):
    numLeafs=getNumLeafs(myTree)
    firstSides=list(myTree.keys())
    firstStr=firstSides[0]
    # Centre this decision node above the leaves of its subtree
    cntrPt=(plotTree.x0ff+(1.0+float(numLeafs))/2.0/plotTree.totalW,plotTree.y0ff)
    plotMidText(cntrPt,parentPt,nodeTxt)
    plotNode(firstStr,cntrPt,parentPt,decisionNode)
    secondDict=myTree[firstStr]
    # Move down one level before drawing the children
    plotTree.y0ff=plotTree.y0ff-1.0/plotTree.totalD
    for key in secondDict.keys():
        if isinstance(secondDict[key],dict):
            plotTree(secondDict[key],cntrPt,str(key))
        else:
            plotTree.x0ff=plotTree.x0ff+1.0/plotTree.totalW
            plotNode(secondDict[key],(plotTree.x0ff,plotTree.y0ff),cntrPt,leafNode)
            plotMidText((plotTree.x0ff,plotTree.y0ff),cntrPt,str(key))
    # Move back up one level once this subtree is finished
    plotTree.y0ff=plotTree.y0ff+1.0/plotTree.totalD

# Set up the figure and draw the whole tree
def createPlot(inTree):
    fig=plt.figure(1,facecolor='white')
    fig.clf()
    axprops=dict(xticks=[],yticks=[])
    createPlot.ax1=plt.subplot(111,frameon=False,**axprops)
    plotTree.totalW=float(getNumLeafs(inTree))
    plotTree.totalD=float(getTreeDepth(inTree))
    plotTree.x0ff=-0.5/plotTree.totalW
    plotTree.y0ff=1.0
    plotTree(inTree,(0.5,1.0),'')
    plt.show()

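# Load the data: the first column is dropped and the last column is the class label
df=pd.read_csv('watermelon_4_3.csv')
data=df.values[:,1:].tolist()
# Copy every row so that in-place binarization of data does not alter data_full
data_full=[row[:] for row in data]
labels=df.columns.values[1:-1].tolist()
labels_full=labels[:]
myTree=createTree(data,labels,data_full,labels_full)
print(myTree)
createPlot(myTree)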

The final result is as follows:

{'texture': {'blur': 0, 'little_blur': {'touch': {'soft_stick': 1, 'hard_smooth': 0}}, 'distinct': {'density<=0.38149999999999995': {0: 1, 1: 0}}}}
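
The printed dictionary can be used directly to classify a new sample. The following is only a minimal sketch and is not part of the original listing: `classify` is a hypothetical helper, and the sample is assumed to be a dict that maps feature names to values. It relies on the convention in `chooseBestFeatureToSplit`, where a continuous node is labelled `name<=threshold` and its branch `1` means the value is less than or equal to the threshold.

```python
def classify(tree, sample):
    """Walk the nested-dict tree until a leaf (a non-dict value) is reached.

    `sample` is assumed to map feature names to values (hypothetical format).
    """
    while isinstance(tree, dict):
        node_label = next(iter(tree))        # each internal node has exactly one key
        branches = tree[node_label]
        if '<=' in node_label:
            # Continuous node: the label looks like "density<=0.3815";
            # branch 1 means value <= threshold, branch 0 means value > threshold
            name, _, threshold = node_label.rpartition('<=')
            key = 1 if sample[name] <= float(threshold) else 0
        else:
            # Discrete node: follow the branch named after the sample's value
            key = sample[node_label]
        tree = branches[key]
    return tree


# The tree printed above, copied verbatim
my_tree = {'texture': {'blur': 0,
                       'little_blur': {'touch': {'soft_stick': 1, 'hard_smooth': 0}},
                       'distinct': {'density<=0.38149999999999995': {0: 1, 1: 0}}}}

print(classify(my_tree, {'texture': 'distinct', 'density': 0.697}))          # 1
print(classify(my_tree, {'texture': 'little_blur', 'touch': 'hard_smooth'}))  # 0
```

A sample whose feature value has no branch in the tree raises `KeyError` in this sketch; how to handle such values is left open.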

The resulting decision tree is shown below:
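
Because the tree is an ordinary nested dictionary, it can also be rendered as indented text when a plot is not at hand. This is a rough sketch rather than part of the original program; `tree_to_lines` is a hypothetical helper name, and the tree literal is the dictionary printed by the program.

```python
def tree_to_lines(tree, indent=0):
    """Return the nested-dict tree as a list of indented text lines."""
    pad = '    ' * indent
    if not isinstance(tree, dict):
        # Leaf: the stored value is the predicted class
        return [pad + '-> class ' + str(tree)]
    lines = []
    node_label = next(iter(tree))            # each internal node has exactly one key
    for branch_value, subtree in tree[node_label].items():
        lines.append('{}{} = {}'.format(pad, node_label, branch_value))
        lines.extend(tree_to_lines(subtree, indent + 1))
    return lines


my_tree = {'texture': {'blur': 0,
                       'little_blur': {'touch': {'soft_stick': 1, 'hard_smooth': 0}},
                       'distinct': {'density<=0.38149999999999995': {0: 1, 1: 0}}}}

print('\n'.join(tree_to_lines(my_tree)))
```

For a continuous node, the branch `1` reads as "the condition in the label holds" and the branch `0` as "it does not".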

References:

Machine Learning in Action (《机器学习实战》)

Machine Learning (《机器学习》), by Zhou Zhihua

Time: 2024-10-30 01:44:49
