机器学习实战读书笔记(三)决策树

3.1 决策树的构造

优点:计算复杂度不高,输出结果易于理解,对中间值的缺失不敏感,可以处理不相关特征数据.

缺点:可能会产生过度匹配问题.

适用数据类型:数值型和标称型.

一般流程:

1.收集数据

2.准备数据

3.分析数据

4.训练算法

5.测试算法

6.使用算法

3.1.1 信息增益

创建数据集

def createDataSet():
    dataSet = [[1, 1, ‘yes‘],
               [1, 1, ‘yes‘],
               [1, 0, ‘no‘],
               [0, 1, ‘no‘],
               [0, 1, ‘no‘]]
    labels = [‘no surfacing‘,‘flippers‘]
    #change to discrete values
    return dataSet, labels

调用一下

myDat,labels=tree.createDataSet()

计算给定数据集的香农熵

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet: #the the number of unique elements and their occurance
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys(): labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob * log(prob,2) #log base 2
    return shannonEnt

3.1.2 划分数据集

def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]     #chop out axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

选择最好的数据集划分方式

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1      #the last column is used for the labels
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):        #iterate over all the features
        featList = [example[i] for example in dataSet]#create a list of all the examples of this feature
        uniqueVals = set(featList)       #get a set of unique values
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy     #calculate the info gain; ie reduction in entropy
        if (infoGain > bestInfoGain):       #compare this to the best gain so far
            bestInfoGain = infoGain         #if better than current best, set to best
            bestFeature = i
    return bestFeature                      #returns an integer

3.1.3 递归构建决策树

工作原理:得到原始数据集,基于最好的属性值划分数据集,由于特征值可能多于两个,因此可能存在大于两个分支的数据集划分.第一次划分后,数据将被向下传递到树分支的下一个节点,在这个节点上,我们可以再次划分数据.因此可以采用递归的原则处理数据集.

递归结束的条件是:程序遍历完所有划分数据集的属性,或者每个分支下的所有实例都具有相同的分类.如果所有实例具有相同的分类,则得到一个叶子节点或者终止块.任何到达叶子节点的数据必然属于叶子节点的分类.

def majorityCnt(classList):
    classCount={}
    for vote in classList:
        if vote not in classCount.keys(): classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createTree(dataSet,labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]#stop splitting when all of the classes are equal
    if len(dataSet[0]) == 1: #stop splitting when there are no more features in dataSet
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]       #copy all of labels, so trees don‘t mess up existing labels
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value),subLabels)
    return myTree

调用一下

myTree=tree.createTree(myDat,labels)

结果如下:

{‘no surfacing‘: {0: ‘no‘, 1: {‘flippers‘: {0: ‘no‘, 1: ‘yes‘}}}}

3.2 在Python中使用Matplotlib注解缓制树形图

第一个版本

import matplotlib.pyplot as plt

decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")

def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt, xy=parentPt,  xycoords=‘axes fraction‘,
             xytext=centerPt, textcoords=‘axes fraction‘,
             va="center", ha="center", bbox=nodeType, arrowprops=arrow_args )
def createPlot():
    fig = plt.figure(1, facecolor=‘white‘)
    fig.clf()
    createPlot.ax1 = plt.subplot(111, frameon=False) #ticks for demo puropses
    plotNode(‘a decision node‘, (0.5, 0.1), (0.1, 0.5), decisionNode)
    plotNode(‘a leaf node‘, (0.8, 0.1), (0.3, 0.8), leafNode)
    plt.show()

是这样的图

重新来一个版本

def getNumLeafs(myTree):
    numLeafs = 0
    firstStr = myTree.keys()[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__==‘dict‘:#test to see if the nodes are dictonaires, if not they are leaf nodes
            numLeafs += getNumLeafs(secondDict[key])
        else:   numLeafs +=1
    return numLeafs

def getTreeDepth(myTree):
    maxDepth = 0
    firstStr = myTree.keys()[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        if type(secondDict[key]).__name__==‘dict‘:#test to see if the nodes are dictonaires, if not they are leaf nodes
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:   thisDepth = 1
        if thisDepth > maxDepth: maxDepth = thisDepth
    return maxDepth

def retrieveTree(i):
    listOfTrees =[{‘no surfacing‘: {0: ‘no‘, 1: {‘flippers‘: {0: ‘no‘, 1: ‘yes‘}}}},
                  {‘no surfacing‘: {0: ‘no‘, 1: {‘flippers‘: {0: {‘head‘: {0: ‘no‘, 1: ‘yes‘}}, 1: ‘no‘}}}}
                  ]
    return listOfTrees[i]

def plotMidText(cntrPt, parentPt, txtString):
    xMid = (parentPt[0]-cntrPt[0])/2.0 + cntrPt[0]
    yMid = (parentPt[1]-cntrPt[1])/2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)

def plotTree(myTree, parentPt, nodeTxt):#if the first key tells you what feat was split on
    numLeafs = getNumLeafs(myTree)  #this determines the x width of this tree
    depth = getTreeDepth(myTree)
    firstStr = myTree.keys()[0]     #the text label for this node should be this
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs))/2.0/plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0/plotTree.totalD
    for key in secondDict.keys():
        if type(secondDict[key]).__name__==‘dict‘:#test to see if the nodes are dictonaires, if not they are leaf nodes
            plotTree(secondDict[key],cntrPt,str(key))        #recursion
        else:   #it‘s a leaf node print the leaf node
            plotTree.xOff = plotTree.xOff + 1.0/plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0/plotTree.totalD
#if you do get a dictonary you know it‘s a tree, and the first element will be another dict

新版本

def createPlot(inTree):
    fig = plt.figure(1, facecolor=‘white‘)
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)    #no ticks
    #createPlot.ax1 = plt.subplot(111, frameon=False) #ticks for demo puropses
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5/plotTree.totalW; plotTree.yOff = 1.0;
    plotTree(inTree, (0.5,1.0), ‘‘)
    plt.show()

把树改一下

myTree[‘no surfacing‘][3]=‘maybe‘

重新画,变成这样

3.3 测试和存储分类器

3.3.1 测试算法:使用决策树执行分类

def classify(inputTree,featLabels,testVec):
    firstStr = inputTree.keys()[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    key = testVec[featIndex]
    valueOfFeat = secondDict[key]
    if isinstance(valueOfFeat, dict):
        classLabel = classify(valueOfFeat, featLabels, testVec)
    else: classLabel = valueOfFeat
    return classLabel

3.3.2 使用算法:决策树的存储

def storeTree(inputTree,filename):
    import pickle
    fw = open(filename,‘w‘)
    pickle.dump(inputTree,fw)
    fw.close()

def grabTree(filename):
    import pickle
    fr = open(filename)
    return pickle.load(fr)

3.4 使用决策树预测隐形眼镜类型

fr=open(‘G:\学习\机器学习实战\MLiA_SourceCode\machinelearninginaction\Ch03\lenses.txt‘)
lenses=[inst.strip().split(‘\t‘) for inst in fr.readlines()]
lensesLabels=[‘age‘,‘prescript‘,‘astigmatic‘,‘tearRate‘]
lensesTree=tree.createTree(lenses,lensesLabels)
treeplot.createPlot(lensesTree)

时间： 2024-10-13 08:16:29

机器学习实战读书笔记(三)决策树的相关文章

机器学习实战读书笔记(二)k-近邻算法

knn算法: 1.优点:精度高.对异常值不敏感.无数据输入假定 2.缺点:计算复杂度高.空间复杂度高. 3.适用数据范围:数值型和标称型. 一般流程: 1.收集数据 2.准备数据 3.分析数据 4.训练算法:不适用 5.测试算法:计算正确率 6.使用算法:需要输入样本和结构化的输出结果,然后运行k-近邻算法判定输入数据分别属于哪个分类,最后应用对计算出的分类执行后续的处理. 2.1.1 导入数据 operator是排序时要用的 from numpy import * import operato

机器学习实战读书笔记(五)Logistic回归

Logistic回归的一般过程 1.收集数据:采用任意方法收集 2.准备数据:由于需要进行距离计算,因此要求数据类型为数值型.另外,结构化数据格式则最佳 3.分析数据:采用任意方法对数据进行分析 4.训练算法:大部分时间将用于训练,训练的目的是为了找到最佳的分类回归系数 5.测试算法:一旦训练步骤完成,分类将会很快. 6.使用算法:首先,我们需要输入一些数据,并将其转换成对应的结构化数值:接着,基于训练好的回归系数就可以对这些数值进行简单回归计算,判定它们属于哪个类别:在这之后,我们就可以在输

机器学习实战读书笔记(四)基于概率论的分类方法：朴素贝叶斯

4.1 基于贝叶斯决策理论的分类方法朴素贝叶斯优点:在数据较少的情况下仍然有效,可以处理多类别问题缺点:对于输入数据的准备方式较为敏感适用数据类型:标称型数据贝叶斯决策理论的核心思想:选择具有最高概率的决策. 4.2 条件概率 4.3 使用条件概率来分类 4.4 使用朴素贝叶斯进行文档分类朴素贝叶斯的一般过程: 1.收集数据 2.准备数据 3.分析数据 4.训练算法 5.测试算法 6.使用算法朴素贝叶斯分类器中的另一个假设是,每个特征同等重要. 4.5 使用Python进行文本分类

机器学习实战读书笔记(一)机器学习基础

http://sourceforge.net/projects/numpy/files/ 下载对应版本的numpy,到处下不到,找到一个没python2.7 用pip吧, pip install numpy 下载完毕,提示没装C++,意思是还要装VS2008,但装的是VS2012,只好去下载一个VC for python http://www.microsoft.com/en-us/download/confirmation.aspx?id=44266 重新pip,等了大半天,终算成功了输入命

《R实战》读书笔记三

第二章创建数据集本章概要 1探索R数据结构 2使用数据编辑器 3数据导入 4数据集标注本章所介绍内容概括如下. 两个方面的内容. 方面一:R数据结构方面二:进入数据或者导入数据到数据结构理解数据集一个数据集通常由一个表格组合而成,行表示观测,列表示变量.病人的数据集如表1所示. 表1 病人数据集数据集能够反映数据结构.数据类型和内容. 数据结构 R数据结构如图2所示. 图2:R数据结构数据结构即数据的组织方式,R数据结构包括向量.矩阵.数组.数据框和列表等. R向量 R向量是一

R实战读书笔记四

第三章图形入门本章概要 1 创建和保存图形 2 定义符号.线.颜色和坐标轴 3 文本标注 4 掌控图形维数 5 多幅图合在一起本章所介绍内容概括如下. 一图胜千字,人们从视觉层更易获取和理解信息. 图形工作 R具有非常强大的绘图功能,看下面代码. > attach(mtcars) > plot(wt, mpg) > abline(lm(mpg~wt)) > title("Regression of MPG on Weight") > detach(m

机器学习实战学习笔记（一）

1.k-近邻算法算法原理: 存在一个样本数据集(训练样本集),并且我们知道样本集中的每个数据与其所属分类的对应关系.输入未知类别的数据后将新数据的每个特征与样本集中数据对应的特征进行比较,然后算法提取样本集中特征最相似(最近邻)的k组数据.然后将k组数据中出现次数最多的分类,来作为新数据的分类. 算法步骤: 计算已知类别数据集中的每一个点与当前点之前的距离.(相似度度量) 按照距离递增次序排序选取与当前点距离最小的k个点确定k个点所在类别的出现频率返回频率最高的类别作为当前点的分类 py

机器学习实战之一---简单讲解决策树

机器学习实战之一---简单讲解决策树 https://blog.csdn.net/class_brick/article/details/78855510 前言:本文基于<机器学习实战>一书,采用python语言,对于机器学习当中的常用算法进行说明. 一. 综述定义:首先来对决策树进行一个定义,决策树是一棵通过事物的特征来进行判断分支后得到该事物所需要的预测的属性的树. 流程:提取特征à计算信息增益à构建决策树à使用决策树进行预测关键:树的构造,通过信息增益(熵)得到分支点和分支的方式.

《你必须知道的.NET》读书笔记三：体验OO之美

一.依赖也是哲学 (1)本质诠释:"不要调用我们,我们会调用你" (2)依赖和耦合: ①无依赖,无耦合: ②单向依赖,耦合度不高: ③双向依赖,耦合度较高: (3)设计的目标:高内聚,低耦合. ①低耦合:实现最简单的依赖关系,尽可能地减少类与类.模块与模块.层次与层次.系统与系统之间的联系: ②高内聚:一方面代表了职责的统一管理,一方面又代表了关系的有效隔离: (4)控制反转(IoC):代码的控制器交由系统控制而不是在代码内部,消除组件或模块间的直接依赖: (5)依赖注入(DI): ①