I. Introduction
Decision trees show up in virtually every introductory machine learning book. Their decision process closely mirrors how we reason in everyday life, so they are very easy to understand, and they carry a fair amount of information theory inside, which makes them a nice first application of it. Decision trees come in both regression and classification flavors, but it is usually the classification variety that gets discussed.
Personally, I think of a decision tree as extracting features from the data and ordering them from most to least discriminative. A familiar application: a program asks you a series of questions and guesses what you are thinking of.
Why is the first question always "male or female"? Why? Read on and you will see.
II. Code
from math import log
import operator

def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    # change to discrete values
    return dataSet, labels

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:  # count the occurrences of each class label
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)  # log base 2
    return shannonEnt

def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]  # chop out the axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1  # the last column is the class label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):  # iterate over all the features
        featList = [example[i] for example in dataSet]  # all values of this feature
        uniqueVals = set(featList)  # the unique values of this feature
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy  # information gain, i.e. reduction in entropy
        if infoGain > bestInfoGain:  # keep the best gain seen so far
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature  # returns an integer index

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]  # stop splitting when all classes are equal
    if len(dataSet[0]) == 1:  # stop splitting when no features are left
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]  # copy labels so recursion doesn't clobber the caller's list
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree

def classify(inputTree, featLabels, testVec):
    firstStr = list(inputTree.keys())[0]  # list() needed in Python 3
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    key = testVec[featIndex]
    valueOfFeat = secondDict[key]
    if isinstance(valueOfFeat, dict):
        classLabel = classify(valueOfFeat, featLabels, testVec)  # descend into the subtree
    else:
        classLabel = valueOfFeat
    return classLabel

def storeTree(inputTree, filename):
    import pickle
    with open(filename, 'wb') as fw:  # pickle needs binary mode
        pickle.dump(inputTree, fw)

def grabTree(filename):
    import pickle
    with open(filename, 'rb') as fr:
        return pickle.load(fr)
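storeTree and grabTree persist the learned tree with pickle, so the tree does not have to be rebuilt for every query. A quick round-trip check (a minimal sketch; 'classifierStorage.txt' is just an illustrative filename):

myDat, labels = createDataSet()
myTree = createTree(myDat, labels[:])        # pass a copy, createTree consumes labels
storeTree(myTree, 'classifierStorage.txt')   # illustrative filename
print(grabTree('classifierStorage.txt'))     # prints the same dict as myTree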
III. Algorithm Walkthrough
◆ Information gain
Input: a dataset. Output: the Shannon entropy of that dataset (the quantity the information gain is computed from).
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:  # count the occurrences of each class label
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)  # log base 2
    return shannonEnt
Once we can compute the entropy, we simply split the dataset in whatever way yields the largest information gain.
e.g., run it on the dataset below:
[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

labelCounts is a map:

currentLabel    labelCounts[currentLabel]    prob
yes             2                            0.4
no              3                            0.6

Plugging into the entropy formula gives -(0.4*log2(0.4) + 0.6*log2(0.6)) ≈ 0.971
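To verify that number, here is a minimal sketch (it assumes calcShannonEnt from section II is in scope):

from math import log

dataSet = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
byHand = -(0.4 * log(0.4, 2) + 0.6 * log(0.6, 2))
print(byHand)                   # 0.9709505944546686
print(calcShannonEnt(dataSet))  # same value, computed by the function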
◆ Splitting the dataset
※ Splitting on a given feature
Input: the dataset, the index axis of a feature (counting from 0), and a value of that feature.
Output: the subset of rows whose feature at axis equals value, with that feature column removed.
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]  # chop out the axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
e.g., with myDat = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']], calling splitDataSet(myDat, 0, 1) returns [[1, 'yes'], [1, 'yes'], [0, 'no']].
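The complementary split on value 0 works the same way (a sketch assuming createDataSet and splitDataSet from section II):

myDat, labels = createDataSet()
print(splitDataSet(myDat, 0, 1))  # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(splitDataSet(myDat, 0, 0))  # [[1, 'no'], [1, 'no']]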
※ Choosing the best way to split the dataset
Input: the dataset.
Output: the index of the feature whose split produces the largest information gain, i.e. the biggest drop in entropy.
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1  # the last column is the class label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):  # iterate over all the features
        featList = [example[i] for example in dataSet]  # all values of this feature
        uniqueVals = set(featList)  # the unique values of this feature
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy  # information gain, i.e. reduction in entropy
        if infoGain > bestInfoGain:  # keep the best gain seen so far
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature  # returns an integer index
e.g., with myDat = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']], call chooseBestFeatureToSplit(myDat). The first pass splits on feature 0 (by value 1, then by value 0) and computes the weighted entropy of the result; the second pass does the same for feature 1; and so on. The feature with the largest information gain is selected, as worked through below.
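Worked through by hand on that dataset (the numbers follow from calcShannonEnt; the base entropy is 0.971):

myDat, labels = createDataSet()
# feature 0: gain = 0.971 - (3/5)*0.918 - (2/5)*0.0 ≈ 0.420
# feature 1: gain = 0.971 - (4/5)*1.000 - (1/5)*0.0 ≈ 0.171
print(chooseBestFeatureToSplit(myDat))  # 0, i.e. 'no surfacing' wins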
◆ Recursively building the tree

First, a helper: majorityCnt votes on a class list and returns the most frequent label.
def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
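It is the tie-breaker: when the features are exhausted but a leaf still contains a mix of classes, the most frequent label wins. In Python 3, collections.Counter expresses the same idea more compactly (an equivalent sketch, not the original code):

from collections import Counter

def majorityCnt(classList):
    # most_common(1) returns [(label, count)] for the most frequent label
    return Counter(classList).most_common(1)[0][0]

print(majorityCnt(['yes', 'no', 'no']))  # 'no'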
O(∩_∩)O~ now let's build the tree!
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]  # stop splitting when all classes are equal
    if len(dataSet[0]) == 1:  # stop splitting when no features are left
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]  # copy labels so recursion doesn't clobber the caller's list
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
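On the sample dataset this yields the familiar nested-dict tree:

myDat, labels = createDataSet()
print(createTree(myDat, labels))
# {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
# note: createTree deletes entries from labels, so labels is consumed here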
O(∩_∩)O~~ and now use the tree to make decisions!
def classify(inputTree, featLabels, testVec):
    firstStr = list(inputTree.keys())[0]  # list() needed in Python 3
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    key = testVec[featIndex]
    valueOfFeat = secondDict[key]
    if isinstance(valueOfFeat, dict):
        classLabel = classify(valueOfFeat, featLabels, testVec)  # descend into the subtree
    else:
        classLabel = valueOfFeat
    return classLabel
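A quick classification run (note that createTree mutates labels, so train on a copy and keep the original for classify):

myDat, labels = createDataSet()
myTree = createTree(myDat, labels[:])    # train on a copy of labels
print(classify(myTree, labels, [1, 0]))  # 'no'
print(classify(myTree, labels, [1, 1]))  # 'yes'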