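Example: someone wants to build a classifier from the 1000 rows of training samples below, sorting the data into 3 classes (like, neutral, dislike). Each sample has 3 main features:
A: frequent flyer miles earned per year
B: percentage of time spent playing video games
C: liters of ice cream consumed per week
1. Reading the data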
from numpy import zeros

filename = 'D://machine_learn//Ch02//datingTestSet2.txt'

def file2matrix(filename):
    # Read all lines, closing the file afterwards
    with open(filename) as fr:
        a = fr.readlines()
    numberOfLines = len(a)                   # number of lines in the file
    returnMat = zeros((numberOfLines, 3))    # feature matrix to return
    classLabelVector = []                    # class labels to return
    index = 0
    for line in a:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]    # row `index` = the three feature values
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector

data, labels = file2matrix(filename)
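For a tab-separated file with three feature columns followed by an integer label, NumPy's `loadtxt` offers a shorter route. The sketch below is only an illustration, not the method used in this post: `load_dating_file` is a hypothetical helper name, and the three rows are made-up values written to a temporary file so that the snippet runs on its own.

```python
import os
import tempfile

import numpy as np


def load_dating_file(path):
    """Read a tab-separated file: three feature columns, then an integer label."""
    table = np.loadtxt(path, delimiter="\t", ndmin=2)
    features = table[:, :3]              # columns A, B, C
    labels = table[:, -1].astype(int)    # last column is the class label
    return features, labels


# Made-up rows for illustration only; not taken from datingTestSet2.txt.
sample_text = "1000\t5.0\t0.5\t3\n2000\t1.0\t1.5\t2\n300\t0.0\t0.1\t1\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample_text)
    tmp_path = tmp.name

sample_features, sample_labels = load_dating_file(tmp_path)
os.remove(tmp_path)

print(sample_features.shape)
print(sample_labels.tolist())
```

Unlike the loop version, this returns the labels as a NumPy integer array rather than a Python list.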
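2. Normalizing the data: the values of feature A are far larger than those of B and C, so the data must be normalized before the three features can carry truly equal weight.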
from numpy import tile

def autoNorm(dataSet):
    minVals = dataSet.min(0)    # minimum of each column
    maxVals = dataSet.max(0)    # maximum of each column
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))    # element-wise divide
    return normDataSet, ranges, minVals
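The normalization above maps every feature into the range 0 to 1 using newValue = (oldValue - min) / (max - min). The same computation can be sketched with NumPy broadcasting instead of `tile`. This is a minimal sketch: `min_max_normalize` is a hypothetical name, the toy values are made up, and it assumes that no column is constant.

```python
import numpy as np


def min_max_normalize(data_set):
    """Scale every column to [0, 1]: (value - min) / (max - min)."""
    min_vals = data_set.min(axis=0)
    ranges = data_set.max(axis=0) - min_vals
    # Broadcasting applies the per-column min and range to every row.
    # Assumes no column is constant (a zero range would divide by zero).
    return (data_set - min_vals) / ranges, ranges, min_vals


# Made-up values: the first column is on a much larger scale than the others.
toy = np.array([[1000.0, 5.0, 0.5],
                [2000.0, 1.0, 1.5],
                [3000.0, 3.0, 1.0]])
toy_norm, toy_ranges, toy_mins = min_max_normalize(toy)
print(toy_norm)
```

After scaling, all three columns span the same interval, so none of them dominates the distance computation.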
3. Classifying with the kNN algorithm
3.1 The idea behind the kNN algorithm, in brief
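In short, kNN classifies a new point by measuring its distance to every labelled training point, picking the k nearest ones, and returning the label that occurs most often among them. The toy sketch below illustrates that idea under stated assumptions: `knn_predict` is a hypothetical name, the four training points are made up, and Euclidean distance is used as the distance measure.

```python
from collections import Counter

import numpy as np


def knn_predict(query, points, labels, k):
    """Return the majority label among the k training points nearest to query."""
    distances = np.sqrt(((points - query) ** 2).sum(axis=1))  # Euclidean distance
    nearest = distances.argsort()[:k]                         # indices of the k closest
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Made-up training points: two near (1, 1) labelled "A", two near (0, 0) labelled "B".
toy_points = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
toy_labels = ["A", "A", "B", "B"]
print(knn_predict(np.array([0.1, 0.2]), toy_points, toy_labels, k=3))  # B
```

With k=3 the query point's three nearest neighbours are two "B" points and one "A" point, so the vote goes to "B".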
3.2 Implementing kNN in Python
import operator
from numpy import tile

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Euclidean distance from inX to every row of dataSet
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()    # indices ordered from nearest to farthest
    # Count the labels of the k nearest neighbours
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # items() works in both Python 2 and 3; iteritems() is Python 2 only
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
3.3 Applying kNN to the data above and computing the error rate
def datingClassTest():
    hoRatio = 0.50    # hold out 50% of the rows as test data
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')    # load data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        # Classify test row i against the rows that were not held out
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
    print(errorCount)
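The test above uses the first hoRatio share of the rows as test samples, classifies each of them against the remaining rows, and reports the fraction of wrong answers. The sketch below isolates that hold-out procedure so it can be tried without the data file. It rests on assumptions: `holdout_error_rate` and `nearest_label` are hypothetical names, a 1-nearest-neighbour rule stands in for classify0, and the six toy rows are made up.

```python
import numpy as np


def nearest_label(query, points, labels):
    """1-nearest-neighbour stand-in for the real classifier (an assumption)."""
    distances = ((points - query) ** 2).sum(axis=1)
    return labels[int(distances.argmin())]


def holdout_error_rate(data, labels, ho_ratio, classify):
    """Test on the first ho_ratio share of rows, using the rest as training data."""
    m = data.shape[0]
    num_test = int(m * ho_ratio)    # assumed to be at least 1
    errors = 0
    for i in range(num_test):
        predicted = classify(data[i], data[num_test:], labels[num_test:])
        if predicted != labels[i]:
            errors += 1
    return errors / num_test


# Made-up rows: the first three are tested against the last three.
toy_data = np.array([[0.0, 0.2], [1.0, 0.9], [0.8, 0.1],
                     [0.0, 0.0], [1.0, 1.0], [0.1, 0.1]])
toy_labels = [1, 2, 2, 1, 2, 1]
toy_error = holdout_error_rate(toy_data, toy_labels, 0.5, nearest_label)
print("error rate: %f" % toy_error)
```

Because the split simply takes the first rows, the estimate is only meaningful when the rows are not sorted by class.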
4. Visualizing the classification results
import matplotlib.pyplot as plt
from numpy import array

fig = plt.figure()
ax = fig.add_subplot(111)
# ax.scatter(data[:, 0], data[:, 1])
ax.set_xlabel('B')
ax.set_ylabel('C')
# Plot feature B against feature C; marker size and colour both encode the class label
ax.scatter(data[:, 1], data[:, 2], 15.0 * array(labels), array(labels))
# Hand-made legend: one marker per class, drawn at fixed positions
classes = sorted(set(labels))
ax.scatter([20, 20, 20], [1.8, 1.6, 1.4], 15 * array(classes), classes)
legends = ['dislike', 'smallDoses', 'largeDoses']
ax.text(22, 1.8, legends[0])
ax.text(22, 1.6, legends[1])
ax.text(22, 1.4, legends[2])
plt.show()
Time: 2024-11-06 18:12:44