收集数据
数据来源:http://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival
文本数据如下图所示:
31,65,4,1
33,58,10,1
33,60,0,1
34,59,0,2
34,66,9,2
这是关于乳腺癌已手术患者存活时间(寿命)的样本集,文本文件中共包含306个样本,样本包含的属性有:
1. 患者做手术时的年龄 opAge
2. 患者做手术的年份-1900 opYear,比如1970年做的手术,则opYear属性的值为70
3. 阳性腋窝淋巴结的数目 cellNum
4. 存活时间 status,其中,status等于1表示该患者存活了5年以上,status等于2表示该患者在5年之内死亡。
使用pandas读取文本文件,将数据转换为DataFrame对象:
data = pd.read_table(r"C:\data\Haberman‘s Survival Data.txt", sep=",", header=None, names=[‘opAge‘, ‘opYear‘, ‘cellNum‘, ‘status‘], engine="python")
data1 = data[data[‘status‘] == 1] # status为1的数据
data2 = data[data[‘status‘] == 2] # status为2的数据
查看data.shape为(306, 4)
数据散点图
fig = plt.figure(figsize=(16, 12))
ax = fig.gca(projection="3d") #get current axis
ax.scatter(data1[‘opAge‘], data1[‘opYear‘], data1[‘cellNum‘], c=‘r‘, s=100, marker="*", label="survived 5 years or longer") #status为1的样本的散点图
ax.scatter(data2[‘opAge‘], data2[‘opYear‘], data2[‘cellNum‘], c=‘b‘, s=100, marker="^", label="died within 5 year") #status为2的样本的散点图
ax.set_xlabel("operation age", size=15)
ax.set_ylabel("operation year", size=15)
ax.set_zlabel("cell number", size=15)
ax.set_title(‘Haberman\‘s Survival Scatter Plot‘, size=15, weight=‘bold‘)
ax.set_zlim(0, 30)
ax.legend(loc="lower right", fontsize=15)
plt.show()
得到的3D散点图如下所示:
准备数据
KNN算法的原则是:找到距离目标样本距离最近的K个样本,这K个样本中类别出现次数最多的那个类,作为目标样本的类别。这里采用的距离是欧式距离,那么数值较大的属性在衡量距离的时候起的作用较大,这会导致样本距离衡量出现偏差,为了屏蔽这种影响,需要对数据进行归一化:
# 归一化 (x-min)/(max-min)∈[0,1]
def autoNorm(dataSet):
minVals = samples.min(axis=0) # 按列求最小值,即求每个属性的最小值
maxVals = samples.max(axis=0) # 求每个属性的最大值
factors = maxVals - minVals # 归一化因子
sNum = dataSet.shape[0] # 数据集的行数,即样本数
normDataSet = (dataSet - np.tile(minVals, (sNum, 1))) / np.tile(factors, (sNum, 1)) # 先将minVals和归一化因子转换成与dataSet相同的shape,再做减法和除法运算,最终归一化后的数据都介于[0,1]
return normDataSet
训练算法
划分数据集
采用K折交叉验证的方法来评估分类器的性能,从样本集中随机抽取10%作为测试集testing data,其余的90%用来做10 fold cross validation
testIdxs = random.sample(range(0, len(samples), 1), len(samples) * 1 / 10) # 随机选取testing data的索引
testingSet = samples.ix[testIdxs] # 根据索引从样本集中获取testing data
idxs = range(0, len(samples), 1) # 总的数据索引序列
#以下for循环是从总的数据索引序列中将testing data的索引去除
for i in range(0, len(testIdxs)):
idxs.remove(testIdxs[i])
trainData = samples.ix[idxs] # 获取用作训练的数据集
对trainData使用10折交叉验证,首先随机打乱trainData的索引序列random.shuffle(idxs)
idxs是引用类型,在执行完remove操作后,idxs已只剩trainData的索引。
k nearest neighbor
采用欧氏距离作为距离评价准则:
#inX: 目标样本
#dataSet: 用来找k nearest neighbor的数据集,labels是该数据集对应的类别标签,dataSet和labels的索引是一一对应的
def classifyKNN(inX, dataSet, labels, k):
#以下代码是为了防止出现这种情况:dataSet和labels的索引不是从0开始的有序自然数,导致在argsort排序的时候出现错乱,因为argsort排序结果是从0开始的自然数,因此首先需要重置dataSet和labels的索引,使其索引变为依次从0开始自然数。
nDataSet = zeros((dataSet.shape[0], dataSet.shape[1])) #与dataSet同型的0矩阵
j = 0
for i in dataSet.index:
nDataSet[j] = dataSet.ix[i]
j += 1
nDataSet = pd.DataFrame(nDataSet)
nLabels = zeros(labels.shape[0]) #与labels同型的0向量
h = 0
for i in labels.index:
nLabels[h] = labels.ix[i]
h += 1
dataSetNum = nDataSet.shape[0] # 样本数(DataFrame行数)
diffMat = tile(inX, (dataSetNum, 1)) - nDataSet #目标样本与参照样本集的差,对应属性相减,结果为与nDataSet同型的矩阵
sqDiffMat = diffMat ** 2 #平方
sqDistances = sqDiffMat.sum(axis=1) #矩阵sqDiffMat的列之和,即目标样本与样本集中每个样本对应属性的差值的平方和
distances = sqDistances ** 0.5 #平方根,欧氏距离,即目标样本与每个样本点的距离
sortedDistanceIdx = distances.argsort() # 距离从小到大的索引值,sortedDistanceIdx的索引是从0开始的自然数,sortedDistanceIdx的值表示对应的distance的索引,比如sortedDistanceIdx[0]是150,表示最小的距离在distances中的索引是150
classCount = {}
for i in range(k):
#找出distance最小的k个索引,然后在nLabels中获取其对应类别
voteLabel = nLabels[int(sortedDistanceIdx[i])]
classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
#classCount字典中存放了统计的label和对应的出现次数
sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True) #倒序
return sortedClassCount[0][0] #出现次数最大的label
10折交叉验证
一次验证
#返回在该验证集上的错误率
def train(trainingSet, validationSet, kn):
errorCount = 0
vIdxs = validationSet.index
#遍历验证集,对每个样本使用KNN
for i in range(0, len(validationSet)):
pred = classifyKNN(validationSet.loc[vIdxs[i], [‘opAge‘, ‘opYear‘, ‘cellNum‘]], trainingSet[[‘opAge‘, ‘opYear‘, ‘cellNum‘]], trainingSet[‘status‘], kn)
if (pred != validationSet.at[vIdxs[i], ‘status‘]):
errorCount += 1
return errorCount / float(len(validationSet))
10次验证
使用10次验证的平均错误率来评价KNN分类器的性能
# dataSet:用来交叉验证的数据集,idxs是对应的索引序列
# k: k折交叉验证
# kn: kn近邻
def crossValidation(dataSet, idxs, k, kn):
step = len(idxs) / k
errorRate = 0
for i in range(k):
validationIdx = []
for i in range(i * step, (i + 1) * step):
validationIdx.append(idxs[i])
validationSet = dataSet.ix[validationIdx] # 获得验证集数据
temp = idxs[:]
for i in validationIdx: # 把验证集的索引去除
temp.remove(i)
trainingSet = dataSet.ix[temp] # 获取训练集数据
errorRate += train(trainingSet, validationSet, kn)
aveErrorRate = errorRate / float(k)
return aveErrorRate
测试算法
交叉验证完毕之后,使用全部的trainData,对testingData进行预测
def predict(trainingSet, testingSet, kn):
errorCount = 0
for i in range(0, len(testingSet)):
vIdxs = testingSet.index
pred = classifyKNN(testingSet.loc[vIdxs[i], [‘opAge‘, ‘opYear‘, ‘cellNum‘]], trainingSet[[‘opAge‘, ‘opYear‘, ‘cellNum‘]], trainingSet[‘status‘], kn)
print "The prediction label is %s"%(pred)
print "The real label is %s"%(testingSet.at[vIdxs[i], ‘status‘])
if (pred != testingSet.at[vIdxs[i], ‘status‘]):
errorCount += 1
return errorCount / float(len(testingSet))
print "The cross validation error ratio is %d" %crossValidation(trainData, idxs, 10, 3)
print "The testing data error ratio is %d"%predict(samples,testingSet,3)
运行结果为:
The cross validation error ratio is 0.28
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 1
The prediction result is 2
The real result is 2
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 2
The prediction result is 1
The real result is 1
The prediction result is 2
The real result is 1
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 1
The prediction result is 2
The real result is 2
The prediction result is 1
The real result is 2
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 1
The prediction result is 2
The real result is 2
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 2
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 1
The prediction result is 1
The real result is 2
The testing data error ratio is 0.17
时间: 2024-10-08 09:46:20