《机器学习实战》菜鸟学习笔记（一）

《机器学习实战》终于到手了，开始学习了。由于本人python学的比较挫，所以学习笔记里会有许多python的内容。

1、 python及其各种插件的安装

由于我使用了win8.1 64位系统（正版的哦），所以像numpy 和 matploblib这种常用的插件不太好装，解决方案就是Anaconda-2.0.1-Windows-x86_64.exe 一次性搞定。

2、kNN代码

 1 #-*-coding:utf-8-*-
 2 from numpy import *
 3 import operator
 4
 5 def createDataSet():
 6     group = array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
 7     labels = [‘A‘,‘A‘,‘B‘,‘B‘]
 8     return group,labels
 9
10 def classify0(inX,dataSet,labels,k):
11     dataSetSize = dataSet.shape[0]
12     diffMat = tile(inX,(dataSetSize,1))-dataSet
13     sqDiffMat = diffMat ** 2
14     sqDistances = sqDiffMat.sum(axis = 1)
15     distances = sqDistances ** 0.5
16     sortedDistIndicies = distances.argsort()  #indices
17     classCount = {}
18     for i in range(k):
19         voteIlabel = labels[sortedDistIndicies[i]]
20         classCount[voteIlabel] = classCount.get(voteIlabel,0)+1
21         #找出最大的那个
22     sortedClassCount = sorted(classCount.iteritems(),
23         key = operator.itemgetter(1),reverse = True)
24     return sortedClassCount[0][0]

这里的疑惑主要出现在：

（1）array与list有什么区别

array 是numpy里面定义的。为了方便计算，比如

1 array([1,2])+array([3,4])
2 [1,2]+[3,4]

执行以下就可以知道他们的差别了

（2）shape[0]返回的是哪一维度的大小（不要嘲笑我小白，我真的不知道）

找到文档看了一下就开朗了。ndarray.shape “the dimensions of the array. This is a tuple of integers indicating the size of the array in each dimension. For a matrix with n rows and m columns, shape will be (n,m). The length of the shape tuple is therefore the rank, or number of dimensions, ndim.”

（3）tile函数

tile函数是经常使用的函数，用于扩充array。举例：

1 >>> b = np.array([[1, 2], [3, 4]])
2 >>> np.tile(b, 2)
3 array([[1, 2, 1, 2],
4        [3, 4, 3, 4]])
5 >>> np.tile(b, (2, 1))
6 array([[1, 2],
7        [3, 4],
8        [1, 2],
9        [3, 4]])

这下就懂了吧。为什么要用这个函数呢？因为后面两个array要做差，这样做就可以不用使用循环了，典型的空间换时间。那么为什么要做差呢？好吧，因为这是knn算法。

（4）array的sum函数

写到这里，我决定要好好读读numpy文档了。

numpy.sum(a, axis=None, dtype=None, out=None, keepdims=False)

一个sum函数还是挺麻烦的呢

>>> np.sum([[0, 1], [0, 5]], axis=0)
array([0, 6])
>>> np.sum([[0, 1], [0, 5]], axis=1)
array([1, 5])

这样大家都清楚了

（5）最后一行，return了什么？

表面看起来像是二维数组的第一个元素，但是sortedClassCount是二维数组吗？

写了一个小的验证程序，发现sortedClassCount是一个list，元素是tuple。

L = {1:12,3:4}
sortedL = sorted(L.iteritems(),key=operator.itemgetter(1))
print sortedL
#结果
[(3, 4), (1, 12)]

时间： 2024-10-10 21:43:47

《机器学习实战》菜鸟学习笔记（一）

《机器学习实战》菜鸟学习笔记（一）的相关文章

《机器学习实战》学习笔记：k-近邻算法实现

《机器学习实战》学习笔记：支持向量机

Python3《机器学习实战》学习笔记

《机器学习实战》学习笔记：利用Adaboost元算法提高分类性能

《机器学习实战》学习笔记：基于朴素贝叶斯的分类方法

《机器学习实战》学习笔记：Logistic回归&预测疝气病证的死亡率

《机器学习实战》学习笔记——第2章 KNN

《机器学习实战》学习笔记：基于朴素贝叶斯的垃圾邮件过滤

《机器学习实战》学习笔记：决策树的实现

《机器学习实战》学习笔记——k近邻算法