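SVD is a powerful tool for extracting information. With an SVD we can represent the original dataset using a much smaller one, which in effect removes noise and redundant information.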
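To make the "smaller dataset" idea concrete, here is a minimal sketch of a truncated SVD, assuming NumPy is available. The helper name `truncated_svd_approx` and the choice of k = 3 are illustrative assumptions, not part of the original post. It keeps only the k largest singular values and measures how much of the matrix is lost.

```python
import numpy as np


def truncated_svd_approx(A, k):
    """Rank-k approximation of A built from its k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]


# Small ratings-style matrix used only for illustration.
A = np.array([[4, 4, 0, 2, 2],
              [4, 0, 0, 3, 3],
              [4, 0, 0, 1, 1],
              [1, 1, 1, 2, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0]], dtype=float)

s = np.linalg.svd(A, compute_uv=False)   # singular values, largest first
A3 = truncated_svd_approx(A, 3)          # keep three singular values

# Relative Frobenius-norm error of the rank-3 approximation.
rel_err = np.linalg.norm(A - A3) / np.linalg.norm(A)
print("singular values:", np.round(s, 3))
print("relative error of rank-3 approximation:", rel_err)
```

The error of the approximation is determined entirely by the singular values that were dropped, so small trailing singular values can be discarded with little loss.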
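Latent Semantic Indexing
One of the earliest applications of SVD was information retrieval, where the approach is known as latent semantic indexing (LSI). In LSI, a matrix is built from documents and terms, and applying SVD to that matrix yields a set of singular values. These singular values correspond to concepts or topics in the documents, which makes document search more efficient.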
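The following is a minimal sketch of LSI on a toy term-document matrix, assuming NumPy. The terms, the counts, the choice of k = 2, and the helper names `fold_in` and `cosine` are all made up for illustration. Documents 0 and 1 use only linear-algebra terms and documents 2 and 3 use only recommender terms, so the two retained concepts line up with those two topics.

```python
import numpy as np

# Rows are terms, columns are documents (hypothetical counts).
terms = ["svd", "matrix", "rank", "movie", "rating", "user"]
A = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 2, 1],
              [0, 0, 1, 3],
              [0, 0, 1, 1]], dtype=float)

k = 2  # number of concepts to keep (an assumption for this toy example)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_coords = Vt[:k, :].T          # one k-dimensional row per document


def fold_in(q):
    """Map a term-count vector into concept space: q^T U_k Sigma_k^{-1}."""
    return (q @ U[:, :k]) / s[:k]


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Query containing only the term "movie".
q = np.zeros(len(terms))
q[terms.index("movie")] = 1.0
q_hat = fold_in(q)

scores = [cosine(q_hat, d) for d in doc_coords]
ranking = np.argsort(scores)[::-1]   # best-matching documents first
print("scores:", np.round(scores, 3))
print("ranking:", ranking)
```

Document 3 contains "movie" less often than document 2, yet both score about equally because they share the same concept. That is the sense in which the retained singular values act as topics.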
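Recommender Systems
Another application of SVD is recommender systems. A simple recommender computes the similarity between items or between users. A more advanced approach uses SVD to build a topic space from the data and then computes similarity in that space.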
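As a rough illustration of the simple version, the sketch below computes the cosine similarity between every pair of item columns in a small user-by-item ratings matrix. It assumes NumPy, treats unrated entries (0) as plain zeros, and assumes every item has at least one rating. The helper name `cosine_item_similarity` is hypothetical.

```python
import numpy as np


def cosine_item_similarity(R):
    """Cosine similarity between every pair of columns (items) of R."""
    norms = np.linalg.norm(R, axis=0)          # one norm per item column
    return (R.T @ R) / np.outer(norms, norms)


# Rows are users, columns are items; 0 means "not rated".
R = np.array([[4, 4, 0, 2, 2],
              [4, 0, 0, 3, 3],
              [4, 0, 0, 1, 1],
              [1, 1, 1, 2, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0]], dtype=float)

sim = cosine_item_similarity(R)

# Most similar item to item 4, ignoring item 4 itself.
row = sim[4].copy()
row[4] = -np.inf
most_similar = int(np.argmax(row))
print(np.round(sim, 3))
print("item most similar to item 4:", most_similar)
```

This uses the raw cosine value rather than a rescaled one, and it compares whole columns rather than only co-rated entries, so it is a simplification for illustration.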
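Below, SVD is applied to a simple recommender system in Python.
The implementation relies on two commonly used functions:
Python's sorted and NumPy's nonzero.
sorted is a Python built-in. The implementation uses it to sort a list of tuples, as follows: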
>>> student_tuples = [
...     ('john', 'A', 15),
...     ('jane', 'B', 12),
...     ('dave', 'B', 10),
... ]
>>> sorted(student_tuples, key=lambda student: student[2])   # sort by age
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]
Reference: the sorted function
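For a recommender, the same call is typically used to pick the N highest-scoring items. The sketch below uses made-up (item, score) pairs and a hypothetical helper `top_n`. It sorts on the second tuple field with `operator.itemgetter` in descending order.

```python
from operator import itemgetter

# Hypothetical (item index, estimated score) pairs.
item_scores = [(0, 3.2), (3, 4.7), (5, 1.9), (8, 4.1)]


def top_n(scores, n=3):
    """Return the n pairs with the highest score, best first."""
    return sorted(scores, key=itemgetter(1), reverse=True)[:n]


print(top_n(item_scores, 2))  # [(3, 4.7), (8, 4.1)]
```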
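nonzero returns the indices of the nonzero elements. For a 2-D array the result is a tuple of two index arrays, row indices first and column indices second. For example: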
>>> a = np.array([[1,2,3],[4,5,6],[7,8,9]])
>>> a > 3
array([[False, False, False],
       [ True,  True,  True],
       [ True,  True,  True]], dtype=bool)
>>> np.nonzero(a > 3)
(array([1, 1, 1, 2, 2, 2]), array([0, 1, 2, 0, 1, 2]))
Reference: the nonzero function
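The recommender uses nonzero in two ways: to find the items a user has not rated, and to find the users who rated both of two items. A minimal sketch on a made-up 3x4 ratings matrix, assuming NumPy's matrix type (`np.asmatrix`), whose `.A` attribute gives the underlying ndarray:

```python
import numpy as np

# Rows are users, columns are items; 0 means "not rated" (made-up data).
ratings = np.asmatrix([[4, 0, 0, 3],
                       [0, 2, 0, 1],
                       [5, 0, 1, 4]])

user = 0
# A 1 x n row slice is still 2-D, so take element [1] (the column indices).
unrated = np.nonzero(ratings[user, :].A == 0)[1]

# An m x 1 column slice is also 2-D, so take element [0] (the row indices).
overlap = np.nonzero(np.logical_and(ratings[:, 0].A > 0,
                                    ratings[:, 3].A > 0))[0]

print(unrated)   # items user 0 has not rated: [1 2]
print(overlap)   # users who rated both item 0 and item 3: [0 2]
```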
# encoding=utf8
from operator import itemgetter

import numpy as np
from numpy import linalg as la


def loadExData():
    return [[4, 4, 0, 2, 2],
            [4, 0, 0, 3, 3],
            [4, 0, 0, 1, 1],
            [1, 1, 1, 2, 0],
            [2, 2, 2, 0, 0],
            [1, 1, 1, 0, 0],
            [5, 5, 5, 0, 0]]


def loadExData2():
    return [[0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5],
            [0, 0, 0, 3, 0, 4, 0, 0, 0, 0, 3],
            [0, 0, 0, 0, 4, 0, 0, 1, 0, 4, 0],
            [3, 3, 4, 0, 0, 0, 0, 2, 2, 0, 0],
            [5, 4, 5, 0, 0, 0, 0, 5, 5, 0, 0],
            [0, 0, 0, 0, 5, 0, 1, 0, 0, 5, 0],
            [4, 3, 4, 0, 0, 0, 0, 5, 5, 0, 1],
            [0, 0, 0, 4, 0, 4, 0, 0, 0, 0, 4],
            [0, 0, 0, 2, 0, 2, 5, 0, 0, 1, 2],
            [0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0],
            [1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0]]


def ReconstructSigma(Sigma):
    # Build a 3x3 diagonal matrix from the three largest singular values.
    return np.asmatrix([[Sigma[0], 0, 0],
                        [0, Sigma[1], 0],
                        [0, 0, Sigma[2]]])


def ReconstructData(U, Sigma, VT):
    # Rank-3 reconstruction; Sigma is the 3x3 matrix from ReconstructSigma.
    return U[:, :3] * Sigma * VT[:3, :]


# Similarity functions. Inputs are column vectors (one column per item).
def eulidSim(inA, inB):
    # Euclidean distance mapped into (0, 1].
    return 1.0 / (1.0 + la.norm(inA - inB))


def pearsSim(inA, inB):
    if len(inA) < 3:
        return 1.0
    # corrcoef returns a 2x2 matrix; take row 0, column 1.
    # The correlation is rescaled from [-1, 1] to [0, 1].
    return 0.5 + 0.5 * np.corrcoef(inA, inB, rowvar=0)[0][1]


def cosSim(inA, inB):
    # Cosine similarity rescaled from [-1, 1] to [0, 1].
    num = (inA.T * inB).item()
    denom = la.norm(inA) * la.norm(inB)
    return 0.5 + 0.5 * (num / denom)


def standEst(dataMat, user, simMeas, item):
    """Estimate the rating that `user` would give `item`.

    Uses item-based similarity: the column for `item` is compared with each
    column the user has already rated, using only the rows where both items
    have ratings. The estimate is the similarity-weighted average of the
    user's existing ratings.
    """
    n = np.shape(dataMat)[1]
    simTotal = 0.0
    ratSimTotal = 0.0
    for j in range(n):
        userRating = dataMat[user, j]
        if userRating == 0:
            continue
        # Row indices of users who rated both `item` and `j`. nonzero returns
        # a tuple of index arrays: row indices first, column indices second.
        overLap = np.nonzero(np.logical_and(dataMat[:, item].A > 0,
                                            dataMat[:, j].A > 0))[0]
        if len(overLap) == 0:
            similarity = 0
        else:
            similarity = simMeas(dataMat[overLap, item], dataMat[overLap, j])
        print('the %d and %d similarity is : %f' % (item, j, similarity))
        simTotal += similarity
        ratSimTotal += similarity * userRating
    if simTotal == 0:
        return 0
    return ratSimTotal / simTotal


def recommend(dataMat, user, N=3, simMeas=cosSim, estMethod=standEst):
    # .A converts the matrix to an ndarray; [1] selects the column indices
    # of the items this user has not rated.
    unratedItems = np.nonzero(dataMat[user, :].A == 0)[1]
    if len(unratedItems) == 0:
        return 'you rated everything'
    itemScores = []
    for item in unratedItems:
        estimatedScore = estMethod(dataMat, user, simMeas, item)
        itemScores.append((item, estimatedScore))
    # Highest estimated score first; keep the top N.
    return sorted(itemScores, key=itemgetter(1), reverse=True)[:N]


def svdEst(dataMat, user, simMeas, item):
    n = np.shape(dataMat)[1]
    simTotal = 0.0
    ratSimTotal = 0.0
    U, Sigma, VT = la.svd(dataMat)
    Sig4 = np.asmatrix(np.eye(4) * Sigma[:4])  # keep the four largest singular values
    # Map every item into the 4-dimensional space (one row per item).
    xformedItems = dataMat.T * U[:, :4] * Sig4.I
    print(xformedItems)
    for j in range(n):
        userRating = dataMat[user, j]
        if userRating == 0 or j == item:
            continue
        similarity = simMeas(xformedItems[item, :].T, xformedItems[j, :].T)
        print('the %d and %d similarity is: %f' % (item, j, similarity))
        simTotal += similarity
        ratSimTotal += similarity * userRating
    if simTotal == 0:
        return 0
    return ratSimTotal / simTotal


if __name__ == "__main__":
    # Intermediate checks (uncomment to run):
    # Data = loadExData()
    # MatData = np.asmatrix(Data)
    # U, Sigma, VT = np.linalg.svd(Data)
    # print(Sigma)
    # Sigma = ReconstructSigma(Sigma)
    # print(Sigma)
    # print(ReconstructData(U, Sigma, VT))
    # print(eulidSim(MatData[:, 0], MatData[:, 4]))
    # print(cosSim(MatData[:, 0], MatData[:, 4]))
    # print(pearsSim(MatData[:, 0], MatData[:, 0]))
    Data = loadExData()
    dataMat = np.asmatrix(Data)
    dataMat2 = np.asmatrix(loadExData2())
    print(dataMat2)
    print(recommend(dataMat2, 1, estMethod=svdEst))
For implementation details, see Machine Learning in Action.
Date: 2024-11-05 10:29:32