Some of my own notes; there may be small mistakes, feedback welcome ^_^
Finding frequent itemsets
Main idea
Python code
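def loadDataSet():
    """Return a small sample set of transactions."""
    return [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]

# createC1(dataSet) builds all candidate 1-itemsets (the first level).
def createC1(dataSet):
    C1 = []
    for transaction in dataSet:
        for item in transaction:
            if [item] not in C1:
                C1.append([item])
    C1.sort()
    # frozenset is hashable, so the itemsets can be used as dict keys later.
    return list(map(frozenset, C1))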
# scanD scans the transactions D and decides which candidate itemsets in Ck are frequent.
def scanD(D, Ck, minSupport):
    ssCnt = {}
    for tid in D:
        for can in Ck:
            if can.issubset(tid):
                # Count how many transactions contain this candidate.
                ssCnt[can] = ssCnt.get(can, 0) + 1
    numItems = float(len(D))
    retList = []
    supportData = {}
    for key in ssCnt:
        support = ssCnt[key] / numItems
        if support >= minSupport:
            retList.insert(0, key)
        # Record the support of every counted candidate, frequent or not.
        supportData[key] = support
    return retList, supportData
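# Build the next level by merging itemsets from the previous level.
# For example, from {1,2} {3,4} {1,3} we can get {1,2,3} (by merging {1,2} and {1,3}).
# Note that the candidates produced this way are not necessarily frequent; they still need k-2 further checks.
def aprioriGen(Lk, k):
    retList = []
    lenLk = len(Lk)
    for i in range(lenLk):
        for j in range(i + 1, lenLk):
            # Sort before slicing: set iteration order is not guaranteed,
            # so the first k-2 items are only comparable once sorted.
            L1 = sorted(Lk[i])[:k - 2]
            L2 = sorted(Lk[j])[:k - 2]
            if L1 == L2:
                retList.append(Lk[i] | Lk[j])
    return retList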
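# Main function: given the data, return the frequent itemsets.
def apriori(dataSet, minSupport=0.5):
    C1 = createC1(dataSet)
    # Materialize as a list: D is scanned once per level.
    D = list(map(set, dataSet))
    L1, supportData = scanD(D, C1, minSupport)
    L = [L1]
    k = 2
    # Stop as soon as a level yields no frequent itemsets.
    while len(L[k - 2]) > 0:
        Ck = aprioriGen(L[k - 2], k)
        Lk, supK = scanD(D, Ck, minSupport)
        supportData.update(supK)
        L.append(Lk)
        k += 1
    return L, supportData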
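To see what the level-wise search returns on the sample data, here is a compact, self-contained sketch. It is not the listing above: the function name `frequent_itemsets` is made up for illustration, and it uses a simpler join (any two frequent k-itemsets whose union has k+1 items) instead of the prefix comparison in aprioriGen.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support=0.5):
    """Level-wise search; returns {itemset: support} for all frequent itemsets."""
    rows = [frozenset(t) for t in transactions]
    n = float(len(rows))
    items = sorted({item for row in rows for item in row})
    result = {}
    current = [frozenset([item]) for item in items]  # level-1 candidates
    k = 1
    while current:
        # Count the support of each candidate and keep the frequent ones.
        level = {}
        for cand in current:
            support = sum(1 for row in rows if cand <= row) / n
            if support >= min_support:
                level[cand] = support
        result.update(level)
        # Simplified join (an assumption of this sketch): unions of two
        # frequent k-itemsets that contain exactly k+1 items.
        k += 1
        current = list({a | b for a, b in combinations(level, 2)
                        if len(a | b) == k})
    return result

data = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
freq = frequent_itemsets(data, min_support=0.5)
for itemset in sorted(freq, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), freq[itemset])
```

With minSupport 0.5 this finds nine frequent itemsets; item 4 drops out at the first level, and the largest frequent itemset is {2, 3, 5} with support 0.5.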
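Deriving association rules from frequent itemsets
Main idea
If you look only at the right-hand side of the rules, the search works just like the frequent-itemset search above.
A rule defined on a frequent itemset must use all of its elements, so once the right-hand side is fixed, the left-hand side = frequent itemset - right-hand side. Below, H denotes the candidate right-hand sides.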
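A small sketch of that idea, assuming the support values below (worked out by hand from the loadDataSet sample). The helper name `rules_for` is hypothetical, and it simply tries every possible right-hand side instead of merging them level by level.

```python
from itertools import combinations

# Supports taken from the sample data [[1,3,4],[2,3,5],[1,2,3,5],[2,5]].
support = {
    frozenset([2]): 0.75, frozenset([3]): 0.75, frozenset([5]): 0.75,
    frozenset([2, 3]): 0.5, frozenset([2, 5]): 0.75, frozenset([3, 5]): 0.5,
    frozenset([2, 3, 5]): 0.5,
}

def rules_for(freq_set, support, min_conf=0.7):
    """Try every non-empty proper subset of freq_set as the right-hand side."""
    rules = []
    for size in range(1, len(freq_set)):
        for right in combinations(sorted(freq_set), size):
            right = frozenset(right)
            left = freq_set - right  # left-hand side = itemset - right-hand side
            conf = support[freq_set] / support[left]
            if conf >= min_conf:
                rules.append((left, right, conf))
    return rules

rules = rules_for(frozenset([2, 3, 5]), support)
for left, right, conf in rules:
    print(sorted(left), '-->', sorted(right), 'conf:', conf)
```

For {2, 3, 5} only two rules reach confidence 0.7: {3, 5} --> {2} and {2, 3} --> {5}, both with confidence 1.0; the other four right-hand sides give 0.5 / 0.75.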
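Python code
# Main function. Initially the right-hand side of each rule, i.e. H, holds a single item.
def generateRules(L, supportData, minConf=0.7):
    bigRuleList = []
    # L[0] holds 1-itemsets, which cannot form rules, so start at index 1.
    for i in range(1, len(L)):
        for freqSet in L[i]:
            H1 = [frozenset([item]) for item in freqSet]
            if i > 1:
                # Note: for itemsets of 3+ items this goes straight to 2-item
                # right-hand sides; single-item right-hand sides are not tested.
                rulesFromConseq(freqSet, H1, supportData, bigRuleList, minConf)
            else:
                calcConf(freqSet, H1, supportData, bigRuleList, minConf)
    return bigRuleList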
# Check whether each rule's confidence meets the threshold. Returns prunedH, the right-hand sides that passed; brl collects every rule that qualifies.
def calcConf(freqSet, H, supportData, brl, minConf=0.7):
    prunedH = []
    for conseq in H:
        # confidence(A -> B) = support(A | B) / support(A)
        conf = supportData[freqSet] / supportData[freqSet - conseq]
        if conf >= minConf:
            print(freqSet - conseq, '-->', conseq, 'conf:', conf)
            brl.append((freqSet - conseq, conseq, conf))
            prunedH.append(conseq)
    return prunedH
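# As with frequent itemsets, merge the right-hand sides in H and generate new rules from them.
def rulesFromConseq(freqSet, H, supportData, brl, minConf=0.7):
    m = len(H[0])
    # Only merge if the itemset is large enough to leave a non-empty left-hand side.
    if len(freqSet) > (m + 1):
        Hmp1 = aprioriGen(H, m + 1)
        Hmp1 = calcConf(freqSet, Hmp1, supportData, brl, minConf)
        # At least two surviving right-hand sides are needed to merge further.
        if len(Hmp1) > 1:
            rulesFromConseq(freqSet, Hmp1, supportData, brl, minConf)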
Notes
apriori
Quoted from Henry
At each level k, you have k-item sets which are frequent (have sufficient support).
At the next level, the (k+1)-item sets you need to consider must have the property that each of their subsets must be frequent (have sufficient support). This is the apriori property: any subset of a frequent itemset must be frequent.
So if you know at level 2 that the sets {1,2}, {1,3}, {1,5} and {3,5} are the only sets with sufficient support, then at level 3 you join these with each other to produce {1,2,3}, {1,2,5}, {1,3,5} and {2,3,5} but you need only consider {1,3,5} further: the others each have subsets with insufficient support (such as {2,3} or {2,5}).
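The pruning step in the quote can be checked with a few lines. This is only a sketch: it enumerates every 3-item combination of the items that occur at level 2 (an assumption made here to reproduce the four candidates named in the quote), and `all_subsets_frequent` is a made-up helper name.

```python
from itertools import combinations

# The level-2 itemsets from the quoted example.
frequent_2 = {frozenset(s) for s in [(1, 2), (1, 3), (1, 5), (3, 5)]}
items = sorted(set().union(*frequent_2))  # [1, 2, 3, 5]

def all_subsets_frequent(candidate, frequent_prev):
    """Apriori property: every (k-1)-subset of a k-candidate must be frequent."""
    k = len(candidate)
    return all(frozenset(sub) in frequent_prev
               for sub in combinations(candidate, k - 1))

candidates = [frozenset(c) for c in combinations(items, 3)]
survivors = [c for c in candidates if all_subsets_frequent(c, frequent_2)]
print([sorted(c) for c in survivors])
```

Only {1, 3, 5} survives; {1, 2, 3} and {2, 3, 5} fail because {2, 3} is missing, and {1, 2, 5} fails because {2, 5} is missing.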
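Maximal frequent itemset
A frequent itemset none of whose proper supersets is frequent.
Closed frequent itemset
A frequent itemset all of whose proper supersets have a smaller support count.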
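The two definitions can be compared on the loadDataSet sample. The sketch below assumes the support counts listed in `support_count` (counted by hand for a minimum count of 2); the function names are made up for illustration. Checking only frequent supersets is enough for closedness, because a superset with the same count as a frequent itemset is itself frequent.

```python
# Support counts of the frequent itemsets in [[1,3,4],[2,3,5],[1,2,3,5],[2,5]].
support_count = {
    frozenset([1]): 2, frozenset([2]): 3, frozenset([3]): 3, frozenset([5]): 3,
    frozenset([1, 3]): 2, frozenset([2, 3]): 2, frozenset([2, 5]): 3,
    frozenset([3, 5]): 2, frozenset([2, 3, 5]): 2,
}

def maximal_itemsets(freq):
    """Frequent itemsets with no frequent proper superset."""
    return [s for s in freq if not any(s < t for t in freq)]

def closed_itemsets(freq):
    """Frequent itemsets with no proper superset of equal support count."""
    return [s for s in freq
            if not any(s < t and freq[t] == freq[s] for t in freq)]

print('maximal:', [sorted(s) for s in maximal_itemsets(support_count)])
print('closed: ', [sorted(s) for s in closed_itemsets(support_count)])
```

Here the maximal itemsets are {1, 3} and {2, 3, 5}, while the closed ones are {3}, {1, 3}, {2, 5} and {2, 3, 5}: every maximal itemset is closed, but not the other way round.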
Exercises
1
2
(a) s({e}) = 0.8 s({b,d}) = 0.2 s({b,d,e}) = 0.2
3
(a) c(∅ → A) = s(A)
(b) c1>c2,c2<c3 -> c1>=c2,c2 <= c3
(c) "the rules have the same confidence" -> "the same support"
That is, for each rule left -> right, the support of {left, right} is the same
6
(a) 3^6 - 2^6 × 2 + 1 = 602
(b) 4
(c) 5+C(4,3)+1+C(4,3) -> C(6,3)
(d) butter, bread
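The count in 6 (a) can be double-checked by brute force. In a rule over d items, each item sits on the left-hand side, on the right-hand side, or in neither, and both sides must be non-empty, which gives 3^d - 2^(d+1) + 1. The sketch below (function name `count_rules` is made up) enumerates all assignments for d = 6.

```python
from itertools import product

def count_rules(d):
    """Count rules over d items with non-empty, disjoint left and right sides."""
    count = 0
    # Each item is placed on the Left, on the Right, or in Neither.
    for assignment in product("LRN", repeat=d):
        if "L" in assignment and "R" in assignment:
            count += 1
    return count

d = 6
total = count_rules(d)
print(total, 3**d - 2**(d + 1) + 1)
```

Both numbers come out as 602.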
7
(b) {1,2,3,4},{1,2,3,5},{1,2,4,5},{1,3,4,5},{2,3,4,5}
(c) {1,2,3,4}, {1,2,3,5}  // {1,4,5} and {2,4,5} are not frequent
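The answers to 7 (b) and (c) can be reproduced with a prefix join followed by subset pruning. Note that the list of frequent 3-itemsets below is an assumption: it is not given in these notes and is filled in here only because it yields exactly the candidates listed above. The helper names `join` and `prune` are made up.

```python
from itertools import combinations

# Assumed input (not stated in these notes): the frequent 3-itemsets.
frequent_3 = [(1, 2, 3), (1, 2, 4), (1, 2, 5), (1, 3, 4),
              (1, 3, 5), (2, 3, 4), (2, 3, 5), (3, 4, 5)]

def join(frequent_prev):
    """Merge two sorted (k-1)-itemsets whose first k-2 items agree."""
    prev = sorted(frequent_prev)
    out = []
    for a, b in combinations(prev, 2):
        if a[:-1] == b[:-1]:
            out.append(a + (b[-1],))
    return out

def prune(candidates, frequent_prev):
    """Keep candidates whose (k-1)-subsets are all frequent."""
    known = set(frequent_prev)
    return [c for c in candidates
            if all(sub in known for sub in combinations(c, len(c) - 1))]

candidates_4 = join(frequent_3)
survivors_4 = prune(candidates_4, frequent_3)
print(candidates_4)
print(survivors_4)
```

The join gives the five candidates of (b); pruning leaves {1,2,3,4} and {1,2,3,5}, since the other three each contain {1,4,5} or {2,4,5}.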
8
- When drawing the diagram, note that N has to be drawn downward not only below an I node but also below an N node.
- F/total
- I/total