Data Mining Study Notes: Association Rule Mining

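I. Concepts

　　Association rule mining: discovering interesting, frequently occurring patterns, associations, and correlations among the itemsets in large volumes of data, such as transactional databases and relational databases.

　　Interestingness measures for association rules: support and confidence

　　k-itemset: an itemset containing k items

　　Frequency of an itemset: the number of transactions that contain the itemset (its support count)

　　Frequent itemset: an itemset whose frequency is at least the minimum support × the total number of transactions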
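
The definitions above can be illustrated with a minimal Python sketch. The toy transactions and the function names (`support_count`, `is_frequent`, `confidence`) are assumptions made for illustration only; the confidence of a rule A ⇒ B is taken here as count(A ∪ B) / count(A), the usual definition.

```python
# Hypothetical toy transaction database; each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]


def support_count(itemset, transactions):
    """Frequency of an itemset: number of transactions that contain it."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t)


def is_frequent(itemset, transactions, min_support):
    """True if the count reaches min_support * (number of transactions)."""
    return support_count(itemset, transactions) >= min_support * len(transactions)


def confidence(antecedent, consequent, transactions):
    """confidence(A => B) = count(A union B) / count(A)."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    return support_count(both, transactions) / support_count(a, transactions)


print(support_count({"diaper", "beer"}, transactions))      # 3
print(is_frequent({"diaper", "beer"}, transactions, 0.5))   # True
print(confidence({"diaper"}, {"beer"}, transactions))       # 0.75
```

Here {diaper, beer} occurs in 3 of the 5 transactions, so it is frequent at a minimum support of 0.5, and the rule diaper ⇒ beer has confidence 3/4.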

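II. Classification of Association Rule Mining

　　1. By the type of values handled in the rule: Boolean association rules, quantitative association rules

　　2. By the dimensions of data involved in the rule: single-dimensional association rules, multidimensional association rules

　　3. By the levels of abstraction involved in the rule: single-level association rules, multilevel association rules

　　4. By various extensions to association mining: mining maximal frequent patterns, mining closed frequent itemsets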

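III. The Association Rule Mining Process in Large Databases

　　1. Find all frequent itemsets; most of the computation is concentrated in this step

　　2. Generate strong association rules from the frequent itemsets, i.e. rules that satisfy both the minimum support and the minimum confidence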
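
The second step can be sketched as follows, assuming the first step has already produced a mapping from each frequent itemset to its support count. The function name `generate_rules` and the sample counts are hypothetical; this is one simple way to enumerate the rules, not an optimized method.

```python
from itertools import combinations


def generate_rules(support_counts, min_conf):
    """Return (antecedent, consequent, confidence) for every strong rule.

    support_counts maps each frequent itemset (a frozenset) to its count.
    Every nonempty subset of a frequent itemset is itself frequent, so its
    count is expected to be present in the same mapping.
    """
    rules = []
    for itemset, count in support_counts.items():
        if len(itemset) < 2:
            continue  # a rule needs a nonempty antecedent and consequent
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                a = frozenset(antecedent)
                conf = count / support_counts[a]
                if conf >= min_conf:
                    rules.append((a, itemset - a, conf))
    return rules


# Hypothetical output of step 1 (frequent itemsets with support counts).
frequent = {
    frozenset({"beer"}): 3,
    frozenset({"diaper"}): 4,
    frozenset({"beer", "diaper"}): 3,
}
rules = generate_rules(frequent, min_conf=0.8)
print(rules)
```

With these counts only beer ⇒ diaper (confidence 3/3) passes the 0.8 threshold; diaper ⇒ beer (confidence 3/4) does not. Every rule already satisfies minimum support because it is built from a frequent itemset.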

IV. An Algorithm for Finding Frequent Itemsets: the Apriori Algorithm

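The Apriori algorithm uses prior knowledge of frequent itemsets and an iterative, level-wise search, in which k-itemsets are used to explore (k+1)-itemsets, to find all frequent itemsets in the data set.

To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used to reduce the search space.

Apriori property: All nonempty subsets of a frequent itemset must also be frequent. Equivalently, if an itemset is not frequent, none of its supersets can be frequent.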

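Steps of the Apriori algorithm:

1. The join step: to compute L_k, L_{k-1} is joined with itself to generate a set of candidate k-itemsets, denoted C_k.

Two members l₁ and l₂ of L_{k-1}, each with its items kept in sorted order, can be joined if their first k-2 items are identical and l₁[k-1] < l₂[k-1].

The frequent itemsets in C_k form L_k.

2. The prune step: use the Apriori property to reduce the computation. A candidate in C_k that has a (k-1)-subset not in L_{k-1} cannot be frequent and is removed before counting.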

Algorithm: Apriori. Find frequent itemsets using an iterative level-wise approach based on candidate generation.

Input:

D, a database of transactions;

min_sup, the minimum support count threshold.

Output: L, frequent itemsets in D.

Method:

　　L₁ = find_frequent_1-itemsets(D);
　　for (k = 2; L_{k-1} ≠ ∅; k++) {
　　　　C_k = apriori_gen(L_{k-1});
　　　　for each transaction t ∈ D { // scan D for counts
　　　　　　C_t = subset(C_k, t); // the candidates contained in t
　　　　　　for each candidate c ∈ C_t
　　　　　　　　c.count++;
　　　　}
　　　　L_k = {c ∈ C_k | c.count ≥ min_sup}
　　}
　　return L = ∪_k L_k;

procedure apriori_gen(L_{k-1}: frequent (k-1)-itemsets)
　　for each itemset l₁ ∈ L_{k-1}
　　　　for each itemset l₂ ∈ L_{k-1}
　　　　　　if (l₁[1] = l₂[1] & l₁[2] = l₂[2] & ... & l₁[k-2] = l₂[k-2] & l₁[k-1] < l₂[k-1]) then {
　　　　　　　　c = l₁ join l₂; // join step: generate candidates
　　　　　　　　if has_infrequent_subset(c, L_{k-1}) then
　　　　　　　　　　delete c; // prune step: remove unfruitful candidate
　　　　　　　　else add c to C_k;
　　　　　　}
　　return C_k;

procedure has_infrequent_subset(c: candidate k-itemset; L_{k-1}: frequent (k-1)-itemsets) // use prior knowledge
　　for each (k-1)-subset s of c
　　　　if s ∉ L_{k-1} then
　　　　　　return TRUE;
　　return FALSE;
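
The pseudocode above can be turned into a small runnable Python sketch. Itemsets are represented as sorted tuples so that the join condition can be checked by comparing prefixes; this representation, the helper layout, and the nine-transaction toy database are assumptions made for illustration, not part of the algorithm's definition.

```python
from itertools import combinations


def apriori_gen(prev_frequent):
    """Join L_{k-1} with itself, then prune with the Apriori property."""
    prev = sorted(prev_frequent)      # itemsets are sorted tuples
    prev_set = set(prev)
    candidates = set()
    for i, l1 in enumerate(prev):
        for l2 in prev[i + 1:]:
            # Join step: first k-2 items equal, last item of l1 < last of l2.
            if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]:
                c = l1 + (l2[-1],)
                # Prune step: every (k-1)-subset of c must be frequent.
                if all(s in prev_set for s in combinations(c, len(c) - 1)):
                    candidates.add(c)
    return candidates


def apriori(transactions, min_sup):
    """Return {itemset (sorted tuple): support count} for frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:            # first scan: count single items
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    current = {c: n for c, n in counts.items() if n >= min_sup}
    frequent = dict(current)
    while current:
        candidates = apriori_gen(current)
        counts = {c: 0 for c in candidates}
        for t in transactions:        # one database scan per level
            for c in candidates:
                if t.issuperset(c):
                    counts[c] += 1
        current = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(current)
    return frequent


# Hypothetical toy database; min_sup is an absolute support count.
db = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]
result = apriori(db, min_sup=2)
for itemset in sorted(result, key=lambda s: (len(s), s)):
    print(itemset, result[itemset])
```

On this data the sketch finds five frequent 1-itemsets, six frequent 2-itemsets, and two frequent 3-itemsets, (I1, I2, I3) and (I1, I2, I5). The candidate (I1, I3, I5) is never counted, because its subset (I3, I5) is not frequent and the prune step removes it.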

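Drawbacks of the Apriori algorithm:

　　1. It scans the data multiple times;

　　2. It generates a large number of candidate itemsets;

　　3. Computing the support of the candidates is expensive.

Ways to address these:

　　1. Reduce the number of scans;

　　2. Shrink the candidate set;

　　3. Improve the support counting method.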

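Method 1: Hash-based technique

Each itemset is mapped by a hash function into a bucket of a hash table, and that bucket's count is incremented. Comparing the bucket counts with the minimum support count eliminates some itemsets early: an itemset that falls in a bucket whose count is below the threshold cannot be frequent.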
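
A minimal sketch of this idea for 2-itemsets, assuming integer item ids, a table of 7 buckets, and the hash function (10x + y) mod 7; all three are hypothetical choices made for illustration.

```python
from itertools import combinations

N_BUCKETS = 7  # hypothetical table size


def bucket_of(pair):
    """Hypothetical hash function for a sorted pair of integer item ids."""
    x, y = pair
    return (10 * x + y) % N_BUCKETS


def count_buckets(transactions):
    """Hash every 2-itemset of every transaction and count per bucket."""
    buckets = [0] * N_BUCKETS
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[bucket_of(pair)] += 1
    return buckets


def may_be_frequent(pair, buckets, min_sup):
    """A pair in a bucket whose total is below min_sup cannot be frequent."""
    return buckets[bucket_of(pair)] >= min_sup


transactions = [
    {1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
    {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3},
]
buckets = count_buckets(transactions)
print(buckets)                                # [2, 2, 4, 2, 2, 4, 4]
print(may_be_frequent((1, 4), buckets, 3))    # False
print(may_be_frequent((1, 2), buckets, 3))    # True
```

Because several itemsets can share a bucket, a bucket count is only an upper bound on the count of each itemset in it. Passing the filter therefore does not prove that an itemset is frequent; it only keeps it as a candidate for exact counting.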

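Method 2: Transaction reduction

A transaction that contains no frequent k-itemsets cannot contain any frequent (k+1)-itemsets. Such transactions can therefore be marked or removed from further consideration.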
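
As a rough sketch, the reduction amounts to a filter applied to the database between two levels. The function name `reduce_transactions` and the sample data are hypothetical.

```python
def reduce_transactions(transactions, frequent_k):
    """Keep only transactions that contain at least one frequent k-itemset."""
    return [
        t for t in transactions
        if any(t.issuperset(itemset) for itemset in frequent_k)
    ]


# Hypothetical data: transactions and the frequent 2-itemsets found so far.
transactions = [{1, 2, 5}, {2, 4}, {3, 4}, {1, 2, 3}]
frequent_2 = [frozenset({1, 2}), frozenset({2, 3})]

remaining = reduce_transactions(transactions, frequent_2)
print(remaining)
```

Here {2, 4} and {3, 4} contain neither frequent 2-itemset, so they are skipped in the scans for 3-itemsets and beyond.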

Method 3: Partitioning

Method 4: Sampling

Method 5: Dynamic itemset counting

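The main cost of the Apriori algorithm is generating a large number of candidate itemsets; the FP-tree algorithm can discover frequent patterns without generating candidates.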

Time: 2024-08-09 22:00:01
