Efficient Pattern Mining Methods

@(Pattern Discovery in Data Mining)

This article introduces several efficient pattern-mining algorithms. It takes the Apriori idea as its framework and focuses on the FP-Growth algorithm.

The Downward Closure Property of Frequent Patterns

  1. Property

    The downward closure (also called “Apriori”) property of frequent patterns:

    • If {beer, diaper, nuts} is frequent, so is {beer, diaper}
    • Every transaction containing {beer, diaper, nuts} also contains {beer, diaper}
    • Apriori: Any subset of a frequent itemset must be frequent
  2. Efficient mining methodology
    • If any subset of an itemset S is infrequent, then S has no chance of being frequent, so there is no need to consider S at all. (This is an efficient way to prune.)

  3. Principle

    Apriori pruning principle: If there is any itemset which is infrequent, its superset should not even be generated! (Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’ 94)

  4. Scalable mining Methods
    • Level-wise, join-based approach: Apriori (Agrawal & Srikant @VLDB'94)
    • Vertical data format approach: Eclat (Zaki, Parthasarathy, Ogihara, Li @KDD’97)
    • Frequent pattern projection and growth: FPgrowth (Han, Pei, Yin @SIGMOD’00)
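The downward-closure property above can be checked directly on a toy transaction database (the items, counts, and the `support` helper here are purely illustrative):

```python
from itertools import combinations

# Toy transaction database; each transaction is a set of items.
transactions = [
    {"beer", "diaper", "nuts"},
    {"beer", "diaper"},
    {"beer", "nuts"},
    {"diaper", "nuts"},
]

def support(itemset, db):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)

s_abc = support({"beer", "diaper", "nuts"}, transactions)
# Every subset of an itemset is at least as frequent as the itemset itself.
for sub in combinations({"beer", "diaper", "nuts"}, 2):
    assert support(set(sub), transactions) >= s_abc
```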

The Apriori Algorithm

  1. Outline of Apriori (level-wise, candidate generation and test)

    • Initially, scan DB once to get frequent 1-itemset
    • Repeat
      • Generate length-(k+1) candidate itemsets from length-k frequent itemsets
      • Test the candidates against DB to find frequent (k+1)-itemsets
      • Set k := k +1
    • Until no frequent or candidate set can be generated
    • Return all the frequent itemsets derived
  2. Pseudo Code
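The pseudo code appears as a figure in the original; a minimal runnable Python sketch of the level-wise outline above (the `apriori` helper and its simplified candidate generation over frequent items, rather than the textbook prefix join, are illustrative):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori: returns {itemset: support} for all frequent itemsets."""
    # Scan the DB once for frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(L)
    k = 1
    while L:
        # Generate length-(k+1) candidates from the items of the length-k
        # frequent sets, pruning any candidate with an infrequent k-subset.
        items = sorted({i for s in L for i in s})
        candidates = [
            frozenset(combo)
            for combo in combinations(items, k + 1)
            if all(frozenset(sub) in L for sub in combinations(combo, k))
        ]
        # Test the candidates against the DB.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        L = {s: n for s, n in counts.items() if n >= min_sup}
        frequent.update(L)
        k += 1
    return frequent
```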

  3. Tricks

    joining & pruning

Here, for the joining results produced in a given iteration, we check whether every (k-1)-subset is itself frequent (i.e., appears in L(k-1)); this is the pruning step shown in the figure above.
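The joining and pruning trick can be sketched as follows (the `join_and_prune` helper is hypothetical; the join merges two k-itemsets that share a (k-1)-prefix, and the prune drops any candidate with an infrequent k-subset):

```python
from itertools import combinations

def join_and_prune(Lk):
    """Generate (k+1)-candidates from a collection of frequent k-itemsets."""
    paths = sorted(tuple(sorted(s)) for s in Lk)
    k = len(paths[0])
    frequent = set(paths)
    candidates = set()
    for i in range(len(paths)):
        for j in range(i + 1, len(paths)):
            a, b = paths[i], paths[j]
            # Join: merge two k-itemsets that share their first k-1 items.
            if a[:-1] == b[:-1]:
                cand = tuple(sorted(set(a) | set(b)))
                # Prune: every k-subset of the candidate must be frequent.
                if all(sub in frequent for sub in combinations(cand, k)):
                    candidates.add(frozenset(cand))
    return candidates
```

For example, {a,b} and {a,c} join to the candidate {a,b,c}, which survives only if {b,c} is also frequent.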

Extensions or Improvements of Apriori

  • Reduce passes of transaction database scans

    • Partitioning (e.g., Savasere, et al., 1995)
    • Dynamic itemset counting (Brin, et al., 1997; Brin is one of Google's co-founders)
  • Shrink the number of candidates
    • Hashing (e.g., DHP: Park, et al., 1995)
    • Pruning by support lower bounding (e.g., Bayardo 1998)
    • Sampling (e.g., Toivonen, 1996)
  • Exploring special data structures
    • Tree projection (Aggarwal, et al., 2001)
    • H-miner (Pei, et al., 2001)
    • Hypercube decomposition (e.g., LCM: Uno, et al., 2004)

Mining Frequent Patterns by Exploring Vertical Data Format
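As an illustration of the vertical-format idea behind Eclat (the `vertical` and `eclat` helper names are assumed for this sketch): each item is mapped to the set of transaction IDs (TIDs) containing it, and the support of an itemset is the size of the intersection of its TID sets.

```python
def vertical(transactions):
    """Vertical data format: map each item to the set of TIDs containing it."""
    tidlists = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists

def eclat(tidlists, min_sup, prefix=frozenset(), out=None):
    """Support of an itemset = size of the intersection of its TID sets."""
    if out is None:
        out = {}
    items = sorted(tidlists)
    for i, item in enumerate(items):
        tids = tidlists[item]
        if len(tids) < min_sup:
            continue
        itemset = prefix | {item}
        out[itemset] = len(tids)
        # Recurse on conditional TID lists: intersect with the remaining items.
        eclat({j: tidlists[j] & tids for j in items[i + 1:]}, min_sup, itemset, out)
    return out
```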

FPGrowth: A Frequent Pattern-Growth Approach

  1. Build the FP-tree, then iteratively generate frequent patterns

    What is an FP-tree, and how is it built?

    • Compute the frequency of each single item
    • Sort the items in each transaction by descending frequency
    • Insert the transactions into a prefix-tree-like structure, the FP-tree, in which each node represents an item
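The three construction steps above can be sketched as follows (the `Node` class and `build_fptree` helper are illustrative; frequency ties are broken alphabetically here):

```python
from collections import Counter

class Node:
    """One FP-tree node: an item, a count, and links to parent and children."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fptree(transactions, min_sup):
    # Step 1: compute the frequency of each single item; drop infrequent ones.
    freq = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in freq.items() if c >= min_sup}
    root = Node(None, None)
    header = {}  # item -> list of that item's nodes (the node-link list)
    for t in transactions:
        # Step 2: sort each transaction by descending frequency.
        items = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        # Step 3: insert along a shared prefix path, as in a prefix tree.
        node = root
        for i in items:
            if i not in node.children:
                child = Node(i, node)
                node.children[i] = child
                header.setdefault(i, []).append(child)
            node = node.children[i]
            node.count += 1
    return root, header, freq
```

The support of an item can then be read off by summing the counts along its node-link list in `header`.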

The resulting tree is shown below:

  2. Generating the frequent itemsets

    Use divide and conquer, applied recursively.

    Procedure (with min_sup = 2, using the suffix e as the example):

    1) Obtain the prefix-path subtree for e.

    2) Count e's occurrences to decide whether e is a frequent item: traverse e's node-link list (the dashed links) and sum the counts of its nodes, giving sup(e) = 3 > 2, so we continue with the steps below.

    3) Since e is frequent, find all frequent itemlists ending in e; that is, split the problem and recurse.

    For this we first need e's conditional FP-tree.

    4) Building the conditional FP-tree:

    The result is easiest to read off the figure.

    Steps:

    1 - Update the support counts in e's prefix-path subtree.

    2 - Remove the nodes containing e.

    3 - Remove infrequent nodes; here c and d satisfy the minimum-support condition, as the counting method described above shows. We now have e's conditional FP-tree.

    5) Using e's conditional FP-tree (CFPT) obtained above, find all frequent itemlists ending in de, ce, and ae (be is not considered, because b has already been removed). This invokes the divide-and-conquer procedure recursively: for de, say, find the prefix-path subtree for de within e's CFPT ... and obtain de's CFPT.

    For example: e → de → ade, e → ce, e → ae.
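The recursive procedure above can be sketched by working directly on conditional pattern bases, i.e. lists of (prefix path, count) pairs, rather than on an explicit tree (the `fpgrowth` helper is illustrative and assumes every path lists its items in the same frequency-descending order):

```python
from collections import Counter

def fpgrowth(pattern_base, min_sup, suffix=frozenset(), out=None):
    """Pattern growth over a conditional pattern base: a list of
    (prefix_path, count) pairs, each path in frequency-descending order."""
    if out is None:
        out = {}
    # Count each item's support in this base (the role of the node links).
    counts = Counter()
    for path, n in pattern_base:
        for item in path:
            counts[item] += n
    for item, sup in counts.items():
        if sup < min_sup:
            continue  # infrequent in this conditional base: drop it
        new_suffix = suffix | {item}
        out[new_suffix] = sup
        # Conditional base for `item`: the part of each path before `item`.
        cond = [(path[:path.index(item)], n)
                for path, n in pattern_base if item in path]
        fpgrowth(cond, min_sup, new_suffix, out)
    return out
```

Feeding it the whole database as a degenerate base (each transaction with count 1) mines all frequent itemsets.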

Discussion: for a single-branch prefix-path subtree, all frequent patterns can be generated in a single step. For example:

* The subtree selected by the red box here is m-conditional; the single-branch tree is {}-f3-c3-a3.

* This is an iterative process: node a produces a second tree {}-f3-c3, and node c produces the tree {}-f3.

Then node c in the second tree produces the final tree {}-f3; node f cannot produce any further trees.

* The first tree is m-conditional and produces the combinations fm, cm, am

* The second tree is am-conditional and produces fam, cam

* The third tree is cm-conditional and produces fcm

* The final tree produces fcam

* Taking the union, we obtain fm, cm, am, fam, fcm, cam, fcam.
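For a single-branch tree, the enumeration above amounts to taking every non-empty subset of the path and attaching the suffix; a sketch (the helper name `single_path_patterns` is illustrative):

```python
from itertools import combinations

def single_path_patterns(path, suffix):
    """Single-branch case: every non-empty subset of the path, combined with
    the suffix, is frequent; its support is the minimum count in the subset."""
    patterns = {}
    for r in range(1, len(path) + 1):
        for combo in combinations(path, r):
            items = frozenset(i for i, _ in combo) | suffix
            patterns[items] = min(c for _, c in combo)
    return patterns

# m's conditional tree is the single branch {}-f:3-c:3-a:3
pats = single_path_patterns([("f", 3), ("c", 3), ("a", 3)], frozenset({"m"}))
# yields the seven patterns fm, cm, am, fam, fcm, cam, fcam, each with support 3
```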

Course exercise:

Here, Parallel Projection is space-intensive because it splits the work according to the different X-conditional bases that need to be computed, but it is faster; Partition instead splits by tree branch, which is partitioning in the true sense.

Mining Closed Patterns

Date: 2024-08-13 22:00:49
