Mining Diverse Patterns

@(Pattern Discovery in Data Mining)

  • Mining Diverse Patterns

      • Mining Multi-level Association Rules
      • Mining Multi-dimensional Associations
      • Mining Quantitative Associations
      • Mining Negative Correlations
      • Mining Compressed Patterns
      • Mining Colossal Patterns

Mining Multi-level Association Rules

The intuition for setting hierarchical min_sup: level-reduced min-support (items at lower levels of the concept hierarchy are expected to have lower support)

Efficient mining: Shared multi-level mining (Use the lowest min-support to pass down the set of candidates)
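A minimal sketch of the two ideas above, on made-up data (the transactions, the concept hierarchy `parent`, and the per-level thresholds are all illustrative): each level gets its own min-support, but candidates are mined once against the lowest threshold and then filtered per level.

```python
transactions = [
    {"2% milk", "wheat bread"},
    {"skim milk", "white bread"},
    {"2% milk", "white bread"},
    {"skim milk", "wheat bread"},
]
# hypothetical concept hierarchy: leaf item -> level-1 ancestor
parent = {"2% milk": "milk", "skim milk": "milk",
          "wheat bread": "bread", "white bread": "bread"}

min_sup = {1: 0.5, 2: 0.25}   # level-reduced: lower level, lower threshold

def support(item, level):
    """Fraction of transactions containing the item (level-1 items count
    any of their children)."""
    if level == 1:
        hits = sum(any(parent[i] == item for i in t) for t in transactions)
    else:
        hits = sum(item in t for t in transactions)
    return hits / len(transactions)

# shared mining pass: one scan against the lowest min-support ...
candidates = {(i, 2) for i in parent} | {(p, 1) for p in parent.values()}
shared = {(i, lv) for (i, lv) in candidates if support(i, lv) >= min_sup[2]}
# ... then each level only filters with its own threshold, never rescans
per_level = {(i, lv) for (i, lv) in shared if support(i, lv) >= min_sup[lv]}
```

The point of the shared pass is that any itemset frequent at some level's threshold is necessarily in the set mined at the lowest threshold, so one candidate set serves all levels.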

Redundancy Filtering at Mining Multi-Level Associations:

* Multi-level association mining may generate many redundant rules

* Redundancy filtering: Some rules may be redundant due to “ancestor” relationships between items

* (Suppose 2% milk accounts for about 1/4 of all milk sold)

1. milk → wheat bread [support = 8%, confidence = 70%]

2. 2% milk → wheat bread [support = 2%, confidence = 72%]

  • A rule is redundant if its support is close to the “expected” value derived from its “ancestor” rule and its confidence is similar to the ancestor's
  • Rule (1) is an ancestor of rule (2): the expected support of rule (2) is 8% × 1/4 = 2%, which matches its actual support, and the confidences are similar, so rule (2) should be pruned
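The redundancy test above can be sketched as follows; the tolerances (`sup_tol`, `conf_tol`) are illustrative choices, not part of the definition.

```python
def is_redundant(anc_sup, anc_conf, child_sup, child_conf, child_share,
                 sup_tol=0.2, conf_tol=0.05):
    """A descendant rule is redundant when its support is close to the value
    expected from its ancestor (ancestor support scaled by the child's share
    of the ancestor item) and its confidence is close to the ancestor's."""
    expected_sup = anc_sup * child_share          # e.g. 8% * 1/4 = 2%
    close_sup = abs(child_sup - expected_sup) <= sup_tol * expected_sup
    close_conf = abs(child_conf - anc_conf) <= conf_tol
    return close_sup and close_conf

# rule (1): milk -> wheat bread [8%, 70%]
# rule (2): 2% milk -> wheat bread [2%, 72%], with 2% milk = 1/4 of milk
print(is_redundant(0.08, 0.70, 0.02, 0.72, child_share=0.25))  # -> True
```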

Customized Min-Supports for Different Kinds of Items

* So far we have used a single min-support threshold for all items and itemsets in each association mining task

* In reality, some items (e.g., diamond, watch, …) are valuable but less frequent

* It is necessary to have customized min-support settings for different kinds of items

* One Method: Use group-based “individualized” min-support

* E.g., {diamond, watch}: 0.05%; {bread, milk}: 5%; …

* How to mine such rules efficiently?

* Existing scalable mining algorithms can be easily extended to cover such cases
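One way such an extension can be sketched (the groups, thresholds, and the rule that an itemset uses the *minimum* threshold among its members' groups are assumptions for illustration; the minimum rule is a common choice because it keeps a downward-closure-style check usable):

```python
group_min_sup = {"luxury": 0.0005, "staple": 0.05}   # hypothetical groups
item_group = {"diamond": "luxury", "watch": "luxury",
              "bread": "staple", "milk": "staple"}

def itemset_min_sup(itemset):
    """Threshold for an itemset: the lowest min-support among the groups of
    its member items, so valuable-but-rare items are not pruned away."""
    return min(group_min_sup[item_group[i]] for i in itemset)

print(itemset_min_sup({"diamond", "milk"}))  # -> 0.0005
print(itemset_min_sup({"bread", "milk"}))    # -> 0.05
```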

Mining Multi-dimensional Associations

  • Single-dimensional rules (e.g., items are all in “product” dimension)

    • buys(X, “milk”) → buys(X, “bread”)
  • Multi-dimensional rules (i.e., items in ≥ 2 dimensions or predicates)
    • Inter-dimension association rules (no repeated predicates)

      • age(X, “18-25”) ∧ occupation(X, “student”) → buys(X, “coke”)
    • Hybrid-dimension association rules (repeated predicates)
      • age(X, “18-25”) ∧ buys(X, “popcorn”) → buys(X, “coke”)
  • Attributes can be categorical or numerical
    • Categorical Attributes (e.g., profession, product: no ordering among values): Data cube for inter-dimension association
    • Quantitative Attributes: numeric, with an implicit ordering among values; handled by discretization, clustering, and gradient approaches
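A small sketch of how multi-dimensional rules reduce to the single-dimensional case: each (predicate, value) pair is treated as one "item", after which ordinary support/confidence computations apply. The records and attribute names below are illustrative.

```python
records = [
    {"age": "18-25", "occupation": "student", "buys": "coke"},
    {"age": "18-25", "occupation": "student", "buys": "coke"},
    {"age": "26-35", "occupation": "engineer", "buys": "milk"},
]

# predicate items: ("age", "18-25") plays the role of age(X, "18-25")
itemsets = [frozenset(r.items()) for r in records]

def sup(pattern):
    """Fraction of records satisfying every predicate in the pattern."""
    return sum(pattern <= t for t in itemsets) / len(itemsets)

# inter-dimension rule: age(X,"18-25") ∧ occupation(X,"student") -> buys(X,"coke")
lhs = {("age", "18-25"), ("occupation", "student")}
rhs = {("buys", "coke")}
confidence = sup(lhs | rhs) / sup(lhs)
print(round(confidence, 2))  # -> 1.0
```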

Mining Quantitative Associations

Mining Negative Correlations

  • Rare Pattern vs. Negative Pattern

  • Defining Negative Correlated Patterns
  • Support-based definition

  • Kulczynski measure-based definition

  • Exercise
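A hedged sketch of the two definitions listed above, with made-up support values. Support-based: A and B are negatively correlated if sup(A ∪ B) < sup(A) · sup(B). Kulczynski-based: they are negatively correlated if Kulc(A, B) = (sup(A ∪ B)/sup(A) + sup(A ∪ B)/sup(B)) / 2 falls below a small threshold ε (the ε value here is an assumption).

```python
def neg_support_based(sup_a, sup_b, sup_ab):
    """A, B negatively correlated if joint support is below independence."""
    return sup_ab < sup_a * sup_b

def kulc(sup_a, sup_b, sup_ab):
    """Kulczynski measure: average of the two conditional probabilities."""
    return 0.5 * (sup_ab / sup_a + sup_ab / sup_b)

def neg_kulc_based(sup_a, sup_b, sup_ab, eps=0.01):
    return kulc(sup_a, sup_b, sup_ab) < eps

# illustrative numbers: A and B each occur in 10% of transactions,
# but together in only 0.05%
print(neg_support_based(0.10, 0.10, 0.0005))  # -> True  (0.0005 < 0.01)
print(neg_kulc_based(0.10, 0.10, 0.0005))     # -> True  (Kulc = 0.005)
```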

Mining Compressed Patterns

Given a table of patterns and their supports:

Why mine compressed patterns? Because mining typically produces far too many scattered patterns, most of which are not individually meaningful.

We can see that P1 and P2 are similar in both itemsets and support, and P1 and P5 have similar itemsets. But how do we compress such similar patterns?

We can also analyze the candidate representations:

* Closed patterns: P1, P2, P3, P4, P5 (no two patterns have identical support, so all five are closed)

  * Emphasizes support too much: no compression at all

* Max-patterns: P3 only

  * Information loss: the supports of its subpatterns are discarded

* Desired output (a good balance): P2, P3, P4

This motivates the following compression methods:

  1. pattern distance measure

    Dist(P1, P2) = 1 − |T(P1) ∩ T(P2)| / |T(P1) ∪ T(P2)|, where T(P) is the set of transactions containing pattern P

    • δ-clustering: For each pattern P, find all patterns which can be expressed by P and whose distance to P is within δ (δ-cover)
    • All patterns in the cluster can be represented by P
    • Method for efficient, direct mining of compressed frequent patterns (e.g., Xin et al., VLDB’05)
  2. Redundancy-Aware Top-k Patterns
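A minimal sketch of the pattern distance and the δ-cover check above. T(P) is represented as a set of transaction ids; the itemsets and transaction ids are made up to echo the P1/P2 example (near-identical transaction sets).

```python
def pattern_dist(t_p1, t_p2):
    """Dist(P1, P2) = 1 - |T(P1) ∩ T(P2)| / |T(P1) ∪ T(P2)|
    (a Jaccard distance on the patterns' transaction sets)."""
    return 1 - len(t_p1 & t_p2) / len(t_p1 | t_p2)

def delta_cover(rep_items, rep_tids, p_items, p_tids, delta):
    """The representative δ-covers P if P's itemset is a subset of the
    representative's (so the representative can express P) and their
    transaction-set distance is within δ."""
    return p_items <= rep_items and pattern_dist(rep_tids, p_tids) <= delta

t1 = set(range(100))   # T(P1): 100 transactions
t2 = set(range(99))    # T(P2): 99 of the same transactions
print(round(pattern_dist(t1, t2), 2))                           # -> 0.01
print(delta_cover({"a", "b", "c"}, t1, {"a", "b"}, t2, 0.05))   # -> True
```

With δ = 0.05, P2's cluster absorbs P1 here, matching the intuition that one representative can stand in for all patterns in its δ-cover.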

Mining Colossal Patterns

Date: 2024-10-23 21:05:07
