Pattern Discovery Basic Concepts

@(Pattern Discovery in Data Mining)[Pattern Discovery]

This article introduces the basic concepts of pattern mining.

Pattern: A set of items, subsequences, or substructures that occur frequently together (or are strongly correlated) in a data set.

Motivation for pattern discovery in data:

* To find what a customer is likely to buy after purchasing one or more goods;

* To find which code segments are likely to contain copy/paste bugs;

* To find what kinds of events may follow the posting of some news;

* What products were often purchased together?

* What are the subsequent purchases after buying an iPad?

* What code segments likely contain copy-and-paste bugs?

* What word sequences likely form phrases in this corpus?

* …

In conclusion, pattern discovery is important for several reasons:

* It finds inherent regularities in a data set

* It is the foundation for many essential data mining tasks

    * Association, correlation, and causality analysis

    * Mining sequential and structural (e.g., sub-graph) patterns

    * Pattern analysis in spatiotemporal, multimedia, time-series, and stream data

    * Classification: discriminative pattern-based analysis

    * Cluster analysis: pattern-based subspace clustering

* It has broad applications

    * Market basket analysis, cross-marketing, catalog design, sale campaign analysis, Web log analysis, biological sequence analysis

TODO: concrete examples of the applications above

Frequent Pattern and Association Rule

Itemset: A set of one or more items

k-itemset: X = {x1, ..., xk}

(absolute) support (count) of X: Frequency or the number of occurrences of an itemset X

(relative) support, s: The fraction of transactions that contains X (i.e., the probability that a transaction contains X)

frequent pattern: An itemset X is frequent if the support of X is no less than a minsup threshold (denoted as σ)
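As a rough illustration of the definitions above, the following Python sketch computes absolute and relative support and enumerates frequent itemsets by brute force. The transactions, the function names, and the 0.5 threshold are all assumptions made up for this example. Exhaustive enumeration like this is only practical for tiny data sets.

```python
from itertools import combinations

# Hypothetical toy transactions, invented purely for illustration.
TRANSACTIONS = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]


def support_count(itemset, transactions):
    """Absolute support: number of transactions that contain every item."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def relative_support(itemset, transactions):
    """Relative support: fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)


def frequent_itemsets(transactions, minsup):
    """Map each itemset whose relative support is >= minsup to that support.

    Brute force: every non-empty combination of items is tested.
    """
    all_items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(all_items) + 1):
        for candidate in combinations(all_items, k):
            s = relative_support(candidate, transactions)
            if s >= minsup:
                result[frozenset(candidate)] = s
    return result
```

With this toy data and a minsup of 0.5, `frequent_itemsets(TRANSACTIONS, 0.5)` keeps {Beer}, {Nuts}, {Diaper}, {Eggs}, and {Beer, Diaper}; {Coffee} and {Milk} each appear in only 2 of 5 transactions and are dropped.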

association rule: X → Y (s, c)

* support s: The probability that a transaction contains X∪Y.

* confidence c: The conditional probability that a transaction containing X also contains Y

* c(X→Y)=sup(X∪Y)/sup(X)

Association rule mining: Find all of the rules, X→Y, with minimum support and confidence.
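To make the rule measures concrete, here is a minimal Python sketch that evaluates a single rule X → Y against a minimum support and a minimum confidence. The transactions, the helper names (`rule_metrics`, `rule_holds`), and the thresholds are assumptions chosen for illustration; a real miner would first generate candidate rules from the frequent itemsets rather than test rules one at a time.

```python
# Hypothetical toy transactions, invented purely for illustration.
TRANSACTIONS = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def rule_metrics(x, y, transactions):
    """Return (support, confidence) of the rule X -> Y.

    support    = fraction of transactions containing X ∪ Y
    confidence = sup(X ∪ Y) / sup(X), computed from counts
    """
    both = count(frozenset(x) | frozenset(y), transactions)
    antecedent = count(x, transactions)
    if antecedent == 0:
        raise ValueError("X never occurs, so confidence is undefined")
    return both / len(transactions), both / antecedent


def rule_holds(x, y, transactions, minsup, minconf):
    """True if the rule X -> Y meets both thresholds."""
    s, c = rule_metrics(x, y, transactions)
    return s >= minsup and c >= minconf
```

On this toy data, Beer → Diaper has support 3/5 and confidence 3/3, while Diaper → Beer has the same support but confidence 3/4, which shows that confidence is not symmetric.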

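Drawback of frequent patterns: there are usually too many of them. Every subset of a frequent itemset is itself frequent, so a single long frequent pattern implies an exponential number of frequent sub-patterns.

So we need a compressed representation.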

Closed Pattern & Max Pattern

Closed patterns: A pattern (itemset) X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X.

* Closed pattern is a lossless compression of frequent patterns

* Reduces the # of patterns but does not lose the support information!

Note: Here lossless means that, given the set of closed frequent patterns, we can not only find the set of max frequent patterns but also recover the set of all frequent patterns and their supports.
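The lossless property can be sketched in Python as follows. The support of any frequent itemset equals the largest support among the closed patterns that contain it, and every frequent itemset is a subset of some closed pattern. The `CLOSED` dictionary and the function names are hypothetical, chosen only for this illustration.

```python
from itertools import combinations

# Hypothetical closed frequent patterns with their absolute supports.
CLOSED = {
    frozenset({"Nuts"}): 3,
    frozenset({"Diaper"}): 4,
    frozenset({"Eggs"}): 3,
    frozenset({"Beer", "Diaper"}): 3,
}


def recover_support(itemset, closed):
    """Support of an itemset, derived from closed patterns alone.

    Returns the maximum support over all closed supersets, or None
    when no closed pattern contains the itemset (it is not frequent).
    """
    x = frozenset(itemset)
    supports = [s for pattern, s in closed.items() if x <= pattern]
    return max(supports) if supports else None


def all_frequent(closed):
    """Rebuild every frequent itemset and its support from closed patterns."""
    result = {}
    for pattern in closed:
        for k in range(1, len(pattern) + 1):
            for subset in combinations(sorted(pattern), k):
                result[frozenset(subset)] = recover_support(subset, closed)
    return result
```

Here {Beer} is not stored as a closed pattern, yet `recover_support({"Beer"}, CLOSED)` yields 3 via its closed superset {Beer, Diaper}. A set of max-patterns alone could not provide that number, which is why max-patterns are a lossy summary.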

Max-patterns: A pattern X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X.

* Max-pattern is a lossy compression!

| Frequent pattern | Support | Closed pattern | Max pattern |
| --- | --- | --- | --- |
| Beer, Nuts, Diaper | 10 | Y | N |
| Beer, Coffee, Diaper, Nuts | 20 | Y | Y |
| Beer, Diaper, Eggs | 30 | N | N |
| Beer, Nuts, Eggs, Milk | 40 | Y | N |
| Beer, Nuts, Diaper, Eggs, Milk | 30 | Y | Y |
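The two definitions can also be checked mechanically. The sketch below filters a dictionary of frequent patterns down to its closed patterns and its max-patterns. The `FREQUENT` dictionary is a small hypothetical example and is not derived from the table above; the function names are likewise assumptions.

```python
# Hypothetical frequent patterns with absolute supports (illustration only).
FREQUENT = {
    frozenset({"Beer"}): 3,
    frozenset({"Nuts"}): 3,
    frozenset({"Diaper"}): 4,
    frozenset({"Eggs"}): 3,
    frozenset({"Beer", "Diaper"}): 3,
}


def closed_patterns(frequent):
    """Keep patterns that have no proper superset with the same support.

    A superset with the same support as a frequent pattern is itself
    frequent, so searching only among frequent patterns is sufficient.
    """
    return {
        x: s
        for x, s in frequent.items()
        if not any(x < y and s == sy for y, sy in frequent.items())
    }


def max_patterns(frequent):
    """Keep patterns that have no frequent proper superset at all."""
    return {
        x: s
        for x, s in frequent.items()
        if not any(x < y for y in frequent)
    }
```

In this example {Beer} is not closed because {Beer, Diaper} has the same support, and {Diaper} is closed but not maximal because {Beer, Diaper} is a frequent superset with a lower support. Every max-pattern is closed, but not the other way around.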
