Pattern Discovery Basic Concepts
@(Pattern Discovery in Data Mining)[Pattern Discovery]
This article introduces the basic concepts of pattern mining.
Pattern: A set of items, subsequences, or substructures that occur
frequently together (or strongly correlated) in a data set.
Motivation to do pattern discovery in data:
* To find what a customer is likely to buy after purchasing one or more goods;
* To find which code segments are likely to contain copy/paste bugs;
* To find what kinds of events may happen after some news is posted;
* What products were often purchased together?
* What are the subsequent purchases after buying an iPad?
* What code segments likely contain copy-and-paste bugs?
* What word sequences likely form phrases in this corpus?
* …
In conclusion, pattern discovery is important because:
* It finds inherent regularities in a data set
* It is the foundation for many essential data mining tasks
    * Association, correlation, and causality analysis
    * Mining sequential, structural (e.g., sub-graph) patterns
    * Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
    * Classification: Discriminative pattern-based analysis
    * Cluster analysis: Pattern-based subspace clustering
* It has broad applications
    * Market basket analysis, cross-marketing, catalog design, sales campaign analysis, Web log analysis, biological sequence analysis
TODO: describe the specific applications listed above
Frequent Pattern and Association Rule
Itemset: A set of one or more items
k-itemset: X = {x1, ..., xk}
(absolute) support (count) of X: Frequency or the number of occurrences of an itemset X
(relative) support, s
: The fraction of transactions that contain X (i.e., the probability that a transaction contains X)
frequent pattern: An itemset X is frequent if the support of X is no less than a minsup threshold (denoted as σ)
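The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the five-transaction database `TRANSACTIONS` and the function names are assumptions chosen for the example.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical toy transaction database, used only for illustration.
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"Beer", "Nuts", "Diaper"}),
    frozenset({"Beer", "Coffee", "Diaper"}),
    frozenset({"Beer", "Diaper", "Eggs"}),
    frozenset({"Nuts", "Eggs", "Milk"}),
    frozenset({"Nuts", "Coffee", "Diaper", "Eggs", "Milk"}),
]


def absolute_support(itemset: Iterable[str],
                     transactions: List[FrozenSet[str]]) -> int:
    """Count the transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def relative_support(itemset: Iterable[str],
                     transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain `itemset`."""
    return absolute_support(itemset, transactions) / len(transactions)


def is_frequent(itemset: Iterable[str],
                transactions: List[FrozenSet[str]],
                minsup: float) -> bool:
    """An itemset is frequent if its relative support is no less than minsup."""
    return relative_support(itemset, transactions) >= minsup


if __name__ == "__main__":
    print(absolute_support({"Beer", "Diaper"}, TRANSACTIONS))
    print(relative_support({"Beer", "Diaper"}, TRANSACTIONS))
    print(is_frequent({"Beer", "Nuts"}, TRANSACTIONS, minsup=0.5))
```

In this toy database {Beer, Diaper} appears in three of the five transactions, so it is frequent at minsup = 0.5, while {Beer, Nuts} appears in only one and is not.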
association rule: X → Y (s, c)
* support s
: The probability that a transaction contains X∪Y.
* confidence c
: The conditional probability that a transaction containing X also contains Y
* c(X→Y)=sup(X∪Y)/sup(X)
Association rule mining: Find all rules X→Y whose support and confidence are no less than the minimum support and minimum confidence thresholds.
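As a rough sketch of how the support and confidence of a rule X→Y could be computed, the snippet below uses a hypothetical five-transaction database. The data and the names `rule_metrics` and `is_strong_rule` are assumptions made for illustration.

```python
from typing import FrozenSet, Iterable, List, Tuple

# Hypothetical toy transaction database, used only for illustration.
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"Beer", "Nuts", "Diaper"}),
    frozenset({"Beer", "Coffee", "Diaper"}),
    frozenset({"Beer", "Diaper", "Eggs"}),
    frozenset({"Nuts", "Eggs", "Milk"}),
    frozenset({"Nuts", "Coffee", "Diaper", "Eggs", "Milk"}),
]


def count(itemset: FrozenSet[str], transactions: List[FrozenSet[str]]) -> int:
    """Absolute support: number of transactions containing `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def rule_metrics(lhs: Iterable[str], rhs: Iterable[str],
                 transactions: List[FrozenSet[str]]) -> Tuple[float, float]:
    """Return (support, confidence) of the rule lhs -> rhs."""
    x, y = frozenset(lhs), frozenset(rhs)
    n_x = count(x, transactions)
    n_xy = count(x | y, transactions)
    support = n_xy / len(transactions)          # P(X ∪ Y in a transaction)
    confidence = n_xy / n_x if n_x else 0.0     # sup(X ∪ Y) / sup(X)
    return support, confidence


def is_strong_rule(lhs: Iterable[str], rhs: Iterable[str],
                   transactions: List[FrozenSet[str]],
                   minsup: float, minconf: float) -> bool:
    """Check a rule against both thresholds."""
    s, c = rule_metrics(lhs, rhs, transactions)
    return s >= minsup and c >= minconf


if __name__ == "__main__":
    print(rule_metrics({"Beer"}, {"Diaper"}, TRANSACTIONS))
    print(rule_metrics({"Diaper"}, {"Beer"}, TRANSACTIONS))
```

Note that the two directions of a rule share the same support but generally differ in confidence, because the denominator sup(X) changes.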
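Drawback of frequent patterns: there are often too many of them. Every non-empty subset of a frequent itemset is itself frequent, so a single frequent 100-itemset already implies 2^100 − 1 frequent patterns.
So we need a compression method.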
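To see how quickly the number of frequent patterns grows, here is a small brute-force sketch. The function `frequent_itemsets` and the eight-item transaction are hypothetical, and the exhaustive enumeration is only meant for tiny inputs, not as a practical mining algorithm.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def frequent_itemsets(transactions: List[FrozenSet[str]],
                      min_count: int) -> Dict[FrozenSet[str], int]:
    """Enumerate every candidate itemset and keep those with enough support."""
    items = sorted(set().union(*transactions))
    result: Dict[FrozenSet[str], int] = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            c = frozenset(candidate)
            n = sum(1 for t in transactions if c <= t)
            if n >= min_count:
                result[c] = n
    return result


# Hypothetical database: one 8-item transaction that occurs twice.
long_txn = frozenset("abcdefgh")
patterns = frequent_itemsets([long_txn, long_txn], min_count=2)

if __name__ == "__main__":
    print(len(patterns))  # 2**8 - 1 = 255
```

A single long transaction pattern is enough to produce 255 frequent itemsets here, all with the same support, which is exactly the redundancy that compression aims to remove.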
Closed Pattern & Max Pattern
Closed patterns: A pattern (itemset) X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X.
* Closed pattern is a lossless compression of frequent patterns
* Reduces the # of patterns but does not lose the support information!
Note
: Here lossless means that, given the set of closed frequent patterns, we can not only find the set of max frequent patterns but also recover the set of all frequent patterns and their supports.
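The recovery step can be sketched as follows: the support of any frequent itemset equals the largest support among the closed patterns that contain it. The closed-pattern set `CLOSED` below is a hypothetical example (it corresponds to a database with the two transactions {a, b, c, d} and {a, b} and a minimum support count of 1), and the function names are assumptions.

```python
from typing import Dict, FrozenSet, Iterable, List, Optional

# Hypothetical closed frequent patterns with their absolute supports.
CLOSED: Dict[FrozenSet[str], int] = {
    frozenset("ab"): 2,
    frozenset("abcd"): 1,
}


def recover_support(itemset: Iterable[str],
                    closed: Dict[FrozenSet[str], int]) -> Optional[int]:
    """Support of `itemset`, or None if it is not frequent.

    An itemset is frequent iff some closed pattern contains it; its support
    is the maximum support over those closed super-patterns.
    """
    x = frozenset(itemset)
    supports = [s for c, s in closed.items() if x <= c]
    return max(supports) if supports else None


def max_patterns_from_closed(closed: Dict[FrozenSet[str], int]) -> List[FrozenSet[str]]:
    """Closed patterns that have no proper super-pattern among the closed ones."""
    return [c for c in closed if not any(c < other for other in closed)]


if __name__ == "__main__":
    print(recover_support("a", CLOSED))
    print(recover_support("ac", CLOSED))
    print(recover_support("e", CLOSED))
    print(max_patterns_from_closed(CLOSED))
```

Two closed patterns are enough here to answer support queries for all 15 frequent itemsets.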
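Max-patterns: A pattern X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
* Max-pattern is a lossy compression: we know that every sub-pattern of a max-pattern is frequent, but we no longer know its actual support.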
Frequent Pattern | Support | Closed Pattern | Max Pattern |
---|---|---|---|
Beer, Nuts, Diaper | 10 | Y | N |
Beer, Coffee, Diaper, Nuts | 20 | Y | Y |
Beer, Diaper, Eggs | 30 | N | N |
Beer, Nuts, Eggs, Milk | 40 | Y | N |
Beer, Nuts, Diaper, Eggs, Milk | 30 | Y | Y |
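The Y/N labels in the table can be reproduced by applying the two definitions mechanically. The sketch below assumes that the five listed patterns are the complete set of frequent patterns and takes the supports as given. The names `TABLE` and `classify` are hypothetical.

```python
from typing import Dict, FrozenSet, Tuple

# Patterns and supports copied from the table above.
TABLE: Dict[FrozenSet[str], int] = {
    frozenset({"Beer", "Nuts", "Diaper"}): 10,
    frozenset({"Beer", "Coffee", "Diaper", "Nuts"}): 20,
    frozenset({"Beer", "Diaper", "Eggs"}): 30,
    frozenset({"Beer", "Nuts", "Eggs", "Milk"}): 40,
    frozenset({"Beer", "Nuts", "Diaper", "Eggs", "Milk"}): 30,
}


def classify(patterns: Dict[FrozenSet[str], int]) -> Dict[FrozenSet[str], Tuple[bool, bool]]:
    """Map each frequent pattern to (is_closed, is_max)."""
    result = {}
    for x, sup in patterns.items():
        # Supports of all frequent proper super-patterns of x.
        super_supports = [s for y, s in patterns.items() if x < y]
        is_closed = all(s != sup for s in super_supports)  # none with equal support
        is_max = not super_supports                        # no frequent super-pattern
        result[x] = (is_closed, is_max)
    return result


if __name__ == "__main__":
    for pattern, (closed, is_max) in classify(TABLE).items():
        print(sorted(pattern), "closed" if closed else "-", "max" if is_max else "-")
```

Every max-pattern is also closed, since a pattern with no frequent super-pattern cannot have one with equal support, which matches the two rows marked Y in both columns.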
Recommended Readings
R. Agrawal, T. Imielinski, and A. Swami, “Mining association rules between sets of items in large databases”, in Proc. of SIGMOD’93
R. J. Bayardo, “Efficiently mining long patterns from databases”, in Proc. of SIGMOD’98
N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, “Discovering frequent closed itemsets for association rules”, in Proc. of ICDT’99
J. Han, H. Cheng, D. Xin, and X. Yan, “Frequent Pattern Mining: Current Status and Future Directions”, Data Mining and Knowledge Discovery, 15(1): 55-86, 2007