Pattern Discovery Basic Concepts

@(Pattern Discovery in Data Mining)[Pattern Discovery]

This article introduces the basic concepts of pattern mining.

Pattern: A set of items, subsequences, or substructures that occur frequently together (or are strongly correlated) in a data set.

Motivation for pattern discovery in data:

* To find what customers are likely to buy after purchasing certain goods;

* To find which code segments are likely to contain copy-and-paste bugs;

* To find what kinds of events tend to happen after certain news is posted;

* What products were often purchased together?

* What are the subsequent purchases after buying an iPad?

* What code segments likely contain copy-and-paste bugs?

* What word sequences likely form phrases in this corpus?

* …

In summary, pattern discovery is important for the following reasons:

* It finds inherent regularities in a data set

* It is the foundation for many essential data mining tasks

    * Association, correlation, and causality analysis

    * Mining sequential and structural (e.g., sub-graph) patterns

    * Pattern analysis in spatiotemporal, multimedia, time-series, and stream data

    * Classification: discriminative pattern-based analysis

    * Cluster analysis: pattern-based subspace clustering

* It has broad applications

    * Market basket analysis, cross-marketing, catalog design, sale campaign analysis, Web log analysis, biological sequence analysis

TODO: elaborate on the specific applications listed above

Frequent Pattern and Association Rule

Itemset: A set of one or more items

k-itemset: X = {x1, ..., xk}, an itemset containing k items

(absolute) support (count) of X: The frequency of X, i.e., the number of transactions in which the itemset X occurs

(relative) support, s: The fraction of transactions that contain X (i.e., the probability that a transaction contains X)

frequent pattern: An itemset X is frequent if the support of X is no less than a minsup threshold (denoted as σ)
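
A minimal brute-force sketch of these definitions, assuming a small hypothetical transaction database (the five baskets below are illustrative and not taken from these notes) and minsup = 50%. The function names are made up for this example.

```python
from itertools import combinations

# Hypothetical toy transactions, for illustration only.
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support_count(itemset, transactions):
    """Absolute support: number of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t)

def relative_support(itemset, transactions):
    """Relative support: fraction of transactions that contain `itemset`."""
    return support_count(itemset, transactions) / len(transactions)

def frequent_itemsets(transactions, minsup):
    """Enumerate every itemset and keep those with relative support >= minsup."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = relative_support(combo, transactions)
            if s >= minsup:
                result[frozenset(combo)] = s
    return result

freq = frequent_itemsets(transactions, minsup=0.5)
for itemset, s in freq.items():
    print(sorted(itemset), s)
```

With these baskets, {Beer, Diaper} occurs in 3 of 5 transactions, so its relative support is 0.6 and it is frequent at minsup = 0.5. Enumerating all itemsets is exponential in the number of items, so this approach only suits toy data.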

association rule: X → Y (s, c)

* support s: The probability that a transaction contains X∪Y.

* confidence c: The conditional probability that a transaction containing X also contains Y

* c(X→Y)=sup(X∪Y)/sup(X)
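
As a worked example with made-up counts (not from these notes): suppose a database D has 5 transactions, of which 4 contain Diaper and 3 contain both Diaper and Beer. For the rule Diaper → Beer, the two measures would be:

```latex
s = \frac{\mathrm{sup}(\{\text{Diaper}\} \cup \{\text{Beer}\})}{|D|} = \frac{3}{5} = 60\%,
\qquad
c = \frac{\mathrm{sup}(\{\text{Diaper}\} \cup \{\text{Beer}\})}{\mathrm{sup}(\{\text{Diaper}\})} = \frac{3}{4} = 75\%
```

Here sup(·) is the absolute support count. The support is measured against the whole database, while the confidence is measured only against the transactions that contain the left-hand side.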

Association rule mining: Find all rules X→Y whose support and confidence are no less than the given minimum support and minimum confidence thresholds.
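
The task can be sketched by brute force: enumerate each itemset Z that meets the support threshold, then try every split of Z into X and Y = Z \ X. This is only an illustration under assumed inputs (the transactions, the thresholds, and the name `mine_rules` are hypothetical), not an efficient mining algorithm.

```python
from itertools import combinations

# Hypothetical toy transactions, for illustration only.
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def mine_rules(transactions, minsup, minconf):
    """Return (X, Y, support, confidence) for every rule meeting both thresholds."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    rules = []
    for k in range(2, len(items) + 1):
        for combo in combinations(items, k):
            z = frozenset(combo)
            sup_z = count(z, transactions)
            # Skip itemsets that never occur or fall below the support threshold.
            if sup_z == 0 or sup_z / n < minsup:
                continue
            # Every non-empty proper subset of Z is a candidate left-hand side X.
            for r in range(1, k):
                for lhs in combinations(combo, r):
                    x = frozenset(lhs)
                    conf = sup_z / count(x, transactions)
                    if conf >= minconf:
                        rules.append((x, z - x, sup_z / n, conf))
    return rules

rules = mine_rules(transactions, minsup=0.5, minconf=0.5)
for x, y, s, c in rules:
    print(sorted(x), "->", sorted(y), f"(s={s:.0%}, c={c:.0%})")
```

On this toy data only {Beer, Diaper} is a frequent itemset with more than one item, so exactly two rules come out: Beer → Diaper and Diaper → Beer.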

Drawback of frequent patterns: there are usually far too many of them.

So we need a way to compress the set of frequent patterns.
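
The reason for the blow-up is that every non-empty subset of a frequent itemset is itself frequent (any transaction containing the itemset also contains the subset), so a single frequent k-itemset implies 2^k - 1 frequent patterns. A small sketch, using a hypothetical 4-item pattern and made-up helper names:

```python
from itertools import combinations

def nonempty_subsets(itemset):
    """All non-empty subsets of `itemset`; each is frequent if `itemset` is."""
    items = sorted(itemset)
    return [
        frozenset(c)
        for k in range(1, len(items) + 1)
        for c in combinations(items, k)
    ]

def implied_pattern_count(k):
    """Number of frequent patterns implied by one frequent k-itemset."""
    return 2 ** k - 1

pattern = {"Beer", "Diaper", "Eggs", "Milk"}  # hypothetical frequent 4-itemset
subsets = nonempty_subsets(pattern)
print(len(subsets))                # same value as implied_pattern_count(4)
print(implied_pattern_count(100))  # a 31-digit number for a single 100-itemset
```

Even one long frequent pattern therefore makes full enumeration impractical, which is what motivates the compressed representations.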

Closed Pattern & Max Pattern

Closed patterns: A pattern (itemset) X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X.

* Closed pattern is a lossless compression of frequent patterns

* Reduces the # of patterns but does not lose the support information!

Note: Here lossless means that, given the set of closed frequent patterns, we can not only find the set of max frequent patterns but also recover the set of all frequent patterns and their supports.
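
One way to sketch the recovery step: the support of a frequent itemset X equals the largest support among the closed patterns that contain X, and an itemset with no closed superset is not frequent. The closed set below is a hypothetical example (it corresponds to two transactions {a, b} and {a, b, c, d} with a minimum support count of 1), and `support_from_closed` is a name invented here.

```python
def support_from_closed(itemset, closed):
    """Recover the support of `itemset` from closed patterns, or None if not frequent."""
    itemset = frozenset(itemset)
    supports = [s for c, s in closed.items() if itemset <= c]
    return max(supports) if supports else None

# Hypothetical closed frequent patterns with their absolute supports.
closed = {
    frozenset("ab"): 2,
    frozenset("abcd"): 1,
}

print(support_from_closed("a", closed))   # contained in both closed patterns
print(support_from_closed("ac", closed))  # contained only in {a, b, c, d}
print(support_from_closed("e", closed))   # no closed superset, so not frequent
```

Two closed patterns are enough here to answer support queries for all 15 frequent itemsets over {a, b, c, d}.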

Max-patterns: A pattern X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X.

* Max-pattern is a lossy compression!

| Frequent pattern | Support | Closed pattern | Max pattern |
| --- | --- | --- | --- |
| Beer, Nuts, Diaper | 10 | Y | N |
| Beer, Coffee, Diaper, Nuts | 20 | Y | Y |
| Beer, Diaper, Eggs | 30 | N | N |
| Beer, Nuts, Eggs, Milk | 40 | Y | N |
| Beer, Nuts, Diaper, Eggs, Milk | 30 | Y | Y |
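
The two definitions can also be checked mechanically. The sketch below uses its own hypothetical transactions (not the numbers in the table above) with an assumed minimum support count of 2: it enumerates the frequent itemsets by brute force, then filters them into closed patterns and max-patterns.

```python
from itertools import combinations

# Hypothetical toy transactions, for illustration only.
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]
MIN_COUNT = 2  # assumed absolute minsup threshold

def frequent_supports(transactions, min_count):
    """Map each frequent itemset to its absolute support (brute force)."""
    items = sorted(set().union(*transactions))
    freq = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            c = frozenset(combo)
            n = sum(1 for t in transactions if c <= t)
            if n >= min_count:
                freq[c] = n
    return freq

freq = frequent_supports(transactions, MIN_COUNT)

# Closed: no proper superset has the same support. A superset that ties with a
# frequent itemset is itself frequent, so searching within `freq` is enough.
closed = {
    x: n for x, n in freq.items()
    if not any(x < y and freq[y] == n for y in freq)
}

# Max: no proper superset is frequent at all.
maximal = {x: n for x, n in freq.items() if not any(x < y for y in freq)}

print(len(freq), len(closed), len(maximal))
```

On this data the 14 frequent itemsets shrink to 8 closed patterns and 5 max-patterns. {Diaper} (support 4) is closed but not maximal, so the max-patterns alone only reveal that {Diaper} is frequent, not how often it occurs, which is the sense in which max-patterns are lossy.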

Recommended Readings

R. Agrawal, T. Imielinski, and A. Swami, “Mining association rules between sets of items in large databases”, in Proc. of SIGMOD’93

R. J. Bayardo, “Efficiently mining long patterns from databases”, in Proc. of SIGMOD’98

N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, “Discovering frequent closed itemsets for association rules”, in Proc. of ICDT’99

J. Han, H. Cheng, D. Xin, and X. Yan, “Frequent Pattern Mining: Current Status and Future Directions”, Data Mining and Knowledge Discovery, 15(1): 55-86, 2007

Date: 2024-08-13 17:48:04
