Spark FPGrowth (Frequent Pattern Mining)

Given a dataset of transactions, the first step of FP-growth is to compute item frequencies and identify the frequent items. Unlike Apriori-like algorithms designed for the same purpose, the second step of FP-growth uses a suffix-tree (FP-tree) structure to encode the transactions without explicitly generating candidate sets, which are usually expensive to generate. After the second step, the frequent itemsets can be extracted from the FP-tree.
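The first step can be sketched in plain Scala (no Spark): count how often each item occurs, then keep only the items whose count meets the minimum support threshold. This is an illustrative sketch of the idea, not MLlib's actual implementation.

```scala
// Sketch of FP-growth's first step: find frequent single items.
def frequentItems(transactions: Seq[Seq[String]], minSupport: Double): Map[String, Int] = {
  // an item is frequent if it appears in at least ceil(minSupport * n) transactions
  val minCount = math.ceil(minSupport * transactions.size).toInt
  transactions.flatten
    .groupBy(identity)
    .collect { case (item, occ) if occ.size >= minCount => item -> occ.size }
}

val txns = Seq(Seq("1", "2", "5"), Seq("1", "2", "3", "5"), Seq("1", "2"))
// item "3" occurs only once (< 2 = ceil(0.5 * 3)) and is filtered out
println(frequentItems(txns, 0.5))
```

With `minSupport = 0.5` over these 3 transactions, items `1`, `2`, and `5` survive; `3` does not, which matches the singleton rows in the `freqItemSets` table below.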

import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

val spark = SparkSession
  .builder()
  .appName("FPGrowthExample")
  .getOrCreate()

// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._

val data = List(
            "1,2,5",
            "1,2,3,5",
            "1,2").toDF("items")
data: org.apache.spark.sql.DataFrame = [items: string]

// Note the leading "[" and trailing "]" that Row.toString adds to each line
data.rdd.map { s => s.toString() }.collect().take(3)
res20: Array[String] = Array([1,2,5], [1,2,3,5], [1,2])

val transactions: RDD[Array[String]] = data.rdd.map { row =>
  // strip the surrounding brackets from Row.toString, then split on commas
  val str = row.toString().drop(1).dropRight(1)
  str.trim().split(",")
}
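The bracket stripping above can be checked in plain Scala; `Row.toString` renders a single-column row as `"[1,2,3,5]"`, so dropping the first and last characters recovers the raw comma-separated string:

```scala
// Plain-Scala illustration of the bracket stripping done above.
val rowString = "[1,2,3,5]"          // what Row.toString produces for one row
val items = rowString.drop(1).dropRight(1).trim.split(",")
println(items.mkString("|"))         // 1|2|3|5
```

A more robust alternative is `row.getString(0).split(",")`, which reads the column value directly instead of relying on `Row`'s string rendering.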

val fpg = new FPGrowth().setMinSupport(0.5).setNumPartitions(8)

val model = fpg.run(transactions)

/* model.freqItemsets.collect().foreach { itemset =>
            println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
          }*/

val freqItemSets = model.freqItemsets.map { itemset =>
  val items = itemset.items.mkString(",")
  val freq = itemset.freq
  (items, freq)
}.toDF("items", "freq")
freqItemSets: org.apache.spark.sql.DataFrame = [items: string, freq: bigint]

freqItemSets.show
+-----+----+
|items|freq|
+-----+----+
|    1|   3|
|    2|   3|
|  2,1|   3|
|    5|   2|
|  5,2|   2|
|5,2,1|   2|
|  5,1|   2|
+-----+----+
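With `minSupport = 0.5` over 3 transactions, an itemset must appear in at least ceil(0.5 × 3) = 2 transactions to be reported. The `freq` column can be verified by hand with a plain-Scala sketch:

```scala
// Verify the freq column: count transactions that contain each itemset.
val txns = Seq(Set("1", "2", "5"), Set("1", "2", "3", "5"), Set("1", "2"))
def freq(itemset: Set[String]): Int = txns.count(t => itemset.subsetOf(t))

println(freq(Set("2", "1")))      // 3
println(freq(Set("5", "2", "1"))) // 2
println(freq(Set("3")))           // 1 -- below the minimum count of 2, so absent above
```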

val minConfidence = 0.6
minConfidence: Double = 0.6

/*model.generateAssociationRules(minConfidence).collect().foreach { rule =>
            println(
              rule.antecedent.mkString("[", ",", "]")
                + " => " + rule.consequent.mkString("[", ",", "]")
                + ", " + rule.confidence)
          }*/

// Generate association rules that meet the minimum confidence threshold
val Rules = model.generateAssociationRules(minConfidence)
Rules: org.apache.spark.rdd.RDD[org.apache.spark.mllib.fpm.AssociationRules.Rule[String]] = MapPartitionsRDD[129] at filter at AssociationRules.scala:80

val df = Rules.map { rule =>
  val antecedent = rule.antecedent.mkString(",")
  val consequent = rule.consequent.mkString(",")
  (antecedent, consequent, rule.confidence)
}.toDF("left_collect", "right_collect", "confidence")
df: org.apache.spark.sql.DataFrame = [left_collect: string, right_collect: string ... 1 more field]

df.show
+------------+-------------+------------------+
|left_collect|right_collect|        confidence|
+------------+-------------+------------------+
|           2|            5|0.6666666666666666|
|           2|            1|               1.0|
|         5,2|            1|               1.0|
|           5|            2|               1.0|
|           5|            1|               1.0|
|           1|            5|0.6666666666666666|
|           1|            2|               1.0|
|         2,1|            5|0.6666666666666666|
|         5,1|            2|               1.0|
+------------+-------------+------------------+
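The `confidence` column follows the usual definition confidence(X ⇒ Y) = freq(X ∪ Y) / freq(X); e.g. the first row is freq({2,5}) / freq({2}) = 2/3 ≈ 0.667. A plain-Scala check:

```scala
// Reproduce the confidence column: confidence(X => Y) = freq(X ∪ Y) / freq(X)
val txns = Seq(Set("1", "2", "5"), Set("1", "2", "3", "5"), Set("1", "2"))
def freq(s: Set[String]): Int = txns.count(t => s.subsetOf(t))
def confidence(lhs: Set[String], rhs: Set[String]): Double =
  freq(lhs ++ rhs).toDouble / freq(lhs)

println(confidence(Set("2"), Set("5")))      // 0.6666666666666666
println(confidence(Set("5", "2"), Set("1"))) // 1.0
```

Only rules whose confidence is at least `minConfidence` (0.6 here) appear in the table, which is why, say, 1 ⇒ 3 (confidence 1/3) is absent.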
Time: 2024-07-30 18:55:40
