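relevance: used mainly for internet content and documents, for example the relatedness between documents in a search engine's ranking algorithm.
association: used for real-world things, for example the degree of association between products on an e-commerce site.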
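Confidence: the probability that B occurs given that A has already appeared in the data set. It is computed as: (probability that A and B appear together) / (probability that A appears).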
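The confidence formula above can be sketched in a few lines of Java. This is only an illustration: the class name `SupportConfidence`, its helper methods, and the four sample transactions are made up for this example and do not come from any library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch: the class, method names, and sample data are hypothetical.
class SupportConfidence {

    // Fraction of transactions that contain every item in `items`.
    static double support(List<Set<String>> transactions, Set<String> items) {
        if (transactions.isEmpty()) {
            return 0.0;
        }
        long hits = transactions.stream().filter(t -> t.containsAll(items)).count();
        return (double) hits / transactions.size();
    }

    // confidence(A -> B) = support(A and B) / support(A); NaN when A never occurs.
    static double confidence(List<Set<String>> transactions, Set<String> a, Set<String> b) {
        Set<String> both = new HashSet<>(a);
        both.addAll(b);
        double supportA = support(transactions, a);
        return supportA == 0.0 ? Double.NaN : support(transactions, both) / supportA;
    }

    // Four made-up shopping baskets.
    static List<Set<String>> sampleTransactions() {
        List<Set<String>> tx = new ArrayList<>();
        tx.add(new HashSet<>(Arrays.asList("diapers", "beer", "milk")));
        tx.add(new HashSet<>(Arrays.asList("diapers", "beer")));
        tx.add(new HashSet<>(Arrays.asList("diapers", "bread")));
        tx.add(new HashSet<>(Arrays.asList("milk", "bread")));
        return tx;
    }

    public static void main(String[] args) {
        List<Set<String>> tx = sampleTransactions();
        Set<String> diapers = new HashSet<>(Arrays.asList("diapers"));
        Set<String> beer = new HashSet<>(Arrays.asList("beer"));
        // diapers appear in 3 of 4 baskets; beer appears in 2 of those 3.
        System.out.println("support(diapers) = " + support(tx, diapers));
        System.out.println("confidence(diapers -> beer) = " + confidence(tx, diapers, beer));
    }
}
```

Note that confidence is not symmetric: in this sample, confidence(beer -> diapers) is higher than confidence(diapers -> beer), because every basket with beer also contains diapers.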
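A classic example of association rule mining is market basket analysis (MBA). Association rule research helps uncover the relationships between different products (items) in a transaction database and reveals customer purchasing patterns, such as how buying one product affects the purchase of others. The results can be applied to shelf layout, inventory planning, and segmenting customers by purchasing pattern.
The first step is to iteratively identify all frequent itemsets, requiring that the support of each frequent itemset be no lower than the user-specified minimum.
The first stage must find all frequent itemsets (Large Itemsets) in the raw data set. "Frequent" means that the frequency with which an itemset appears, relative to all records, must reach a certain level. Take a 2-itemset containing the two items A and B as an example: we can compute the support of the itemset {A,B}, and if that support is greater than or equal to the chosen minimum support threshold (Minimum Support), {A,B} is called a frequent itemset. A k-itemset that satisfies the minimum support is called a frequent k-itemset, usually written Large k or Frequent k. The algorithm then tries to generate Large k+1, the itemsets of length k+1, from the Large k itemsets, and repeats until no longer frequent itemset can be found.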
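The level-by-level search described above (Large k to Large k+1) can be sketched as follows. This is a simplified illustration, not a tuned Apriori implementation: candidates are formed by joining pairs of frequent k-itemsets, and the usual subset-pruning step is skipped because infrequent candidates are dropped by the support count anyway. The class name `FrequentItemsets` and the sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch: the class, method names, and sample data are hypothetical.
class FrequentItemsets {

    // Fraction of transactions that contain every item in `items`.
    static double support(List<Set<String>> transactions, Set<String> items) {
        long hits = transactions.stream().filter(t -> t.containsAll(items)).count();
        return (double) hits / transactions.size();
    }

    // Returns every itemset whose support is >= minSupport, found level by level.
    static List<Set<String>> mine(List<Set<String>> transactions, double minSupport) {
        List<Set<String>> result = new ArrayList<>();
        // Level 1: every distinct item is a candidate 1-itemset.
        Set<Set<String>> candidates = new LinkedHashSet<>();
        for (Set<String> t : transactions) {
            for (String item : t) {
                candidates.add(new TreeSet<>(Arrays.asList(item)));
            }
        }
        int k = 1;
        while (!candidates.isEmpty()) {
            // Keep only the candidates that reach the minimum support.
            List<Set<String>> frequent = new ArrayList<>();
            for (Set<String> c : candidates) {
                if (support(transactions, c) >= minSupport) {
                    frequent.add(c);
                }
            }
            result.addAll(frequent);
            // Join pairs of frequent k-itemsets into candidate (k+1)-itemsets.
            candidates = new LinkedHashSet<>();
            for (int i = 0; i < frequent.size(); i++) {
                for (int j = i + 1; j < frequent.size(); j++) {
                    Set<String> union = new TreeSet<>(frequent.get(i));
                    union.addAll(frequent.get(j));
                    if (union.size() == k + 1) {
                        candidates.add(union);
                    }
                }
            }
            k++;
        }
        return result;
    }

    // Four made-up shopping baskets.
    static List<Set<String>> sampleTransactions() {
        List<Set<String>> tx = new ArrayList<>();
        tx.add(new HashSet<>(Arrays.asList("diapers", "beer", "milk")));
        tx.add(new HashSet<>(Arrays.asList("diapers", "beer")));
        tx.add(new HashSet<>(Arrays.asList("diapers", "bread")));
        tx.add(new HashSet<>(Arrays.asList("milk", "bread")));
        return tx;
    }

    public static void main(String[] args) {
        // With a 50% minimum support, only one 2-itemset survives in this sample.
        for (Set<String> itemset : mine(sampleTransactions(), 0.5)) {
            System.out.println(itemset);
        }
    }
}
```

The loop stops as soon as a level produces no candidates, which matches the "until no longer frequent itemset can be found" condition in the text.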
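The second stage of association rule mining is to generate the association rules. Rules are derived from the frequent k-itemsets found in the previous step: under the minimum confidence threshold (Minimum Confidence), a candidate rule whose confidence meets the minimum confidence is called an association rule.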
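A minimal sketch of this second stage is shown below, assuming the supports of the frequent itemsets are already available in a map. Each frequent itemset is split into every antecedent/consequent pair, and a rule is kept only when support(itemset) / support(antecedent) reaches the minimum confidence. The class name `RuleGeneration`, the string format of the rules, and the support values are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch: the class, method names, and sample supports are hypothetical.
class RuleGeneration {

    // Returns rules "antecedent -> consequent" derived from one frequent itemset
    // whose confidence = support(itemset) / support(antecedent) >= minConfidence.
    static List<String> rules(Set<String> itemset, Map<Set<String>, Double> support,
                              double minConfidence) {
        List<String> items = new ArrayList<>(new TreeSet<>(itemset));
        List<String> kept = new ArrayList<>();
        Double supportAll = support.get(new TreeSet<>(itemset));
        if (supportAll == null) {
            return kept;
        }
        int n = items.size();
        // Each bitmask selects a non-empty proper subset as the antecedent.
        for (int mask = 1; mask < (1 << n) - 1; mask++) {
            Set<String> antecedent = new TreeSet<>();
            Set<String> consequent = new TreeSet<>();
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) {
                    antecedent.add(items.get(i));
                } else {
                    consequent.add(items.get(i));
                }
            }
            Double supportA = support.get(antecedent);
            if (supportA == null || supportA == 0.0) {
                continue; // antecedent support unknown or zero: skip this split
            }
            if (supportAll / supportA >= minConfidence) {
                kept.add(antecedent + " -> " + consequent);
            }
        }
        return kept;
    }

    // Made-up supports for a tiny example.
    static Map<Set<String>, Double> sampleSupport() {
        Map<Set<String>, Double> support = new HashMap<>();
        support.put(new TreeSet<>(Arrays.asList("diapers")), 0.75);
        support.put(new TreeSet<>(Arrays.asList("beer")), 0.5);
        support.put(new TreeSet<>(Arrays.asList("beer", "diapers")), 0.5);
        return support;
    }

    public static void main(String[] args) {
        Set<String> itemset = new TreeSet<>(Arrays.asList("beer", "diapers"));
        System.out.println(rules(itemset, sampleSupport(), 0.65));
    }
}
```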
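Take the "beer + diapers" case. To mine the records in the transaction database with association rule mining, you first have to set two thresholds, the minimum support and the minimum confidence. Assume here that min-support = 5% and min-confidence = 65%. A qualifying association rule must satisfy both conditions at once. If the mined rule {diapers, beer} satisfies the conditions below, it is accepted as an association rule. As a formula:
Support(diapers, beer) ≥ 5% and Confidence(diapers, beer) ≥ 65%.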
Apriori algorithm
How should the Pearson correlation coefficient be understood?
private double doItemSimilarity(long itemID1, long itemID2, long preferring1, long numUsers) throws TasteException {
  DataModel dataModel = getDataModel();
  // number of users who expressed a preference for both items
  long preferring1and2 = dataModel.getNumUsersWithPreferenceFor(itemID1, itemID2);
  if (preferring1and2 == 0) {
    return Double.NaN;
  }
  long preferring2 = dataModel.getNumUsersWithPreferenceFor(itemID2);
  double logLikelihood =
      LogLikelihood.logLikelihoodRatio(preferring1and2,
                                       preferring2 - preferring1and2,
                                       preferring1 - preferring1and2,
                                       numUsers - preferring1 - preferring2 + preferring1and2);
  // map the unbounded ratio into [0, 1)
  return 1.0 - 1.0 / (1.0 + logLikelihood);
}
long preferring1and2 = dataModel.getNumUsersWithPreferenceFor(itemID1, itemID2);
long preferring1 = dataModel.getNumUsersWithPreferenceFor(itemID1);
long preferring2 = dataModel.getNumUsersWithPreferenceFor(itemID2);
long numUsers = dataModel.getNumUsers();
k11: preferring1and2
k12: preferring2 - preferring1and2
k21: preferring1 - preferring1and2
k22: numUsers - preferring1 - preferring2 + preferring1and2
| | Event A | Everything but A |
| Event B | k11 | k12 |
| Everything but B | k21 | k22 |
LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))
H = function(k) {N = sum(k) ; return (sum(k/N * log(k/N + (k==0))))}
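It may help to see why this formula agrees with the Java implementation, which works on raw counts rather than probabilities. The following is a short derivation, assuming natural logarithms and the convention 0 ln 0 = 0; the symbol E for the unnormalized entropy is introduced here only for illustration. H is the R function above (a sum of p log p, with no minus sign).

$$
E(k) = N \ln N - \sum_i k_i \ln k_i = -\sum_i k_i \ln\frac{k_i}{N} = -N\,H(k), \qquad N = \sum_i k_i
$$

Applying this to the row sums, the column sums, and the full matrix, which all share the same total N:

$$
2\,\bigl(E(\mathrm{rowSums}(k)) + E(\mathrm{colSums}(k)) - E(k)\bigr) = 2N\,\bigl(H(k) - H(\mathrm{rowSums}(k)) - H(\mathrm{colSums}(k))\bigr) = \mathrm{LLR}
$$

The left-hand side is the expression the Java code returns, 2.0 * (rowEntropy + columnEntropy - matrixEntropy).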
/**
 * Calculates the Raw Log-likelihood ratio for two events, call them A and B. Then we have:
 * <p/>
 * <table border="1" cellpadding="5" cellspacing="0">
 * <tbody><tr><td> </td><td>Event A</td><td>Everything but A</td></tr>
 * <tr><td>Event B</td><td>A and B together (k_11)</td><td>B, but not A (k_12)</td></tr>
 * <tr><td>Everything but B</td><td>A without B (k_21)</td><td>Neither A nor B (k_22)</td></tr></tbody>
 * </table>
 *
 * @param k11 The number of times the two events occurred together
 * @param k12 The number of times the second event occurred WITHOUT the first event
 * @param k21 The number of times the first event occurred WITHOUT the second event
 * @param k22 The number of times something else occurred (i.e. was neither of these events)
 * @return The raw log-likelihood ratio
 *
 * <p/>
 * Credit to http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html for the table and the descriptions.
 */
public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
  Preconditions.checkArgument(k11 >= 0 && k12 >= 0 && k21 >= 0 && k22 >= 0);
  // note that we have counts here, not probabilities, and that the entropy is not normalized.
  double rowEntropy = entropy(k11 + k12, k21 + k22);
  double columnEntropy = entropy(k11 + k21, k12 + k22);
  double matrixEntropy = entropy(k11, k12, k21, k22);
  if (rowEntropy + columnEntropy < matrixEntropy) {
    // round off error
    return 0.0;
  }
  return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
}
/**
 * Merely an optimization for the common two argument case of {@link #entropy(long...)}
 * @see #logLikelihoodRatio(long, long, long, long)
 */
private static double entropy(long a, long b) {
  return xLogX(a + b) - xLogX(a) - xLogX(b);
}
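The excerpts above call `xLogX` and the varargs `entropy(long...)` without showing them. The sketch below fills them in so that the whole calculation can be run on its own. It is a reconstruction under assumptions, not the Mahout source: `xLogX` is assumed to be x ln x with 0 ln 0 treated as 0, `entropy` is assumed to be the unnormalized form N ln N minus the sum of k ln k, and the class name `LlrSketch` is made up.

```java
// Illustrative sketch: a standalone re-creation, not the Mahout implementation.
class LlrSketch {

    // Assumed helper: x * ln(x), with the convention 0 * ln(0) = 0.
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Assumed helper: unnormalized entropy of counts, N*ln(N) - sum(k*ln(k)).
    static double entropy(long... counts) {
        long sum = 0;
        double result = 0.0;
        for (long c : counts) {
            result += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - result;
    }

    // Same arithmetic as the logLikelihoodRatio method quoted above.
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        if (k11 < 0 || k12 < 0 || k21 < 0 || k22 < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point round-off.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    // Same mapping as doItemSimilarity: LLR in [0, inf) becomes a value in [0, 1).
    static double similarity(long k11, long k12, long k21, long k22) {
        return 1.0 - 1.0 / (1.0 + logLikelihoodRatio(k11, k12, k21, k22));
    }

    public static void main(String[] args) {
        // Perfectly associated events: A and B always occur together or not at all.
        System.out.println(logLikelihoodRatio(1, 0, 0, 1));
        // Independent events: every cell equally likely, so the ratio is about 0.
        System.out.println(logLikelihoodRatio(1, 1, 1, 1));
        System.out.println(similarity(1, 0, 0, 1));
    }
}
```

The two calls in `main` show the extremes: a table with all mass on the diagonal gives a clearly positive ratio, while a uniform table gives a ratio of zero, which maps to a similarity of zero.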
Information retrieval
Entropy (information theory)
Mahout Recommender Document: non-distributed
Mahout on Spark: What’s New in Recommenders
Mahout on Spark: What’s New in Recommenders, part 2
Intro to Cooccurrence Recommenders with Spark
Mahout: Scala & Spark Bindings
How to create and App using Mahout
FAQ for using Mahout with Spark
Mahout on Spark: What’s New in Recommenders, part 2
Here similar means that they were liked by the same people. We’ll use another technique to narrow the items down to ones of the same genre later.
Intro to Cooccurrence Recommenders with Spark
rp = recommendations for a given user
hp = history of purchases for a given user
A = the matrix of all purchases by all users
rp = [A^tA]hp
This would produce reasonable recommendations, but is subject to skewed results due to the dominance of popular items. To avoid that, we can apply a weighting called the log likelihood ratio (LLR), which is a probabilistic measure of the importance of a cooccurrence.
The magnitude of the value in the matrix determines the strength of similarity of row item to the column item. We can use the LLR weights as a similarity measure that is nicely immune to unimportant similarities.
Creating the indicator matrix [A^tA] is the core of this type of recommender. We have a quick, flexible way to create this using text log files and creating output that’s in an easy form to digest. The job of data prep is greatly streamlined in the Mahout 1.0 snapshot. In the past a user would have to do all the data prep themselves: translating their own user and item ids into Mahout ids, putting the data into text files, one element per line, and feeding them to the recommender. Out the other end you’d get a Hadoop binary file called a sequence file and you’d have to translate the Mahout ids into something your application could understand. No more.
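The formula rp = [A^tA]hp from the excerpt above can be sketched with plain arrays. This is a toy illustration only: it uses raw cooccurrence counts without the LLR weighting the excerpt recommends, it does not exclude items the user already owns, and the class name `CooccurrenceSketch` and the sample matrix are made up. Rows of A are users, columns are items, and the history vector is assumed to have one entry per item.

```java
import java.util.Arrays;

// Illustrative sketch: the class, method names, and sample data are hypothetical.
class CooccurrenceSketch {

    // Computes A^T A: entry [i][j] counts the users who purchased both item i and item j.
    static long[][] cooccurrence(int[][] a) {
        int items = a.length == 0 ? 0 : a[0].length;
        long[][] ata = new long[items][items];
        for (int[] userRow : a) {
            for (int i = 0; i < items; i++) {
                if (userRow[i] == 0) {
                    continue; // this user did not purchase item i
                }
                for (int j = 0; j < items; j++) {
                    ata[i][j] += (long) userRow[i] * userRow[j];
                }
            }
        }
        return ata;
    }

    // r_p = [A^T A] h_p: scores every item against one user's purchase history.
    static long[] recommend(long[][] ata, int[] history) {
        long[] scores = new long[ata.length];
        for (int i = 0; i < ata.length; i++) {
            for (int j = 0; j < history.length; j++) {
                scores[i] += ata[i][j] * history[j];
            }
        }
        return scores;
    }

    // Three users by three items; 1 means the user purchased the item.
    static int[][] samplePurchases() {
        return new int[][] {
            {1, 1, 0},
            {1, 1, 1},
            {0, 0, 1},
        };
    }

    public static void main(String[] args) {
        long[][] ata = cooccurrence(samplePurchases());
        // A user whose history contains only item 0.
        long[] scores = recommend(ata, new int[] {1, 0, 0});
        System.out.println(Arrays.toString(scores));
    }
}
```

In this sample, item 1 scores as high as item 0 itself because the same two users bought both, which is exactly the "liked by the same people" notion of similarity. With real data, popular items would dominate these raw counts, which is the motivation for the LLR weighting.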
Part 4: Tuning Your Recommender
What you wanted to know about Mean Average Precision
"Building an Item Recommender System with Mahout on Spark + Elasticsearch" (original title: 《基于mahout on spark + elastic search搭建item推荐系统》)