Five most popular similarity measures (reposted)

Five most popular similarity measures and their implementation in Python

The buzz term "similarity distance measure" has a wide variety of definitions among math and data mining practitioners. As a result, those terms, concepts, and their usage can go way over the head of a beginner encountering them for the very first time. So in this post I give simplified, intuitive definitions of similarity, and then walk through the five most popular similarity measures and their implementations.

Before explaining the different similarity distance measures, let me explain the key term similarity in data mining. Similarity is the basic building block for activities such as recommendation engines, clustering, classification, and anomaly detection.

Similarity:

Similarity is the measure of how alike two data objects are. Similarity in a data mining context is usually described as a distance whose dimensions represent features of the objects. If this distance is small, there is a high degree of similarity; a large distance means a low degree of similarity. Similarity is subjective and highly dependent on the domain and application: for example, two fruits may be similar because of color, size, or taste. Care should be taken when calculating distance across dimensions/features that are unrelated; the relative values of each feature must be normalized, or one feature could end up dominating the distance calculation. Similarity is measured in the range 0 to 1, i.e. [0,1] (a small sketch of turning a distance into such a score follows the list below).

Two main considerations about similarity:

  • Similarity = 1 if X = Y         (where X, Y are two objects)
  • Similarity = 0 if X ≠ Y (the objects have nothing in common)
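
One common convention (my own illustration here, not part of the original post) maps a raw distance into such a [0,1] similarity score. A minimal sketch in Python:

def distance_to_similarity(distance):
    # Map a non-negative distance into (0, 1]: distance 0 gives
    # similarity 1, and larger distances approach 0.
    return 1.0 / (1.0 + distance)

print(distance_to_similarity(0))  # 1.0 (identical objects)
print(distance_to_similarity(9))  # 0.1 (far apart)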

That's all about similarity; now let's move on to the five most popular similarity distance measures.

Euclidean distance:

Euclidean distance is the most commonly used distance measure; in most cases, when people talk about distance, they mean Euclidean distance. It is also known simply as "distance". When data is dense or continuous, this is the best proximity measure. The Euclidean distance between two points is the length of the straight-line path connecting them, and is given by the Pythagorean theorem.
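
For two points p1 = (x1, y1) and p2 = (x2, y2) in the plane:

Euclidean distance = √((x1 – x2)² + (y1 – y2)²)

For points with more coordinates, the same rule applies: take the square root of the sum of squared differences across all coordinates, which is exactly what the implementation below computes.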

Euclidean distance implementation in Python:

#!/usr/bin/env python

from math import sqrt

def euclidean_distance(x, y):
    # Square root of the sum of squared coordinate differences.
    return sqrt(sum(pow(a - b, 2) for a, b in zip(x, y)))

print(euclidean_distance([0, 3, 4, 5], [7, 6, 3, -1]))

Script Output:

9.746794344808963
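
Check: √((0 – 7)² + (3 – 6)² + (4 – 3)² + (5 – (–1))²) = √(49 + 9 + 1 + 36) = √95 ≈ 9.7468.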

Manhattan distance:

Manhattan distance is a metric in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates. Put simply, it is the absolute sum of the differences between the x-coordinates and the y-coordinates. Suppose we have two points A and B: to find the Manhattan distance between them, we sum the absolute variation along the x-axis and the y-axis, i.e. how far apart A and B are along each axis. In more mathematical terms, the Manhattan distance between two points is measured along axes at right angles.

In a plane with p1 at (x1, y1) and p2 at (x2, y2):

Manhattan distance = |x1 – x2| + |y1 – y2|
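
For points with more than two coordinates the same idea applies: sum the absolute differences across all coordinates, which is what the implementation below does.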

This Manhattan distance metric is also known as Manhattan length, rectilinear distance, L1 distance or L1 norm, city block distance, Minkowski's L1 distance, or taxicab metric.

Manhattan distance implementation in Python:

#!/usr/bin/env python

def manhattan_distance(x, y):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(x, y))

print(manhattan_distance([10, 20, 10], [10, 20, 20]))

Script Output:

10

Minkowski distance:

The Minkowski distance is a generalized metric that includes both the Euclidean distance and the Manhattan distance as special cases.

In the equation below, d(i, j) is the Minkowski distance between data records i and j, k indexes the variables, n is the total number of variables, and λ is the order of the Minkowski metric:

d(i, j) = (Σk=1..n |x_ik – x_jk|^λ)^(1/λ)

Although it is defined for any λ > 0, it is rarely used for values other than 1, 2, and ∞.

(The original post illustrates with a figure how the Minkowski metric of different orders measures the distance between two objects with three variables, displayed in a coordinate system with x-, y-, and z-axes.)

Synonyms of Minkowski:
Different names for the Minkowski distance or Minkowski metric arise from the order:

  • λ = 1 is the Manhattan distance. Synonyms are L1-norm, Taxicab, or City-Block distance. For two vectors of ranked ordinal variables, the Manhattan distance is sometimes called Foot-ruler distance.
  • λ = 2 is the Euclidean distance. Synonyms are L2-norm or Ruler distance. For two vectors of ranked ordinal variables, the Euclidean distance is sometimes called Spearman distance.
  • λ = ∞ is the Chebyshev distance. Synonyms are Lmax-norm or Chessboard distance. A minimal sketch of this limiting case follows the list.
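
The λ = ∞ case cannot be computed with the nth-root formula used below; in the limit it reduces to the largest absolute difference over any single coordinate. A minimal sketch in Python (the function name is my own, following the style of the other examples):

def chebyshev_distance(x, y):
    # Limit of the Minkowski distance as λ → ∞: the largest
    # absolute difference over any single coordinate.
    return max(abs(a - b) for a, b in zip(x, y))

print(chebyshev_distance([0, 3, 4, 5], [7, 6, 3, -1]))  # 7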

Minkowski distance implementation in Python:

#!/usr/bin/env python

from decimal import Decimal

def nth_root(value, n_root):
    # Use Decimal for a little extra precision when taking the root.
    root_value = 1 / float(n_root)
    return round(Decimal(value) ** Decimal(root_value), 3)

def minkowski_distance(x, y, p_value):
    # Sum |a - b|^p over all coordinates, then take the p-th root.
    return nth_root(sum(pow(abs(a - b), p_value) for a, b in zip(x, y)), p_value)

print(minkowski_distance([0, 3, 4, 5], [7, 6, 3, -1], 3))

Script Output:

8.373
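
As a quick sanity check (reusing minkowski_distance from above), order 1 should reproduce the Manhattan distance and order 2 the Euclidean distance:

print(minkowski_distance([0, 3, 4, 5], [7, 6, 3, -1], 1))  # 17.000 (the Manhattan distance)
print(minkowski_distance([0, 3, 4, 5], [7, 6, 3, -1], 2))  # 9.747 (the Euclidean distance, rounded)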

Cosine similarity:

The cosine similarity metric finds the normalized dot product of two attribute vectors. By determining the cosine similarity, we are effectively finding the cosine of the angle between the two objects. The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a judgement of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two diametrically opposed vectors have a similarity of -1, independent of their magnitude. Cosine similarity is particularly useful in positive space, where the outcome is neatly bounded in [0,1]. One of the reasons for its popularity is that it is very efficient to evaluate, especially for sparse vectors.
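
For two vectors A and B:

cosine similarity = (A · B) / (|A| × |B|)

that is, the dot product of the vectors divided by the product of their lengths, which is exactly what the implementation below computes.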

Cosine similarity implementation in Python:

#!/usr/bin/env python

from math import sqrt

def square_rooted(x):
    # Euclidean length (L2 norm) of the vector, rounded to 3 places.
    return round(sqrt(sum(a * a for a in x)), 3)

def cosine_similarity(x, y):
    # Dot product divided by the product of the two vectors' lengths.
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = square_rooted(x) * square_rooted(y)
    return round(numerator / float(denominator), 3)

print(cosine_similarity([3, 45, 7, 2], [2, 54, 13, 15]))

Script Output:

0.972

Jaccard similarity:

So far we have discussed metrics that find the similarity between objects, where the objects are points or vectors. With Jaccard similarity the objects are sets. So first, let's learn some very basic facts about sets.

Sets:

A set is an (unordered) collection of objects, e.g. {a, b, c}. We use the notation of elements separated by commas inside curly brackets { }. Sets are unordered, so {a, b} = {b, a}.

Cardinality:

The cardinality of A, denoted |A|, counts how many elements are in A.

Intersection:

The intersection of two sets A and B, denoted A ∩ B, is the set of all items that are in both A and B.

Union:

The union of two sets A and B, denoted A ∪ B, is the set of all items that are in either set (both operations are demonstrated in the sketch below).
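
These operations map directly onto Python's built-in set type; a minimal sketch (the example sets are my own):

A = {0, 1, 2, 5, 6}
B = {0, 2, 3, 5, 7, 9}

print(len(A))  # cardinality |A| → 5
print(A & B)   # intersection A ∩ B → {0, 2, 5}
print(A | B)   # union A ∪ B → {0, 1, 2, 3, 5, 6, 7, 9}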

Now back to Jaccard similarity. The Jaccard similarity measures similarity between finite sample sets, and is defined as the cardinality of the intersection of the sets divided by the cardinality of their union. For two sets A and B:

Jaccard similarity = |A ∩ B| / |A ∪ B|

Jaccard similarity implementation in Python:

#!/usr/bin/env python

def jaccard_similarity(x, y):
    # Size of the intersection divided by the size of the union.
    intersection_cardinality = len(set(x) & set(y))
    union_cardinality = len(set(x) | set(y))
    return intersection_cardinality / float(union_cardinality)

print(jaccard_similarity([0, 1, 2, 5, 6], [0, 2, 3, 5, 7, 9]))

Script Output:

0.375
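
The complementary measure, Jaccard distance = 1 – Jaccard similarity (1 – 0.375 = 0.625 for the example above), is handy when a distance rather than a similarity is required.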

(Source: http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/)
