[SimHash] Find the percentage of similarity between two pieces of data

The SimHash algorithm was introduced by Charikar and is patented by Google.

SimHash has 5 steps: tokenize, hash, weigh values, merge, and dimensionality reduction.

  • tokenize
    • Tokenize your data and assign a weight to each token. The tokenize function and the weights depend on your business needs.
  • hash (e.g. MD5, SHA-1)
    • Calculate each token's hash value and convert it to binary, e.g. (101011).
  • weigh values
    • For each hash value, apply the token's weight w bit by bit: a 1 bit becomes +w and a 0 bit becomes -w, so (101011) -> (w, -w, w, -w, w, w).
  • merge
    • Add up the tokens' weighted values position by position to merge them into one vector V. For example, merging (4 -4 -4 4 -4 4) and (5 -5 5 -5 5 5) gives (4+5 -4-5 -4+5 4-5 -4+5 4+5), which is (9 -9 1 -1 1 9).
  • Dimensionality Reduction
    • Finally, the signs of the elements of V give the bits of the final fingerprint: a positive element becomes 1 and a negative element becomes 0. For example, (9 -9 1 -1 1 9) -> (1 0 1 0 1 1), so the fingerprint is 101011.
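The five steps above can be sketched in Python as follows. This is only a minimal illustration under some assumptions: the names `tokenize`, `token_hash`, `merge_hashes` and `simhash` are made up for this sketch, whitespace splitting with term counts as weights is just one possible tokenizer, MD5 is truncated to the requested number of bits, and a sum of exactly 0 is mapped to a 0 bit (the steps above do not say how to treat that case).

```python
import hashlib
from collections import Counter
from typing import Iterable, Mapping, Tuple


def tokenize(text: str) -> Mapping[str, float]:
    """Assumed tokenizer: lowercase words, weighted by how often they occur."""
    return dict(Counter(text.lower().split()))


def token_hash(token: str, bits: int = 64) -> int:
    """Hash a token with MD5 and keep only the low `bits` bits."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def merge_hashes(pairs: Iterable[Tuple[int, float]], bits: int = 64) -> int:
    """Weigh, merge and reduce a sequence of (hash, weight) pairs."""
    v = [0.0] * bits
    for h, w in pairs:
        for i in range(bits):
            # Weigh values: a 1 bit contributes +w, a 0 bit contributes -w.
            v[i] += w if (h >> i) & 1 else -w
    fingerprint = 0
    for i in range(bits):
        # Dimensionality reduction: keep only the sign of each element.
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def simhash(weighted_tokens: Mapping[str, float], bits: int = 64) -> int:
    """Compute a fingerprint from a token -> weight mapping."""
    pairs = ((token_hash(t, bits), w) for t, w in weighted_tokens.items())
    return merge_hashes(pairs, bits)


# The worked example from the list: hashes 100101 (w=4) and 101011 (w=5).
example = merge_hashes([(0b100101, 4), (0b101011, 5)], bits=6)
print(format(example, "06b"))  # 101011
```

Calling `simhash(tokenize("some text"))` returns a 64-bit integer fingerprint; in practice the tokenizer and weights would be replaced by whatever suits your data.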

How to use SimHash fingerprints?

The Hamming distance can be used to measure the similarity between two pieces of data: calculate the Hamming distance between their two fingerprints.

In my experience, for 64-bit SimHash values with carefully chosen weights, the distances between similar data differ appreciably in magnitude from the distances between dissimilar data.

How to calculate it: XOR the two fingerprints. A result bit is 1 only where the two corresponding bits differ, and 0 otherwise. The number of 1 bits in the XOR of the two binary values is the Hamming distance.
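The XOR-and-count calculation can be sketched in a few lines of Python. The function names are made up for this illustration, and the conversion to a percentage (the share of matching bits) is an assumption: it is one simple way to turn a distance into a similarity figure, not something defined by the algorithm itself.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    # XOR leaves a 1 exactly where the bits differ; count those 1s.
    return bin(a ^ b).count("1")


def similarity_percent(a: int, b: int, bits: int = 64) -> float:
    """Assumed similarity measure: percentage of matching bits."""
    return 100.0 * (1 - hamming_distance(a, b) / bits)


# 101011 XOR 100101 = 001110, which has three 1 bits.
print(hamming_distance(0b101011, 0b100101))  # 3
```

For two 64-bit fingerprints `fp1` and `fp2`, `similarity_percent(fp1, fp2)` gives a value between 0 and 100; a useful cut-off for "similar" has to be found by experiment on your own data.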

simhash 0.1.0 : Python Package Index

Time: 2024-11-14 12:15:24
