Distributed Sentence Similarity Based on Word Mover's Distance

Algorithm:

Based on the ICML 2015 paper that introduced the Word Mover's Distance.

1. First, use Google's word2vec tool to obtain distributed word representations, i.e. word vectors.

2. Then use the earth mover's distance (EMD) between the two sentences' word vectors as the similarity measure.

3. Solve the EMD problem as a transportation problem using the Hungarian algorithm.
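
The three steps can be illustrated with a minimal sketch. It is not the project's implementation and rests on several assumptions: the 2-D vectors in TOY_VECTORS are made up (real word2vec vectors are learned and much larger), both sentences have the same number of tokens, and every token carries equal mass 1/n. Under those assumptions the transportation problem reduces to an assignment problem, which is the kind of problem the Hungarian algorithm solves. To stay dependency-free, the sketch replaces the Hungarian algorithm with a brute-force search over permutations, which gives the same optimum but is only practical for very short sentences. The function name wmd_equal_length is hypothetical.

```python
import itertools
import math

# Hypothetical 2-D "word vectors" for illustration only; real word2vec
# vectors are learned from a corpus and have hundreds of dimensions.
TOY_VECTORS = {
    "obama": (1.0, 0.0),
    "president": (1.0, 0.2),
    "speaks": (0.0, 1.0),
    "greets": (0.1, 1.0),
    "media": (2.0, 2.0),
    "press": (2.0, 2.1),
}


def euclidean(u, v):
    """Euclidean distance between two vectors of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def wmd_equal_length(sent_a, sent_b, vectors=TOY_VECTORS):
    """Word Mover's Distance for two token lists of equal length n.

    Each token has mass 1/n, so the optimal transport plan is a
    one-to-one matching of tokens (an assignment problem).
    """
    n = len(sent_a)
    if n == 0 or n != len(sent_b):
        raise ValueError("sentences must be non-empty and of equal length")

    # Cost matrix: distance between every word in A and every word in B.
    cost = [[euclidean(vectors[a], vectors[b]) for b in sent_b] for a in sent_a]

    # Brute-force stand-in for the Hungarian algorithm: try every matching
    # and keep the cheapest one. This is O(n!) and only suits tiny inputs.
    best = min(
        sum(cost[i][perm[i]] for i in range(n))
        for perm in itertools.permutations(range(n))
    )
    return best / n  # each matched pair moves mass 1/n
```

For example, wmd_equal_length(["obama", "speaks", "media"], ["president", "greets", "press"]) matches each word with its nearest counterpart, and the result does not depend on word order. Sentences of different lengths, or words with unequal weights, need a general transportation (linear programming) solver rather than a pure assignment solver.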



Outcome:

The results look reasonable, but there is still room to improve precision.

For example, n-grams could be used to preserve some of the sentence structure.
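
One possible reading of the n-gram idea, sketched under assumptions: instead of moving single words, the distance could be computed over contiguous n-grams, with each n-gram represented by the concatenation of its word vectors so that word order inside the n-gram is kept. The helper names ngrams and ngram_vector are hypothetical, and concatenation is only one of several ways to build an n-gram vector.

```python
def ngrams(tokens, n=2):
    """Return the contiguous n-grams of a token list as tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_vector(gram, vectors):
    """Concatenate the word vectors of one n-gram, in order.

    Concatenation (an assumption of this sketch) keeps word order inside
    the n-gram: ("a", "b") and ("b", "a") get different vectors.
    """
    return tuple(x for word in gram for x in vectors[word])
```

The resulting n-gram vectors could then take the place of single word vectors when building the cost matrix for the transportation problem.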

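Time: 2024-08-06 11:57:19

Articles related to Distributed Sentence Similarity Based on Word Mover's Distance

How to Compute the Word Mover's Distance with SAS

The Word Mover's Distance (WMD) is a distance metric for measuring the dissimilarity between two documents. Its application to text analysis was introduced in 2015 by a research group at Washington University in St. Louis. The group's paper, "From Word Embeddings To Document Distances", was published at the 32nd International Conference on Machine Learning (ICML). In it, they showed that the WMD metric leads to unprecedentedly low k-nearest-neighbor document classification error rates on eight real-world document classification datasets. They used word embeddings and WMD to classify documents; the biggest advantage of this approach over traditional methods is that it incorporates the semantic similarity between individual word pairs (for example, President and Obama) into the document distance metric. With traditional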

Comparing Sentence Similarity Methods

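Reference: Comparing Sentence Similarity Methods, Zhihu. Original article: https://www.cnblogs.com/niuxichuan/p/9463629.html

Earth Mover's Distance

The Earth Mover's Distance (EMD) was first proposed by Y. Rubner in the 1998 paper "A Metric for Distributions with Applications to Image Databases". It is the normalized minimum cost of transforming one distribution into another, so it can be used to characterize the distance between two distributions. An image, for example, can be viewed as consisting of three components (hue, saturation, and brightness), and the histogram of each component is a distribution. Different images have different histograms, so the distance between images can be expressed as the distance between their histograms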

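EMD (Earth Mover's Distance) for Image Retrieval

To understand the EMD model, it helps to first review the transportation problem from operations research. A few small transportation-problem examples are given below as background. The problem above has supply = demand = 500, i.e. it is a balanced transportation problem, and it can be modeled as follows. Let $x_{ij}$ be the number of items shipped from source $i$ to destination $j$, let $c_{ij}$ be the cost of shipping a single item (given in the table above), let $a_i$ be the supply at source $i$, and let $b_j$ be the demand at destination $j$. The total shipping cost is then $\sum_i \sum_j c_{ij} x_{ij}$, and the model that minimizes it is: $\min \sum_i \sum_j c_{ij} x_{ij}$ subject to $\sum_j x_{ij} = a_i$, $\sum_i x_{ij} = b_j$, $x_{ij} \ge 0$. The other two possibilities are supply > demand and supply < demand; their models are not discussed here. All three kinds of transportation problem can be solved with the simplex method. Because only when "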

[Repost][Translation] A Metric: Earth Mover's Distance (EMD)

The following was translated by Luo Fangwei (罗方炜): Earth mover's distance. In computer science, the earth mover's distance (EMD) is a measure of the distance between two probability distributions over a region D. In mathematics, this is known as the Wasserstein metric. Informally, if the di

[LeetCode] Sentence Similarity

Given two sentences words1, words2 (each represented as an array of strings), and a list of similar word pairs pairs, determine if two sentences are similar. For example, "great acting skills" and "fine drama talent" are similar, if th

[LeetCode] Sentence Similarity II

Given two sentences words1, words2 (each represented as an array of strings), and a list of similar word pairs pairs, determine if two sentences are similar. For example, words1 = ["great", "acting", "skills"] and words2 = [

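Methods for Computing Sentence Similarity

Method 1: unsupervised, using no additional labeled data. Average word vectors: simply average all the word vectors in a sentence; this is simple and effective. Drawbacks: it ignores word order, works well only for short sentences of up to 15 words, discards the relationships between words, and cannot express finer-grained relationships between sentences. TF-IDF-weighted word vectors: sum all the word vectors in a sentence, weighted by their TF-IDF weights. This is a common way to compute a sentence embedding and performs well on some problems. Compared with simply averaging all the word vectors, it takes into account tf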