Solr Similarity Algorithms, Part 3: An Introduction to the DFRSimilarity Framework

Source: http://terrier.org/docs/v3.5/dfr_description.html

The Divergence from Randomness (DFR) paradigm is a generalisation of one of the very first models of Information Retrieval, Harter's 2-Poisson indexing model [1]. The 2-Poisson model is based on the hypothesis that the level of treatment of the informative words is witnessed by an elite set of documents, in which these words occur to a relatively greater extent than in the rest of the documents.

On the other hand, there are words which do not possess elite documents, and thus their frequency follows a random distribution, which is the single Poisson model. Harter's model was first explored as a retrieval model by Robertson, Van Rijsbergen and Porter [4]. It was subsequently combined with the standard probabilistic model by Robertson and Walker [3] and gave birth to the family of BM IR models (among them the well-known BM25, which is at the basis of the Okapi system).

DFR models are obtained by instantiating the three components of the framework: selecting a basic randomness model, applying the first normalisation, and normalising the term frequencies.
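
As a concrete illustration, the sketch below shows how these three components surface in Lucene's DFRSimilarity, which backs Solr's DFRSimilarityFactory. The class and constructor names are assumed from the Lucene 4.x+ API and may differ in your version; this is an illustrative sketch, not the definitive configuration.

```java
// Sketch: combining the three DFR components (basic model, after-effect,
// term-frequency normalisation) into one Lucene similarity. Assumed API;
// check the javadoc of your Lucene/Solr release.
import org.apache.lucene.search.similarities.AfterEffectL;
import org.apache.lucene.search.similarities.BasicModelIn;
import org.apache.lucene.search.similarities.DFRSimilarity;
import org.apache.lucene.search.similarities.NormalizationH2;
import org.apache.lucene.search.similarities.Similarity;

public class DfrComponentsExample {
    public static Similarity buildInL2() {
        return new DFRSimilarity(
                new BasicModelIn(),        // basic randomness model: I(n)
                new AfterEffectL(),        // first normalisation: Laplace (L)
                new NormalizationH2(1.0f)  // term-frequency normalisation 2, c = 1
        );
    }
}
```

This particular combination corresponds to the InL2 model listed later in this document.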

Basic Randomness Models

The DFR models are based on this simple idea: "The more the divergence of the within-document term-frequency from its frequency within the collection, the more the information carried by the word t in the document d". In other words the term-weight is inversely related to the probability of term-frequency within the document d obtained by a model M of randomness:

$$ \mathrm{weight}(t \mid d) \;\propto\; -\log_2 \mathrm{Prob}_M(t \in d \mid \text{Collection}) \tag{1} $$

where the subscript M stands for the type of model of randomness employed to compute the probability. In order to choose the appropriate model M of randomness, we can use different urn models. IR is thus seen as a probabilistic process, which uses random drawings from urn models, or equivalently random placement of coloured balls into urns. Instead of urns we have documents, and instead of different colours we have different terms, where each term occurs with some multiplicity in the urns as any one of a number of related words or phrases which are called tokens of that term. There are many ways to choose M, each of which provides a basic DFR model. The basic models are derived in the following table.

Basic DFR Models
D Divergence approximation of the binomial
P Approximation of the binomial
BE Bose-Einstein distribution
G Geometric approximation of the Bose-Einstein
I(n) Inverse Document Frequency model
I(F) Inverse Term Frequency model
I(ne) Inverse Expected Document Frequency model

If the model M is the binomial distribution, then the basic model is P and computes the value:

$$ -\log_2 \mathrm{Prob}(tf) \;=\; -\log_2 \left[ \binom{TF}{tf} \, p^{tf} \, q^{TF - tf} \right] \tag{2} $$

where:

  • TF is the term-frequency of the term t in the Collection
  • tf is the term-frequency of the term t in the document d
  • N is the number of documents in the Collection
  • p is 1/N and q=1-p
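
For readers who want to see the arithmetic, here is a small numeric sketch of Formula (2) using the quantities defined above. The helper class and method names are hypothetical, not part of Terrier or Lucene.

```java
// Sketch of the binomial basic model P (Formula 2): -log2 of the binomial
// probability of seeing tf occurrences out of TF trials with p = 1/N.
public class BasicModelPSketch {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // log2 of the binomial coefficient C(n, k), via log-factorials.
    static double log2Binomial(int n, int k) {
        return log2Factorial(n) - log2Factorial(k) - log2Factorial(n - k);
    }

    static double log2Factorial(int n) {
        double sum = 0;
        for (int i = 2; i <= n; i++) sum += log2(i);
        return sum;
    }

    // -log2 Prob(tf) = -log2 [ C(TF, tf) * p^tf * q^(TF - tf) ]
    static double informativeContent(int tf, int TF, int N) {
        double p = 1.0 / N, q = 1.0 - p;
        return -(log2Binomial(TF, tf) + tf * log2(p) + (TF - tf) * log2(q));
    }
}
```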

Similarly, if the model M is the geometric distribution, then the basic model is G and computes the value:

$$ -\log_2 \mathrm{Prob}(tf) \;=\; -\log_2 \left[ \left( \frac{1}{1+\lambda} \right) \cdot \left( \frac{\lambda}{1+\lambda} \right)^{tf} \right] \tag{3} $$

where λ = F/N and F is the term-frequency of the term t in the Collection.
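
The same kind of sketch for the geometric model G of Formula (3) is shorter, since the distribution depends only on the mean term frequency λ (hypothetical helper code, not taken from Terrier or Lucene):

```java
// Sketch of the geometric basic model G (Formula 3): -log2 of the geometric
// (Bose-Einstein approximating) probability of tf occurrences, lambda = F / N.
public class BasicModelGSketch {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    static double informativeContent(double tf, double F, double N) {
        double lambda = F / N;
        return -(log2(1.0 / (1.0 + lambda)) + tf * log2(lambda / (1.0 + lambda)));
    }
}
```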

First Normalisation

When a rare term does not occur in a document, it has almost zero probability of being informative for the document. On the contrary, if a rare term has many occurrences in a document, it has a very high probability (almost the certainty) of being informative for the topic described by the document. Similarly to Ponte and Croft's [2] language model, we include a risk component in the DFR models. If the term-frequency in the document is high, then the risk for the term of not being informative is minimal. In such a case Formula (1) gives a high value, but a minimal risk also has the negative effect of providing a small information gain. Therefore, instead of using the full weight provided by Formula (1), we tune or smooth the weight of Formula (1) by considering only the portion of it which is the amount of information gained with the term:

$$ \mathrm{gain}(t \mid d) \;=\; P_{\mathrm{risk}} \cdot \left( -\log_2 \mathrm{Prob}_M(t \in d \mid \text{Collection}) \right) \tag{4} $$

The more the term occurs in the elite set, the less term-frequency is due to randomness, and thus the smaller the probability Prisk is, that is:

(5)

We use two models for computing the information-gain with a term within a document: the Laplace model L and the ratio of two Bernoulli processes B:

$$ L:\; P_{\mathrm{risk}} = \frac{1}{tf + 1} \qquad\qquad B:\; P_{\mathrm{risk}} = \frac{F + 1}{df \cdot (tf + 1)} \tag{6} $$

where df is the number of documents containing the term.
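
Both after-effects of Formula (6) are one-liners; the following sketch (hypothetical helper code) makes the computation explicit:

```java
// Sketch of the two first-normalisation (after-effect) models of Formula (6).
public class AfterEffectSketch {
    // Laplace model L: Prisk = 1 / (tf + 1)
    static double laplace(double tf) {
        return 1.0 / (tf + 1.0);
    }

    // Bernoulli-ratio model B: Prisk = (F + 1) / (df * (tf + 1)), where F is
    // the term frequency in the collection and df the number of documents
    // containing the term.
    static double bernoulliRatio(double tf, double F, double df) {
        return (F + 1.0) / (df * (tf + 1.0));
    }
}
```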

Term Frequency Normalisation

Before using Formula (4) the document-length dl is normalised to a standard length sl. Consequently, the term-frequencies tf are also recomputed with respect to the standard document-length, that is:

$$ tfn \;=\; tf \cdot \frac{sl}{dl} \tag{7} $$

A more flexible formula, referred to as Normalisation2, is given below:

$$ tfn \;=\; tf \cdot \log_2\!\left(1 + c \cdot \frac{sl}{dl}\right) \tag{8} $$

where c is a tunable parameter.
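
A sketch of both normalisations side by side (hypothetical helper code):

```java
// Sketch of the two term-frequency normalisations: Formula (7) rescales tf
// linearly to the standard length sl; Formula (8), Normalisation 2, rescales
// it logarithmically with a tunable parameter c.
public class TfNormalisationSketch {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // Formula (7): tfn = tf * sl / dl
    static double normalisation1(double tf, double dl, double sl) {
        return tf * sl / dl;
    }

    // Formula (8): tfn = tf * log2(1 + c * sl / dl)
    static double normalisation2(double tf, double dl, double sl, double c) {
        return tf * log2(1.0 + c * sl / dl);
    }
}
```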

DFR Models are finally obtained from the generating Formula (4), using a basic DFR model (such as Formulae (2) or (3)) in combination with a model of information-gain (such as Formula 6) and normalising the term-frequency (such as in Formula (7) or Formula (8)).
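
Putting the pieces together, the following sketch assembles one complete DFR weight in the shape of Formula (4), using the geometric basic model, the Laplace after-effect and Normalisation 2. It is purely illustrative; Terrier's and Lucene's actual implementations differ in details such as the choice of logarithm and boundary handling.

```java
// Sketch: one full DFR weight assembled as in Formula (4).
public class DfrWeightSketch {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    static double weight(double tf, double dl, double sl, double c,
                         double F, double N) {
        double tfn = tf * log2(1.0 + c * sl / dl);                // Formula (8)
        double lambda = F / N;
        double inf1 = -(log2(1.0 / (1.0 + lambda))
                + tfn * log2(lambda / (1.0 + lambda)));           // Formula (3)
        double prisk = 1.0 / (tfn + 1.0);                         // Formula (6), model L
        return prisk * inf1;                                      // Formula (4)
    }
}
```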

DFR Models in Terrier

Included with Terrier are many of the DFR models, including:

Model Description
BB2 Bernoulli-Einstein model with Bernoulli after-effect and normalisation 2.
IFB2 Inverse Term Frequency model with Bernoulli after-effect and normalisation 2.
In_expB2 Inverse Expected Document Frequency model with Bernoulli after-effect and normalisation 2. The logarithms are base 2. This model can be used for classic ad-hoc tasks.
In_expC2 Inverse Expected Document Frequency model with Bernoulli after-effect and normalisation 2. The logarithms are base e. This model can be used for classic ad-hoc tasks.
InL2 Inverse Document Frequency model with Laplace after-effect and normalisation 2. This model can be used for tasks that require early precision.
PL2 Poisson model with Laplace after-effect and normalisation 2. This model can be used for tasks that require early precision [7, 8].
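
For comparison with the table above, here is a hedged sketch of selecting one of these models (In_expB2: the I(ne) basic model with Bernoulli after-effect and normalisation 2) at search time through Lucene's similarity API, which also underlies Solr's DFRSimilarityFactory. Class names are assumed from a Lucene 5.x+ release and may differ in yours.

```java
// Sketch: wiring an In_expB2-style DFR model into an IndexSearcher.
// Assumed Lucene API; consult your version's javadoc.
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.AfterEffectB;
import org.apache.lucene.search.similarities.BasicModelIne;
import org.apache.lucene.search.similarities.DFRSimilarity;
import org.apache.lucene.search.similarities.NormalizationH2;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class InExpB2SearchExample {
    public static IndexSearcher open(String indexPath) throws Exception {
        Directory dir = FSDirectory.open(Paths.get(indexPath));
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        searcher.setSimilarity(new DFRSimilarity(
                new BasicModelIne(),      // I(ne) basic model
                new AfterEffectB(),       // Bernoulli after-effect
                new NormalizationH2()));  // normalisation 2, default c
        return searcher;
    }
}
```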

Recommended settings for various collections are provided in Example TREC Experiments.

Another provided weighting model is a derivation of the BM25 formula from the Divergence From Randomness framework. Finally, Terrier also provides a generic DFR weighting model, which allows any DFR model to be generated and evaluated.

Query Expansion

The query expansion mechanism extracts the most informative terms from the top-returned documents as the expanded query terms. In this expansion process, terms in the top-returned documents are weighted using a particular DFR term weighting model. Currently, Terrier deploys the Bo1 (Bose-Einstein 1), Bo2 (Bose-Einstein 2) and KL (Kullback-Leibler) term weighting models. The DFR term weighting models follow a parameter-free approach by default.

An alternative approach is Rocchio's query expansion mechanism. A user can switch to the latter approach by setting parameter.free.expansion to false in the terrier.properties file. The default value of the parameter beta of Rocchio's approach is 0.4. To change this parameter, the user needs to specify the property rocchio_beta in the terrier.properties file.

Fields

DFR can encapsulate the importance of term occurrences in different fields in a variety of ways:

  1. Per-field normalisation: The frequencies from the different fields in the documents are normalised with respect to the statistics of lengths typical for that field. This is as performed by the PL2F weighting model. Other per-field normalisation models can be generated using the generic PerFieldNormWeightingModel model.
  2. Multinomial: The frequencies from the different fields are modelled in their divergence from the randomness expected by the term's occurrences in that field. The ML2 and MDL2 models implement this weighting.

Proximity

Proximity can be handled within DFR by considering the number of occurrences of a pair of query terms within a window of pre-defined size. In particular, the DFRDependenceScoreModifier DSM implements the pBiL and pBiL2 models, which measure the randomness compared to the document's length, rather than the statistics of the pair in the corpus.

DFR Models and Cross-Entropy

A different interpretation of the gain-risk generating Formula (4) can be explained by the notion of cross-entropy. Shannon's mathematical theory of communication in the 1940s [5] established that the minimal average code word length is about the value of the entropy of the probabilities of the source words. This result is known under the name of the Noiseless Coding Theorem. The term noiseless refers to the assumption of the theorem that there is no possibility of errors in transmitting words. Nevertheless, it may happen that different sources about the same information are available. In general each source produces a different coding. In such cases, we can compare the two sources of evidence using the cross-entropy. The cross-entropy is minimised when the two pairs of observations return the same probability density function, and in such a case it coincides with Shannon's entropy.

We possess two tests of randomness: the first test is Prisk and is relative to the term distribution within its elite set, while the second, ProbM, is relative to the document with respect to the entire collection. The first distribution can be treated as a new source of the term distribution, while the coding of the term with the term distribution within the collection can be considered as the primary source. The definition of the cross-entropy relation of these two probability distributions is:

$$ CE \;=\; -\,P_{\mathrm{risk}} \cdot \log_2 \mathrm{Prob}_M(t \in d \mid \text{Collection}) \tag{9} $$

Relation (9) is indeed Relation (4) of the DFR framework. DFR models can be equivalently defined as the divergence of two probabilities measuring the amount of randomness of two different sources of evidence.

For more details about the Divergence from Randomness framework, you may refer to the PhD thesis of Gianni Amati, or to Amati and Van Rijsbergen's paper Probabilistic models of information retrieval based on measuring divergence from randomness, TOIS 20(4):357-389, 2002.

[1] S.P. Harter. A probabilistic approach to automatic keyword indexing. PhD thesis, Graduate Library, The University of Chicago, Thesis No. T25146, 1974.
[2] J. Ponte and B. Croft. A Language Modeling Approach in Information Retrieval. In The 21st ACM SIGIR Conference on Research and Development in Information Retrieval (Melbourne, Australia, 1998), B. Croft, A. Moffat, and C.J. van Rijsbergen, Eds., pp. 275-281.
[3] S.E. Robertson and S. Walker. Some simple approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval. In Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval (Dublin, Ireland, June 1994), Springer-Verlag, pp. 232-241. 
[4] S.E. Robertson, C.J. van Rijsbergen and M. Porter. Probabilistic models of indexing and searching. In Information Retrieval Research, S.E. Robertson, C.J. van Rijsbergen and P. Williams, Eds. Butterworths, 1981, ch. 4, pp. 35-56.
[5] C. Shannon and W. Weaver. The Mathematical Theory of Communication. University of Illinois Press, Urbana, Illinois, 1949.
[6] B. He and I. Ounis. A study of parameter tuning for term frequency normalization, in Proceedings of the twelfth international conference on Information and knowledge management, New Orleans, LA, USA, 2003.
[7] B. He and I. Ounis. Term Frequency Normalisation Tuning for BM25 and DFR Model, in Proceedings of the 27th European Conference on Information Retrieval (ECIR'05), 2005.
[8] V. Plachouras and I. Ounis. Usefulness of Hyperlink Structure for Web Information Retrieval. In Proceedings of ACM SIGIR 2004.
[9] V. Plachouras, B. He and I. Ounis. University of Glasgow in TREC 2004: experiments in Web, Robust and Terabyte tracks with Terrier. In Proceedings of the 13th Text REtrieval Conference (TREC 2004), 2004.
