Lucene - Custom Sorting with CustomScoreQuery

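Some scenarios need custom sorting (neither sorting on a single-valued field nor sorting by text relevance). Besides overriding the collector or the weight yourself, you can use CustomScoreQuery.

Scenario: sort by the number of tags in the tag field (the more tags the field contains, the higher the score).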

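import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class CustomScoreTest {
    public static void main(String[] args) throws IOException {
        Directory dir = new RAMDirectory();
        Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_4_9);
        IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_4_9, analyzer);
        IndexWriter writer = new IndexWriter(dir, conf);
        // Indexed, stored, and with term vectors: the scorer reads the term vector later.
        FieldType type1 = new FieldType();
        type1.setIndexed(true);
        type1.setStored(true);
        type1.setStoreTermVectors(true);
        // Document 0: three tags
        Document doc1 = new Document();
        Field field1 = new Field("f1", "fox", type1);
        doc1.add(field1);
        Field field2 = new Field("tag", "fox1 fox2 fox3 ", type1);
        doc1.add(field2);
        writer.addDocument(doc1);
        // Document 1: one tag (the Field instances are reused with new values)
        field1.setStringValue("fox");
        field2.setStringValue("fox1");
        doc1 = new Document();
        doc1.add(field1);
        doc1.add(field2);
        writer.addDocument(doc1);
        // Document 2: four tags
        field1.setStringValue("fox");
        field2.setStringValue("fox1 fox2 fox3 fox4");
        doc1 = new Document();
        doc1.add(field1);
        doc1.add(field2);
        writer.addDocument(doc1);
        // Make the documents visible to readers, then release the writer.
        writer.commit();
        writer.close();
        // Search
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        Query query = new MatchAllDocsQuery();
        // Wrapped query that scores by tag count (CountingQuery is defined further down).
        CountingQuery customQuery = new CountingQuery(query);
        int n = 10;
        // Baseline: this searches with the plain query, so hits come back in index order.
        // Pass customQuery instead of query to rank by tag count.
        TopDocs tds = searcher.search(query, n);
        ScoreDoc[] sds = tds.scoreDocs;
        for (ScoreDoc sd : sds) {
            System.out.println(searcher.doc(sd.doc));
        }
    }
}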

Test output (plain query):

Document<stored,indexed,tokenized,termVector<f1:fox> stored,indexed,tokenized,termVector<tag:fox1 fox2 fox3 >>
Document<stored,indexed,tokenized,termVector<f1:fox> stored,indexed,tokenized,termVector<tag:fox1>>
Document<stored,indexed,tokenized,termVector<f1:fox> stored,indexed,tokenized,termVector<tag:fox1 fox2 fox3 fox4>>

Custom scoring:

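import java.io.IOException;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.queries.CustomScoreProvider;
import org.apache.lucene.queries.CustomScoreQuery;
import org.apache.lucene.search.Query;

// Each public class below goes in its own .java file.
public class CountingQuery extends CustomScoreQuery {

    public CountingQuery(Query subQuery) {
        super(subQuery);
    }

    // Called once per index segment; returns the object that rescores that segment's hits.
    @Override
    protected CustomScoreProvider getCustomScoreProvider(AtomicReaderContext context) throws IOException {
        return new CountingQueryScoreProvider(context, "tag");
    }
}
public class CountingQueryScoreProvider extends CustomScoreProvider {

    String field;

    // Note: this constructor leaves field unset (null).
    public CountingQueryScoreProvider(AtomicReaderContext context) {
        super(context);
    }

    public CountingQueryScoreProvider(AtomicReaderContext context, String field) {
        super(context);
        this.field = field;
    }

    // Ignores the sub-query score and returns the number of terms in the field's term vector.
    @Override
    public float customScore(int doc, float subQueryScore, float valSrcScores[]) throws IOException {
        IndexReader r = context.reader();
        Terms tv = r.getTermVector(doc, field);
        TermsEnum termsEnum = null;
        int numTerms = 0;
        // tv is null when the document has no term vector for this field; score it 0.
        if (tv != null) {
            termsEnum = tv.iterator(termsEnum);
            while ((termsEnum.next()) != null) {
                numTerms++;
            }
        }
        return (float) (numTerms);
    }

}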

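Usage: wrap the original query and pass the wrapped query to the searcher:

CountingQuery customQuery = new CountingQuery(query);
TopDocs tds = searcher.search(customQuery, n);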

Test output with the custom query:

Document<stored,indexed,tokenized,termVector<f1:fox> stored,indexed,tokenized,termVector<tag:fox1 fox2 fox3 fox4>>
Document<stored,indexed,tokenized,termVector<f1:fox> stored,indexed,tokenized,termVector<tag:fox1 fox2 fox3 >>
Document<stored,indexed,tokenized,termVector<f1:fox> stored,indexed,tokenized,termVector<tag:fox1>>
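The ordering above can be sanity-checked without Lucene. The following is a minimal plain-Java sketch, not Lucene code: the class name TagCountRanking and its methods are hypothetical, and it assumes tags are split on whitespace, as WhitespaceAnalyzer would do for these sample values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java model of the ranking; no Lucene classes are used here.
class TagCountRanking {

    // Counts whitespace-separated tags (the sample values contain no repeated tags).
    static int countTags(String tagField) {
        String trimmed = tagField.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Returns document ids ordered by tag count, highest first; ties keep id order.
    static List<Integer> rank(List<String> tagFields) {
        List<Integer> docIds = new ArrayList<>();
        for (int i = 0; i < tagFields.size(); i++) {
            docIds.add(i);
        }
        // List.sort is stable, so equal counts stay in ascending id order.
        docIds.sort(Comparator.comparingInt((Integer id) -> countTags(tagFields.get(id))).reversed());
        return docIds;
    }

    public static void main(String[] args) {
        // Same tag values as the three indexed documents, in index order.
        List<String> tags = Arrays.asList("fox1 fox2 fox3 ", "fox1", "fox1 fox2 fox3 fox4");
        System.out.println(rank(tags)); // prints [2, 0, 1]
    }
}
```

Document 2 (four tags) comes first, then document 0 (three tags), then document 1 (one tag), which matches the output printed by the custom query.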

//-----------------------

weight/score/similarity

collector

Main reference

http://opensourceconnections.com/blog/2014/03/12/using-customscorequery-for-custom-solrlucene-scoring/

Snapshot:

One item stands out on that list as a little low-level but not quite as bad as building a custom Lucene query: CustomScoreQuery. When you implement your own Lucene query, you’re taking control of two things:

Matching – what documents should be included in the search results
Scoring – what score should be assigned to a document (and therefore what order should they appear in)
Frequently you’ll find that existing Lucene queries will do fine with matching but you’d like to take control of just the scoring/ordering. That’s what CustomScoreQuery gives you – the ability to wrap another Lucene Query and rescore it.
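The split between matching and scoring can be illustrated with a small plain-Java sketch. This is only a model of the idea, not Lucene's API: RescoreSketch, its search method, and the sample posts are all hypothetical, and the "original" scorer is an arbitrary stand-in for a relevance score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.ToDoubleFunction;

// Hypothetical model: "matching" is a predicate, "scoring" is a function.
class RescoreSketch {

    // Keeps the documents accepted by the matcher, then orders them by the scorer, highest first.
    static List<String> search(List<String> docs, Predicate<String> matcher, ToDoubleFunction<String> scorer) {
        List<String> hits = new ArrayList<>();
        for (String doc : docs) {
            if (matcher.test(doc)) {
                hits.add(doc);
            }
        }
        hits.sort(Comparator.comparingDouble(scorer).reversed());
        return hits;
    }

    public static void main(String[] args) {
        // Each string is the tag field of one post.
        List<String> posts = Arrays.asList("star-trek", "star-wars jedi", "star-trek klingon warp");
        Predicate<String> taggedStarTrek = tags -> Arrays.asList(tags.split("\\s+")).contains("star-trek");
        // Stand-in for the wrapped query's own score: fewer tags scores higher.
        ToDoubleFunction<String> original = tags -> 1.0 / tags.split("\\s+").length;
        // Replacement score: more tags scores higher.
        ToDoubleFunction<String> byTagCount = tags -> tags.split("\\s+").length;

        System.out.println(search(posts, taggedStarTrek, original));   // [star-trek, star-trek klingon warp]
        System.out.println(search(posts, taggedStarTrek, byTagCount)); // [star-trek klingon warp, star-trek]
    }
}
```

The same two posts match in both calls; only the scorer changes, and with it the order. That is the role CustomScoreQuery plays around an existing query.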

For example, let’s say you’re searching our favorite dataset: SciFi Stackexchange, a Q&A site dedicated to nerdy SciFi and Fantasy questions. The posts on the site are tagged by topic: “star-trek”, “star-wars”, etc. Let’s say, for whatever reason, we want to search for a tag and order the results by the number of tags, so that questions with the most tags are sorted to the top.

In this example, a simple TermQuery could be sufficient for matching. To identify the questions tagged Star Trek with Lucene, you’d simply run the following query:

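Term termToSearch = new Term("tag", "star-trek");
TermQuery starTrekQ = new TermQuery(termToSearch);
// The second argument is the number of hits to return; 10 is an arbitrary choice.
TopDocs hits = searcher.search(starTrekQ, 10);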

If we examined the order of the results of this search, they’d come back in default TF-IDF order.

With CustomScoreQuery, we can intercept the matching query and assign a new score to each match, thus altering the order.

Step 1 Override CustomScoreQuery To Create Our Own Custom Scored Query Class:

(note this code can be found in this github repo)

public class CountingQuery extends CustomScoreQuery {

    public CountingQuery(Query subQuery) {
        super(subQuery);
    }

    protected CustomScoreProvider getCustomScoreProvider(
            AtomicReaderContext context) throws IOException {
        return new CountingQueryScoreProvider("tag", context);
    }
}

Notice the code for “getCustomScoreProvider”: this is where we’ll return an object that will provide the magic we need. It takes an AtomicReaderContext, which is a wrapper on an IndexReader. If you recall, this hooks us into all the data structures available for scoring a document: Lucene’s inverted index, term vectors, etc.

Step 2 Create CustomScoreProvider

The real magic happens in CustomScoreProvider. This is where we’ll rescore the document. I’ll show you a boilerplate implementation before we dig in:

public class CountingQueryScoreProvider extends CustomScoreProvider {

    String _field;

    public CountingQueryScoreProvider(String field, AtomicReaderContext context) {
        super(context);
        _field = field;
    }

    public float customScore(int doc, float subQueryScore, float valSrcScores[]) throws IOException {
        return (float)(1.0f);
    }
}

This CustomScoreProvider rescores all documents by returning a 1.0 score for them, thus negating their default relevancy sort order.
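To see why a constant score discards the relevancy order, here is a minimal plain-Java sketch. It is a model only: the Hit class is a hypothetical stand-in (not Lucene's ScoreDoc), the scores are made up for illustration, and it assumes ties are broken by ascending document id.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ConstantScoreTies {

    // A (docId, score) pair standing in for a search hit; hypothetical, not a Lucene class.
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // Orders hits by score descending; equal scores fall back to ascending doc id (assumed tie-break).
    static List<Integer> order(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed()
                .thenComparingInt(h -> h.docId));
        List<Integer> ids = new ArrayList<>();
        for (Hit h : sorted) {
            ids.add(h.docId);
        }
        return ids;
    }

    public static void main(String[] args) {
        // Made-up relevance scores: document 1 is the best match.
        List<Hit> relevance = Arrays.asList(new Hit(0, 0.4f), new Hit(1, 0.9f), new Hit(2, 0.6f));
        // The same hits after rescoring everything to 1.0.
        List<Hit> constant = Arrays.asList(new Hit(0, 1.0f), new Hit(1, 1.0f), new Hit(2, 1.0f));

        System.out.println(order(relevance)); // prints [1, 2, 0]
        System.out.println(order(constant));  // prints [0, 1, 2]
    }
}
```

With every score equal, only the tie-break decides the order, so the original ranking is lost.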

Step 3 Implement Rescoring

With TermVectors on for our field, we can simply loop through and count the tokens in the field:

public float customScore(int doc, float subQueryScore, float valSrcScores[]) throws IOException
{
    IndexReader r = context.reader();
    Terms tv = r.getTermVector(doc, _field);
    int numTerms = 0;
    // getTermVector returns null when the document has no term vector for this field.
    if (tv != null) {
        TermsEnum termsEnum = null;
        termsEnum = tv.iterator(termsEnum);
        while((termsEnum.next()) != null) {
            numTerms++;
        }
    }
    return (float)(numTerms);
}
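One detail worth keeping in mind: a term vector lists each distinct term of the field once, so the loop above counts unique terms rather than raw tokens. The plain-Java sketch below models that behavior under the assumption of whitespace tokenization; DistinctTermCount and its methods are hypothetical names, and a sorted set stands in for the term vector.

```java
import java.util.SortedSet;
import java.util.TreeSet;

class DistinctTermCount {

    // Models a term vector as the sorted set of unique terms in the field.
    static SortedSet<String> uniqueTerms(String fieldValue) {
        SortedSet<String> terms = new TreeSet<>();
        for (String token : fieldValue.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Counts by walking the terms one at a time, like the while loop over termsEnum.next().
    static int countTerms(String fieldValue) {
        int numTerms = 0;
        for (String ignored : uniqueTerms(fieldValue)) {
            numTerms++;
        }
        return numTerms;
    }

    public static void main(String[] args) {
        System.out.println(countTerms("star-trek klingon warp"));      // prints 3
        System.out.println(countTerms("star-trek klingon star-trek")); // prints 2: the repeated tag counts once
    }
}
```

For a tag field this usually makes no difference, since a post rarely carries the same tag twice, but it matters if the same approach is applied to free text.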

And there you have it, we’ve overridden the score of another query! If you’d like to see a full example, see my “lucene-query-example” repository that has this as well as my custom Lucene query examples.

CustomScoreQuery Vs A Full-Blown Custom Query

Creating a CustomScoreQuery is a much easier thing to do than implementing a complete query. There are A LOT of ins-and-outs for implementing a full-blown Lucene query. So when creating a custom matching behavior isn’t important and you’re only rescoring another Lucene query, CustomScoreQuery is a clear winner. Considering how frequently Lucene based technologies are used for “fuzzy” analytics, I can see using CustomScoreQuery a lot when the regular tricks don’t pan out.

Time: 2024-08-03 21:58:13
