Solr 4.8.0 Source Code Analysis (6): Non-Sorted Queries

The previous article gave a brief overview of Solr's query flow; starting with this one we look at the details of query execution. Queries fall into two main categories, sorted and non-sorted, and the two take different code paths, so this article covers the non-sorted case first.

The query flow centers on SolrIndexSearcher.getDocListC(QueryResult qr, QueryCommand cmd). As the name suggests, this is the caching variant (the NC variant below skips the caches): it consults the queryResultCache and, based on the query command, decides whether to take the sorted or the non-sorted path.

  /**
   * getDocList version that uses+populates query and filter caches.
   * In the event of a timeout, the cache is not populated.
   */
  private void getDocListC(QueryResult qr, QueryCommand cmd) throws IOException {
    DocListAndSet out = new DocListAndSet();
    qr.setDocListAndSet(out);
    QueryResultKey key=null;
    // For a query with an offset, Solr first fetches cmd.getOffset()+cmd.getLen() doc ids and
    // then takes a subset according to the offset, so maxDocRequested is the number of docs
    // that actually need to be collected.
    int maxDocRequested = cmd.getOffset() + cmd.getLen();
    // check for overflow, and check for # docs in index
    if (maxDocRequested < 0 || maxDocRequested > maxDoc()) maxDocRequested = maxDoc(); // at most, fetch every doc id in the index
    int supersetMaxDoc= maxDocRequested;
    DocList superset = null;

    int flags = cmd.getFlags();
    Query q = cmd.getQuery();
    if (q instanceof ExtendedQuery) {
      ExtendedQuery eq = (ExtendedQuery)q;
      if (!eq.getCache()) {
        flags |= (NO_CHECK_QCACHE | NO_SET_QCACHE | NO_CHECK_FILTERCACHE);
      }
    }


    // we can try and look up the complete query in the cache.
    // we can't do that if filter!=null though (we don't want to
    // do hashCode() and equals() for a big DocSet).
    // First check the query result cache for this exact query; if it has been seen before,
    // return the cached result. Caching will be covered in a separate article.
    if (queryResultCache != null && cmd.getFilter()==null
        && (flags & (NO_CHECK_QCACHE|NO_SET_QCACHE)) != ((NO_CHECK_QCACHE|NO_SET_QCACHE)))
    {
        // all of the current flags can be reused during warming,
        // so set all of them on the cache key.
        key = new QueryResultKey(q, cmd.getFilterList(), cmd.getSort(), flags);
        if ((flags & NO_CHECK_QCACHE)==0) {
          superset = queryResultCache.get(key);

          if (superset != null) {
            // check that the cache entry has scores recorded if we need them
            if ((flags & GET_SCORES)==0 || superset.hasScores()) {
              // NOTE: subset() returns null if the DocList has fewer docs than
              // requested
              out.docList = superset.subset(cmd.getOffset(),cmd.getLen()); // on a cache hit, take the requested subset from it
            }
          }
          if (out.docList != null) {
            // found the docList in the cache... now check if we need the docset too.
            // OPT: possible future optimization - if the doclist contains all the matches,
            // use it to make the docset instead of rerunning the query.
            // Compute the docSet if it was requested and hand it to the result.
            if (out.docSet==null && ((flags & GET_DOCSET)!=0) ) {
              if (cmd.getFilterList()==null) {
                out.docSet = getDocSet(cmd.getQuery());
              } else {
                List<Query> newList = new ArrayList<>(cmd.getFilterList().size()+1);
                newList.add(cmd.getQuery());
                newList.addAll(cmd.getFilterList());
                out.docSet = getDocSet(newList);
              }
            }
            return;
          }
        }

      // If we are going to generate the result, bump up to the
      // next resultWindowSize for better caching.
      // Round supersetMaxDoc up to a multiple of queryResultWindowSize.
      if ((flags & NO_SET_QCACHE) == 0) {
        // handle 0 special case as well as avoid idiv in the common case.
        if (maxDocRequested < queryResultWindowSize) {
          supersetMaxDoc=queryResultWindowSize;
        } else {
          supersetMaxDoc = ((maxDocRequested -1)/queryResultWindowSize + 1)*queryResultWindowSize;
          if (supersetMaxDoc < 0) supersetMaxDoc=maxDocRequested;
        }
      } else {
        key = null;  // we won't be caching the result
      }
    }
    cmd.setSupersetMaxDoc(supersetMaxDoc);


    // OK, so now we need to generate an answer.
    // One way to do that would be to check if we have an unordered list
    // of results for the base query.  If so, we can apply the filters and then
    // sort by the resulting set.  This can only be used if:
    // - the sort doesn't contain score
    // - we don't want score returned.

    // check if we should try and use the filter cache
    boolean useFilterCache=false;
    if ((flags & (GET_SCORES|NO_CHECK_FILTERCACHE))==0 && useFilterForSortedQuery && cmd.getSort() != null && filterCache != null) {
      useFilterCache=true;
      SortField[] sfields = cmd.getSort().getSort();
      for (SortField sf : sfields) {
        if (sf.getType() == SortField.Type.SCORE) {
          useFilterCache=false;
          break;
        }
      }
    }

    if (useFilterCache) {
      // now actually use the filter cache.
      // for large filters that match few documents, this may be
      // slower than simply re-executing the query.
      if (out.docSet == null) {
        out.docSet = getDocSet(cmd.getQuery(),cmd.getFilter());
        DocSet bigFilt = getDocSet(cmd.getFilterList());
        if (bigFilt != null) out.docSet = out.docSet.intersection(bigFilt);
      }
      // todo: there could be a sortDocSet that could take a list of
      // the filters instead of anding them first...
      // perhaps there should be a multi-docset-iterator
      sortDocSet(qr, cmd);  // sorted query path
    } else {
      // do it the normal way...
      if ((flags & GET_DOCSET)!=0) {
        // this currently conflates returning the docset for the base query vs
        // the base query and all filters.
        DocSet qDocSet = getDocListAndSetNC(qr,cmd);
        // cache the docSet matching the query w/o filtering
        if (qDocSet!=null && filterCache!=null && !qr.isPartialResults()) filterCache.put(cmd.getQuery(),qDocSet);
      } else {
        getDocListNC(qr,cmd); // non-sorted query path, the one this article follows
      }
      assert null != out.docList : "docList is null";
    }

    if (null == cmd.getCursorMark()) {
      // Kludge...
      // we can't use DocSlice.subset, even though it should be an identity op
      // because it gets confused by situations where there are lots of matches, but
      // less docs in the slice then were requested, (due to the cursor)
      // so we have to short circuit the call.
      // None of which is really a problem since we can't use caching with
      // cursors anyway, but it still looks weird to have to special case this
      // behavior based on this condition - hence the long explanation.
      superset = out.docList; // cut the result down to the requested offset and len
      out.docList = superset.subset(cmd.getOffset(),cmd.getLen());
    } else {
      // sanity check our cursor assumptions
      assert null == superset : "cursor: superset isn't null";
      assert 0 == cmd.getOffset() : "cursor: command offset mismatch";
      assert 0 == out.docList.offset() : "cursor: docList offset mismatch";
      assert cmd.getLen() >= supersetMaxDoc : "cursor: superset len mismatch: " +
        cmd.getLen() + " vs " + supersetMaxDoc;
    }

    // lastly, put the superset in the cache if the size is less than or equal
    // to queryResultMaxDocsCached
    if (key != null && superset.size() <= queryResultMaxDocsCached && !qr.isPartialResults()) {
      queryResultCache.put(key, superset);    // cache this query's superset if it holds no more than queryResultMaxDocsCached docs
    }
  }
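
To make the supersetMaxDoc rounding above concrete, here is a small standalone sketch (illustrative only, not part of the Solr source) that reproduces the arithmetic, assuming queryResultWindowSize is 20 as in a typical example solrconfig.xml; start and rows are the request parameters that become cmd.getOffset() and cmd.getLen():

public class WindowSizeDemo {
  public static void main(String[] args) {
    int queryResultWindowSize = 20;        // assumed <queryResultWindowSize>20</queryResultWindowSize>
    int offset = 5, len = 10;              // e.g. a request with start=5&rows=10
    int maxDocRequested = offset + len;    // 15 docs are actually needed

    int supersetMaxDoc;
    if (maxDocRequested < queryResultWindowSize) {
      supersetMaxDoc = queryResultWindowSize;
    } else {
      supersetMaxDoc = ((maxDocRequested - 1) / queryResultWindowSize + 1) * queryResultWindowSize;
    }
    // 15 rounds up to 20, so the cached superset can also answer any request with offset+len <= 20
    System.out.println(supersetMaxDoc);    // prints 20
  }
}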

We now enter the non-sorted branch, getDocListNC(). Internally this method builds a collector chain and then calls Lucene's IndexSearcher.search() directly:

      // Create the TopDocsCollector; internally it allocates a HitQueue sized offset + len.
      // Every time a matching doc is collected its doc id is pushed into the HitQueue and the
      // totalHits counter is incremented; totalHits is the total number of query results.
      final TopDocsCollector topCollector = buildTopDocsCollector(len, cmd);
      Collector collector = topCollector;
      if (terminateEarly) {
        collector = new EarlyTerminatingCollector(collector, cmd.len);
      }
      if( timeAllowed > 0 ) {
        // TimeLimitingCollector works in a simple way: timing starts once matching docs begin to
        // be collected, and until timeAllowed elapses the matching doc ids are passed on to the
        // HitQueue. As soon as timeAllowed is exceeded it throws an exception and aborts the rest
        // of the search -- an important hint when tuning queries.
        collector = new TimeLimitingCollector(collector, TimeLimitingCollector.getGlobalCounter(), timeAllowed);
      }
      if (pf.postFilter != null) {
        pf.postFilter.setLastDelegate(collector);
        collector = pf.postFilter;
      }
      try {           // enter Lucene's IndexSearcher.search()
        super.search(query, luceneFilter, collector);
        if(collector instanceof DelegatingCollector) {
          ((DelegatingCollector)collector).finish();
        }
      }
      catch( TimeLimitingCollector.TimeExceededException x ) {
        log.warn( "Query: " + query + "; " + x.getMessage() );
        qr.setPartialResults(true);
      }

      totalHits = topCollector.getTotalHits();           // total number of hits
      TopDocs topDocs = topCollector.topDocs(0, len);    // drain the doc ids from the HitQueue priority queue
      populateNextCursorMarkFromTopDocs(qr, cmd, topDocs);

      maxScore = totalHits>0 ? topDocs.getMaxScore() : 0.0f;
      nDocsReturned = topDocs.scoreDocs.length;
      ids = new int[nDocsReturned];
      scores = (cmd.getFlags()&GET_SCORES)!=0 ? new float[nDocsReturned] : null;
      for (int i=0; i<nDocsReturned; i++) {
        ScoreDoc scoreDoc = topDocs.scoreDocs[i];
        ids[i] = scoreDoc.doc;
        if (scores != null) scores[i] = scoreDoc.score;
      }

TimeLimitingCollector's collect() method is where results are counted against the time budget: once timeAllowed has elapsed it immediately throws an exception and aborts the rest of the search.
  /**
   * Calls {@link Collector#collect(int)} on the decorated {@link Collector}
   * unless the allowed time has passed, in which case it throws an exception.
   *
   * @throws TimeExceededException
   *           if the time allowed has exceeded.
   */
  @Override
  public void collect(final int doc) throws IOException {
    final long time = clock.get();
    if (timeout < time) {
      if (greedy) {
        //System.out.println(this+"  greedy: before failing, collecting doc: "+(docBase + doc)+"  "+(time-t0));
        collector.collect(doc);
      }
      //System.out.println(this+"  failing on:  "+(docBase + doc)+"  "+(time-t0));
      throw new TimeExceededException( timeout-t0, time-t0, docBase + doc );
    }
    //System.out.println(this+"  collecting: "+(docBase + doc)+"  "+(time-t0));
    collector.collect(doc);
  }
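
As a side note on the timeAllowed path above: the budget comes in through the timeAllowed request parameter. A minimal SolrJ sketch of exercising it (the URL and core name are made up; adjust to your deployment):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TimeAllowedDemo {
  public static void main(String[] args) throws SolrServerException {
    // Hypothetical URL and core name; adjust to your deployment.
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrQuery query = new SolrQuery("*:*");
    query.setTimeAllowed(100);             // milliseconds; this feeds the timeAllowed branch above
    QueryResponse rsp = solr.query(query);
    // If the budget is exceeded, qr.setPartialResults(true) above shows up as
    // partialResults=true in the response header and only the docs collected in time are returned.
    System.out.println(rsp.getResults().getNumFound());
  }
}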

Next we follow the search into Lucene itself.

1. First, a Weight object is created for every clause of the query, and all of them are collected into ArrayList&lt;Weight&gt; weights. These weights capture the contribution of each clause and are used in the subsequent scoring phase (a small example of such a multi-clause query follows the constructor below).

    public BooleanWeight(IndexSearcher searcher, boolean disableCoord)
      throws IOException {
      this.similarity = searcher.getSimilarity();
      this.disableCoord = disableCoord;
      weights = new ArrayList<>(clauses.size());
      for (int i = 0 ; i < clauses.size(); i++) {
        BooleanClause c = clauses.get(i);
        Weight w = c.getQuery().createWeight(searcher);
        weights.add(w);
        if (!c.isProhibited()) {
          maxCoord++;
        }
      }
    }
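
To make the clause/Weight relationship concrete, here is a minimal sketch (not from the post) of a multi-clause query whose clauses would each get a child Weight in the constructor above; the field and term names are invented:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class BooleanQueryDemo {
  public static void main(String[] args) {
    // Two MUST clauses and one MUST_NOT clause: the BooleanWeight constructor above creates
    // three child Weights, and maxCoord ends up as 2 since the prohibited clause is skipped.
    BooleanQuery bq = new BooleanQuery();
    bq.add(new TermQuery(new Term("title", "solr")),  BooleanClause.Occur.MUST);
    bq.add(new TermQuery(new Term("body",  "query")), BooleanClause.Occur.MUST);
    bq.add(new TermQuery(new Term("body",  "sort")),  BooleanClause.Occur.MUST_NOT);
    System.out.println(bq);
  }
}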

2. Next, all segments are traversed one after another, collecting the doc ids that match the query. Each AtomicReaderContext carries its segment's details, such as the docBase and the number of docs; this information is very useful and comes in handy when implementing query optimizations (a small sketch of reading it follows the method below). Note that the collector here is the TopDocsCollector object, wrapped as shown in the code above.

  /**
   * Lower-level search API.
   *
   * <p>
   * {@link Collector#collect(int)} is called for every document. <br>
   *
   * <p>
   * NOTE: this method executes the searches on all given leaves exclusively.
   * To search across all the searchers leaves use {@link #leafContexts}.
   *
   * @param leaves
   *          the searchers leaves to execute the searches on
   * @param weight
   *          to match documents
   * @param collector
   *          to receive hits
   * @throws BooleanQuery.TooManyClauses If a query would exceed
   *         {@link BooleanQuery#getMaxClauseCount()} clauses.
   */
  protected void search(List<AtomicReaderContext> leaves, Weight weight, Collector collector)
      throws IOException {

    // TODO: should we make this
    // threaded...?  the Collector could be sync'd?
    // always use single thread:
    for (AtomicReaderContext ctx : leaves) { // search each subreader
      try {
        collector.setNextReader(ctx);
      } catch (CollectionTerminatedException e) {
        // there is no doc of interest in this reader context
        // continue with the following leaf
        continue;
      }
      BulkScorer scorer = weight.bulkScorer(ctx, !collector.acceptsDocsOutOfOrder(), ctx.reader().getLiveDocs());
      if (scorer != null) {
        try {
          scorer.score(collector);
        } catch (CollectionTerminatedException e) {
          // collection was terminated prematurely
          // continue with the following leaf
        }
      }
    }
  }
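
As noted in point 2, each leaf context carries its segment's docBase and doc counts, and a per-segment doc id plus docBase gives the global doc id (exactly what collect() does further below with doc + docBase). A minimal sketch of reading this information, assuming an already-open IndexReader:

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.IndexReader;

public class SegmentInfoDemo {
  // Prints each segment's docBase and doc counts for an already-open IndexReader.
  static void dumpLeaves(IndexReader reader) {
    for (AtomicReaderContext ctx : reader.leaves()) {
      int docBase = ctx.docBase;               // first global doc id belonging to this segment
      int numDocs = ctx.reader().numDocs();    // live (non-deleted) docs in this segment
      int maxDoc  = ctx.reader().maxDoc();     // docs including deletions
      // a hit with per-segment id 'doc' maps to global doc id 'docBase + doc'
      System.out.println("docBase=" + docBase + " numDocs=" + numDocs + " maxDoc=" + maxDoc);
    }
  }
}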

3. Weight.bulkScorer builds the scorer that evaluates the query clauses, and Lucene's handling of multi-clause queries is quite well optimized: the clauses are ordered by how many documents they match, with the rarest (lowest document frequency) clauses first and the most frequent ones last, which speeds up multi-clause queries considerably. This optimization will be covered in detail in a later article; a small sketch of the idea follows below.
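
The sketch below is illustrative only (it is not Lucene's actual ConjunctionScorer); it just shows why driving the intersection with the rarest clause first pays off: the longer posting lists are only probed at a few candidate doc ids.

import java.util.Arrays;
import java.util.Comparator;

public class ConjunctionSketch {
  // Intersect sorted posting lists. Driving the loop with the shortest (rarest) list means
  // the longer lists only get probed at a handful of candidate doc ids.
  static int[] intersect(int[][] postings) {
    Arrays.sort(postings, new Comparator<int[]>() {
      @Override public int compare(int[] a, int[] b) { return a.length - b.length; } // rarest first
    });
    int[] result = new int[postings[0].length];
    int n = 0;
    outer:
    for (int doc : postings[0]) {                      // iterate only the rarest list
      for (int i = 1; i < postings.length; i++) {
        if (Arrays.binarySearch(postings[i], doc) < 0) continue outer;
      }
      result[n++] = doc;
    }
    return Arrays.copyOf(result, n);
  }

  public static void main(String[] args) {
    int[][] postings = {
      {1, 2, 3, 5, 8, 13, 21, 34, 55, 89},             // frequent term
      {5, 21, 89},                                     // rare term: ends up driving the loop
      {2, 3, 5, 8, 21, 34, 89}                         // medium term
    };
    System.out.println(Arrays.toString(intersect(postings)));  // prints [5, 21, 89]
  }
}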

4. Finally, scorer.score(collector) is where the matching actually happens. Looking at the two methods below makes it clear how Lucene walks the matches and counts them.

    @Override
    public boolean score(Collector collector, int max) throws IOException {
      // TODO: this may be sort of weird, when we are
      // embedded in a BooleanScorer, because we are
      // called for every chunk of 2048 documents.  But,
      // then, scorer is a FakeScorer in that case, so any
      // Collector doing something "interesting" in
      // setScorer will be forced to use BS2 anyways:
      collector.setScorer(scorer);
      if (max == DocIdSetIterator.NO_MORE_DOCS) {
        scoreAll(collector, scorer);
        return false;
      } else {
        int doc = scorer.docID();
        if (doc < 0) {
          doc = scorer.nextDoc();
        }
        return scoreRange(collector, scorer, doc, max);
      }
    }

Lucene keeps pulling matching docs out of the segment and feeding them into the collector's HitQueue. Note that the collector parameter here is of type Collector, the parent class of TopDocsCollector and friends, so scoreAll is not tied to TopDocsCollector: it delivers doc ids to whatever collector the query is using.

    static void scoreAll(Collector collector, Scorer scorer) throws IOException {
      int doc;
      while ((doc = scorer.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
        collector.collect(doc);
      }
    }

Stepping into collector.collect(doc) shows how the TopDocsCollector accumulates doc ids, just as described earlier.

    @Override
    public void collect(int doc) throws IOException {
      float score = scorer.score();

      // This collector cannot handle these scores:
      assert score != Float.NEGATIVE_INFINITY;
      assert !Float.isNaN(score);

      totalHits++;
      if (score <= pqTop.score) {
        // Since docs are returned in-order (i.e., increasing doc Id), a document
        // with equal score to pqTop.score cannot compete since HitQueue favors
        // documents with lower doc Ids. Therefore reject those docs too.
        return;
      }
      pqTop.doc = doc + docBase;
      pqTop.score = score;
      pqTop = pq.updateTop();
    }
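
The pqTop comparison above works because the HitQueue is a fixed-size heap pre-filled with sentinel entries, so only docs that beat the current worst kept entry get inserted. The same top-N pattern can be sketched with a plain java.util.PriorityQueue (illustrative only; Lucene's HitQueue is its own specialized implementation):

import java.util.Comparator;
import java.util.PriorityQueue;

public class TopNSketch {
  // Keep only the n best {docId, score} pairs, mirroring what TopDocsCollector's HitQueue does.
  static float[][] topN(float[][] scoredDocs, int n) {
    // min-heap ordered by score, so peek() is always the worst entry currently kept
    PriorityQueue<float[]> pq = new PriorityQueue<float[]>(n, new Comparator<float[]>() {
      @Override public int compare(float[] a, float[] b) { return Float.compare(a[1], b[1]); }
    });
    for (float[] hit : scoredDocs) {
      if (pq.size() < n) {
        pq.offer(hit);
      } else if (hit[1] > pq.peek()[1]) {      // beats the current worst -> replace it
        pq.poll();
        pq.offer(hit);
      }
    }
    return pq.toArray(new float[0][]);
  }

  public static void main(String[] args) {
    float[][] hits = { {0, 1.2f}, {3, 0.4f}, {7, 2.5f}, {9, 0.9f} }; // {docId, score}
    for (float[] h : topN(hits, 2)) {
      System.out.println("doc=" + (int) h[0] + " score=" + h[1]);    // keeps docs 0 and 7
    }
  }
}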
Summary: this article walked through the non-sorted query flow in detail, touching on the following classes: QueryComponent, SolrIndexSearcher, TimeLimitingCollector, TopDocsCollector, IndexSearcher, BulkScorer, and Weight. For reasons of space it did not cover how doc ids are actually read from a segment, nor how multi-clause queries are implemented; those topics will be covered in detail in the next article on multi-clause queries.
