Elasticsearch Search: An Analysis of most_fields

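As the name suggests, with most_fields the more fields that match the query terms, the higher the score. A boost weight can also be set per field.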

Below is a simplified formula (for the detailed scoring algorithm, see http://m.blog.csdn.net/article/details?id=50623948):

score = match_field1_score * boost1 + match_field2_score * boost2 + ... + match_fieldN_score * boostN
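The simplified formula can be sketched in a few lines of Python. This is only an illustration of the sum-of-fields idea, not how Elasticsearch computes scores internally. The function name `most_fields_score` and the per-field numbers are hypothetical, chosen to show how several weak matches can outscore one strong match.

```python
def most_fields_score(field_scores, boosts=None):
    """Simplified most_fields score: sum of each matching field's score times its boost.

    field_scores: dict of field name -> score for fields that matched.
    boosts: optional dict of field name -> boost (defaults to 1.0 per field).
    """
    boosts = boosts or {}
    return sum(score * boosts.get(field, 1.0) for field, score in field_scores.items())


# Hypothetical numbers: one document matches weakly in three fields,
# another matches strongly in a single field.
redundant_doc = {"title": 0.3, "address": 0.1, "businessDistrict": 0.3}
concise_doc = {"address": 0.5}

print(most_fields_score(redundant_doc))
print(most_fields_score(concise_doc))
print(most_fields_score(concise_doc, boosts={"address": 2.0}))
```

With equal boosts the document that matches in three fields wins. Raising the boost on one field is the only lever the formula offers to counter that.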

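In many cases this kind of search works well, but it has a weakness: when a document's fields carry a lot of redundant information, documents that are more concise yet more complete in meaning end up with comparatively lower scores.

In addition, operator and minimum_should_match are applied to each field separately rather than across fields, so they cannot be used to trim the long tail of low-relevance documents. Simply put, the document with the most term matches wins.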
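A rough sketch of why minimum_should_match does little to trim the tail here: a percentage is converted to a required number of terms by rounding down, and the check happens inside each field on its own. The helper names and the token lists below are hypothetical. The floor of one matching term is an assumption that reflects a query made up only of optional terms.

```python
def required_matches(num_terms, percent):
    """Required matching terms for a positive percentage, rounded down.

    Assumption: at least one term must always match.
    """
    return max(1, (num_terms * percent) // 100)


def field_passes(field_tokens, query_terms, percent):
    """Check one field on its own, the way a per-field rule would."""
    matched = sum(1 for term in query_terms if term in field_tokens)
    return matched >= required_matches(len(query_terms), percent)


query_terms = ["北京", "东路"]

# Hypothetical token lists for one document; real tokens depend on the analyzer.
doc = {
    "title": ["城市", "公牛", "南京", "东路", "店"],
    "address": ["南京", "东路", "300", "号"],
    "businessDistrict": ["南京", "东路"],
}

passing = [name for name, tokens in doc.items() if field_passes(tokens, query_terms, 70)]
print(required_matches(3, 70), required_matches(2, 70))
print(passing)
```

Because a two-term query at 70% only needs one term per field, a document that contains just “东路” passes in every field that mentions it.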

Example:

Search for the keyword “北京东路” (Beijing East Road). From the tokenization result below, we can see that its terms are “北京” and “东路”:

curl 'localhost:9200/fullbiz_index/_analyze?analyzer=ik_smart&pretty=true' -d '{"text":"北京东路"}'
{
   "tokens" : [
      {
         "token" : "text",
         "start_offset" : 2,
         "end_offset" : 6,
         "type" : "ENGLISH",
         "position" : 1
      },
      {
         "token" : "北京",
         "start_offset" : 9,
         "end_offset" : 11,
         "type" : "CN_WORD",
         "position" : 2
      },
      {
         "token" : "东路",
         "start_offset" : 11,
         "end_offset" : 13,
         "type" : "CN_WORD",
         "position" : 3
      }
   ]
}
curl 'localhost:9200/fullbiz1/fullbizinfo/_search?pretty' -d '
{
  "from" : 0,
  "size" : 20,
  "query" : {
    "multi_match" : {
      "query" : "北京东路",
      "fields" : [ "title", "highlight", "tags", "address", "businessDistrict", "cuisineStyle" ],
      "type" : "most_fields",
      "minimum_should_match" : "70%", // Minimum share of query terms that must match; the computed count is rounded down. With three terms, 70% gives 2.1, so two must match. With two terms or fewer, one match is enough. So “北京东路” scores as soon as either “北京” or “东路” matches.
      "analyzer" : "ik_smart" // ik has two modes: ik_max_word (finest-grained segmentation) and ik_smart (coarsest-grained segmentation). The second is used here to get closer to the business expectation.
    }
  },
  "post_filter" : {
    "bool" : {
      "must" : [ {
        "term" : {
          "status" : 0
        }
      }, {
        "term" : {
          "hostDisplay" : 1
        }
      }, {
        "term" : {
          "cityId" : 2
        }
      }, {
        "term" : {
          "productType" : 3
        }
      } ]
    }
  }
}'
 
    "hits" : [ {
      "_index" : "fullbiz1",
      "_type" : "fullbizinfo",
      "_id" : "324239",
      "_score" : 0.33371,
      "_source":{"boost":1,"productId":24239,"productType":3,"subType":2,"title":"城市公牛(南京东路店)","viceTitle":"城市公牛(南京东路店)","personMax":"-1","personMin":"-1","picUrl":"meal/2016/08/11/1470892987880.jpg","recommand":-1,"needReserveTime":-1,"priceStr":"-1","price":"-1","originalPrice":"-1","leadingMinutes":-1,"tags":null,"status":0,"isFree":-1,"duration":"10:00:00-22:30:00","onlineTime":1470280723,"updateTime":1486951326,"applyExpiredTime":0,"beginTime":0,"endTime":0,"isCourse":-1,"isTour":-1,"supportParty":0,"interestedNum":0,"cityId":2,"cityName":"上海","categoryId":"0","categoryName":"","categoryIconUrl":"","businessDistrict":"南京东路","businessDistrictId":73,"hostId":24239,"contactNumber":"13764741956","hostName":"城市公牛(南京东路店)","address":"南京东路300号L221-222室(河南中路口)","hostDisplay":1,"hostPicUrl":"meal/2016/08/11/1470892987880.jpg","hostSharePicUrl":"meal/2016/08/11/1470892987880.jpg","hostLatitude":"31.243455970586","hostLongitude":"121.49099099941","location":{"lat":"31.243455970586","lon":"121.49099099941"},"hostLatitudeGD":"31.237701","hostLongitudeGD":"121.484409","locationGD":{"lat":"31.237701","lon":"121.484409"},"headPics":"","catalogIds":null,"cuisineStyleId":41,"cuisineStyle":"西餐","hideMask":0,"referenceAgeMin":0,"referenceAgeMax":0,"userLimit":-1,"todayReservable":1,"orderNums":3,"pvConversionRate":"-1","interestNums":0,"hotPoints":0,"hostAvgPrice":16000,"hostProductLabelIds":",1,2,4,5,7,8,9,12,13,14,15,","shopPay":0,"hostVipEquities":"0","isHostSale":0,"highlight":"[\"2010年世博会加拿大馆特约餐厅\",\"加拿大简约西部乡村风格小酒馆餐厅\",\"家庭式的用餐氛围 80%均是外国食客\"]","isSeatBook":1,"lastUTCTimestamp":"2017-02-13T10:02:06.000+08:00"}
    }, {
      "_index" : "fullbiz1",
      "_type" : "fullbizinfo",
      "_id" : "392659",
      "_score" : 0.31962717,
      "_source":{"boost":1,"productId":92659,"productType":3,"subType":4,"title":"THAIBEAUTY美容连锁机构(南京东路店)","viceTitle":"THAIBEAUTY美容连锁机构(南京东路店)","personMax":"-1","personMin":"-1","picUrl":"hostInfo/2017/01/11/1484121279773528.jpg","recommand":-1,"needReserveTime":-1,"priceStr":"-1","price":"-1","originalPrice":"-1","leadingMinutes":-1,"tags":"","status":0,"isFree":-1,"duration":null,"onlineTime":1484121281,"updateTime":1484202471,"applyExpiredTime":0,"beginTime":0,"endTime":0,"isCourse":-1,"isTour":-1,"supportParty":0,"interestedNum":0,"cityId":2,"cityName":"上海","categoryId":"0","categoryName":"","categoryIconUrl":"","businessDistrict":"南京东路","businessDistrictId":73,"hostId":92659,"contactNumber":"021-63511876","hostName":"THAIBEAUTY美容连锁机构(南京东路店)","address":"南京东路580号6楼","hostDisplay":1,"hostPicUrl":"hostInfo/2017/01/11/1484121279773528.jpg","hostSharePicUrl":"hostInfo/2017/01/11/1484121279773528.jpg","hostLatitude":"31.241721400027","hostLongitude":"121.48585125776","location":{"lat":"31.241721400027","lon":"121.48585125776"},"hostLatitudeGD":"31.235887","hostLongitudeGD":"121.479289","locationGD":{"lat":"31.235887","lon":"121.479289"},"headPics":"","catalogIds":null,"cuisineStyleId":0,"cuisineStyle":"美容/SPA","hideMask":-1,"referenceAgeMin":0,"referenceAgeMax":0,"userLimit":-1,"todayReservable":0,"orderNums":0,"pvConversionRate":"-1","interestNums":0,"hotPoints":0,"hostAvgPrice":284500,"hostProductLabelIds":",60,","shopPay":0,"hostVipEquities":"0","isHostSale":0,"highlight":"[\"高端局部瘦身\",\"环境舒适 按摩师手法专业\",\"使用高品质产品\"]","isSeatBook":1,"lastUTCTimestamp":"2017-01-12T14:27:51.000+08:00"}
    }, {
      "_index" : "fullbiz1",
      "_type" : "fullbizinfo",
      "_id" : "364804",
      "_score" : 0.31002828,
      "_source":{"boost":1,"productId":64804,"productType":3,"subType":2,"title":"斗牛士(南京东路店)","viceTitle":"斗牛士(南京东路店)","personMax":"-1","personMin":"-1","picUrl":"hostInfo/2016/12/26/1482718008927949.png","recommand":-1,"needReserveTime":-1,"priceStr":"-1","price":"-1","originalPrice":"-1","leadingMinutes":-1,"tags":"","status":0,"isFree":-1,"duration":null,"onlineTime":1482718014,"updateTime":1486569730,"applyExpiredTime":0,"beginTime":0,"endTime":0,"isCourse":-1,"isTour":-1,"supportParty":0,"interestedNum":0,"cityId":2,"cityName":"上海","categoryId":"0","categoryName":"","categoryIconUrl":"","businessDistrict":"南京东路","businessDistrictId":73,"hostId":64804,"contactNumber":"021-33317136","hostName":"斗牛士(南京东路店)","address":"南京东路353号悦荟广场(原353店)7F","hostDisplay":1,"hostPicUrl":"hostInfo/2016/12/26/1482718008927949.png","hostSharePicUrl":"hostInfo/2016/12/26/1482718008927949.png","hostLatitude":"31.24210523683","hostLongitude":"121.49020262932","location":{"lat":"31.24210523683","lon":"121.49020262932"},"hostLatitudeGD":"31.236339","hostLongitudeGD":"121.483623","locationGD":{"lat":"31.236339","lon":"121.483623"},"headPics":"","catalogIds":null,"cuisineStyleId":41,"cuisineStyle":"西餐","hideMask":-1,"referenceAgeMin":0,"referenceAgeMax":0,"userLimit":-1,"todayReservable":0,"orderNums":0,"pvConversionRate":"-1","interestNums":0,"hotPoints":0,"hostAvgPrice":12200,"hostProductLabelIds":",1,","shopPay":0,"hostVipEquities":"0","isHostSale":0,"highlight":"[\"精选进口澳洲安格斯牛排\",\"严控0度低温 保证牛肉鲜嫩\",\"进口原切牛排保证牛肉口感与外观\"]","isSeatBook":1,"lastUTCTimestamp":"2017-02-09T00:02:10.000+08:00"}
.....
      "_index" : "fullbiz1",
      "_type" : "fullbizinfo",
      "_id" : "353771",
      "_score" : 0.7784657,
      "_source":{"boost":1,"productId":53771,"productType":3,"subType":2,"title":"九储堂创意中国菜(外滩店)","viceTitle":"九储堂创意中国菜(外滩店)","personMax":"-1","personMin":"-1","picUrl":"hostInfo/2016/12/26/1482744127546461.jpg","recommand":-1,"needReserveTime":-1,"priceStr":"-1","price":"-1","originalPrice":"-1","leadingMinutes":-1,"tags":"","status":0,"isFree":-1,"duration":null,"onlineTime":1482744132,"updateTime":1486738928,"applyExpiredTime":0,"beginTime":0,"endTime":0,"isCourse":-1,"isTour":-1,"supportParty":0,"interestedNum":0,"cityId":2,"cityName":"上海","categoryId":"0","categoryName":"","categoryIconUrl":"","businessDistrict":"外滩","businessDistrictId":71,"hostId":53771,"contactNumber":"021-63308900","hostName":"九储堂创意中国菜(外滩店)","address":"北京东路398号新协通国际大酒店18楼","hostDisplay":1,"hostPicUrl":"hostInfo/2016/12/26/1482744127546461.jpg","hostSharePicUrl":"hostInfo/2016/12/26/1482744127546461.jpg","hostLatitude":"31.246247363994","hostLongitude":"121.48894308136","location":{"lat":"31.246247363994","lon":"121.48894308136"},"hostLatitudeGD":"31.240463","hostLongitudeGD":"121.48237","locationGD":{"lat":"31.240463","lon":"121.48237"},"headPics":"","catalogIds":null,"cuisineStyleId":25,"cuisineStyle":"创意菜","hideMask":-1,"referenceAgeMin":0,"referenceAgeMax":0,"userLimit":-1,"todayReservable":0,"orderNums":0,"pvConversionRate":"-1","interestNums":0,"hotPoints":0,"hostAvgPrice":19100,"hostProductLabelIds":",1,","shopPay":0,"hostVipEquities":"0","isHostSale":0,"highlight":"[\"新加坡同乐餐饮总厨胡于保先生主理\",\"大厅可容纳150人的宴会 包房5间\",\"靠窗座位亦可欣赏浦江两岸美景\"]","isSeatBook":1,"lastUTCTimestamp":"2017-02-10T23:02:08.000+08:00"}

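Yet the document that contains the complete phrase “北京东路” is ranked further down, which does not seem reasonable. Why this result? Let us use explain to look at how the score is computed: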

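curl 'localhost:9200/fullbiz1/fullbizinfo/_search?pretty&explain' .... (the rest is omitted; it is the same request as above, with explain added and size limited to the first hit, because the output is long and we only analyze one document). Here is the scoring part: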

      "_explanation" : {
        "value" : 0.33371,
        "description" : "product of:",
        "details" : [ {
          "value" : 0.66742,
          "description" : "sum of:",
          "details" : [ {
            "value" : 0.28481156,
            "description" : "product of:",
            "details" : [ {
              "value" : 0.5696231,
              "description" : "sum of:",
              "details" : [ {
                "value" : 0.5696231,
                "description" : "weight(title:东路 in 7321) [PerFieldSimilarity], result of:",
                "details" : [ {
                  "value" : 0.5696231,
                  "description" : "score(doc=7321,freq=1.0), product of:",
                  "details" : [ {
                    "value" : 0.25448462,
                    "description" : "queryWeight, product of:",
                    "details" : [ {
                      "value" : 7.1626873,
                      "description" : "idf(docFreq=244, maxDocs=116302)"
                    }, {
                      "value" : 0.03552921,
                      "description" : "queryNorm"
                    } ]
                  }, {
                    "value" : 2.23834,
                    "description" : "fieldWeight in 7321, product of:",
                    "details" : [ {
                      "value" : 1.0,
                      "description" : "tf(freq=1.0), with freq of:",
                      "details" : [ {
                        "value" : 1.0,
                        "description" : "termFreq=1.0"
                      } ]
                    }, {
                      "value" : 7.1626873,
                      "description" : "idf(docFreq=244, maxDocs=116302)"
                    }, {
                      "value" : 0.3125,
                      "description" : "fieldNorm(doc=7321)"
                    } ]
                  } ]
                } ]
              } ]
            }, {
              "value" : 0.5,
              "description" : "coord(1/2)"
            } ]
          }, {
            "value" : 0.067192085,
            "description" : "product of:",
            "details" : [ {
              "value" : 0.13438417,
              "description" : "sum of:",
              "details" : [ {
                "value" : 0.13438417,
                "description" : "weight(address:东路 in 7321) [PerFieldSimilarity], result of:",
                "details" : [ {
                  "value" : 0.13438417,
                  "description" : "score(doc=7321,freq=1.0), product of:",
                  "details" : [ {
                    "value" : 0.1477382,
                    "description" : "queryWeight, product of:",
                    "details" : [ {
                      "value" : 4.158218,
                      "description" : "idf(docFreq=4942, maxDocs=116302)"
                    }, {
                      "value" : 0.03552921,
                      "description" : "queryNorm"
                    } ]
                  }, {
                    "value" : 0.90961015,
                    "description" : "fieldWeight in 7321, product of:",
                    "details" : [ {
                      "value" : 1.0,
                      "description" : "tf(freq=1.0), with freq of:",
                      "details" : [ {
                        "value" : 1.0,
                        "description" : "termFreq=1.0"
                      } ]
                    }, {
                      "value" : 4.158218,
                      "description" : "idf(docFreq=4942, maxDocs=116302)"
                    }, {
                      "value" : 0.21875,
                      "description" : "fieldNorm(doc=7321)"
                    } ]
                  } ]
                } ]
              } ]
            }, {
              "value" : 0.5,
              "description" : "coord(1/2)"
            } ]
          }, {
            "value" : 0.3154164,
            "description" : "product of:",
            "details" : [ {
              "value" : 0.6308328,
              "description" : "sum of:",
              "details" : [ {
                "value" : 0.6308328,
                "description" : "weight(businessDistrict:东路 in 7321) [PerFieldSimilarity], result of:",
                "details" : [ {
                  "value" : 0.6308328,
                  "description" : "score(doc=7321,freq=1.0), product of:",
                  "details" : [ {
                    "value" : 0.22633977,
                    "description" : "queryWeight, product of:",
                    "details" : [ {
                      "value" : 6.3705263,
                      "description" : "idf(docFreq=540, maxDocs=116302)"
                    }, {
                      "value" : 0.03552921,
                      "description" : "queryNorm"
                    } ]
                  }, {
                    "value" : 2.7871053,
                    "description" : "fieldWeight in 7321, product of:",
                    "details" : [ {
                      "value" : 1.0,
                      "description" : "tf(freq=1.0), with freq of:",
                      "details" : [ {
                        "value" : 1.0,
                        "description" : "termFreq=1.0"
                      } ]
                    }, {
                      "value" : 6.3705263,
                      "description" : "idf(docFreq=540, maxDocs=116302)"
                    }, {
                      "value" : 0.4375,
                      "description" : "fieldNorm(doc=7321)"
                    } ]
                  } ]
                } ]
              } ]
            }, {
              "value" : 0.5,
              "description" : "coord(1/2)"
            } ]
          } ]
        }, {
          "value" : 0.5,
          "description" : "coord(3/6)"
        } ]
      }
    } ]
  }
}
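The explain tree above can be recomputed by hand. The sketch below plugs in the idf, queryNorm and fieldNorm values shown in the output and follows the same product and sum structure. The variable and function names are hypothetical and only serve this illustration.

```python
# Values copied from the explain output above.
QUERY_NORM = 0.03552921

FIELDS = {
    "title": {"idf": 7.1626873, "field_norm": 0.3125},
    "address": {"idf": 4.158218, "field_norm": 0.21875},
    "businessDistrict": {"idf": 6.3705263, "field_norm": 0.4375},
}


def term_score(idf, field_norm, tf=1.0, query_norm=QUERY_NORM):
    """score = queryWeight * fieldWeight, as in the explain tree."""
    query_weight = idf * query_norm        # idf * queryNorm
    field_weight = tf * idf * field_norm   # tf * idf * fieldNorm
    return query_weight * field_weight


# Each field matched only one of the two query terms ("东路"): coord(1/2).
per_field = {name: term_score(**stats) * (1 / 2) for name, stats in FIELDS.items()}

# Three of the six queried fields matched: coord(3/6).
total = sum(per_field.values()) * (3 / 6)

for name, value in per_field.items():
    print(name, round(value, 6))
print("total", round(total, 5))
```

The total is built entirely from three separate matches on “东路”. No field matched “北京” at all.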

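The analysis above shows that the documents containing “南京东路” rank first not because they match the query better, but because more of their fields match. That is why they score higher than the document further down, where “北京东路” appears in only one field.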

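Summary: most_fields suits searches where the fields carry substantially different information. In the case above, “东路” appears in the title and again in the business district and the address, so the fields are highly redundant and most_fields is a poor fit.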

Time: 2024-10-28 14:43:17
