Defining Multiple Custom Analyzers in Elasticsearch

Analyzer
Every Elasticsearch analyzer, whether built-in or custom, is made up of three parts: character filters, a tokenizer, and token filters.

Analyzer workflow:

Input Text => Character Filters (applied in order if there are several) => Tokenizer => Token Filters (applied in order if there are several) => Output Tokens

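Character Filters
Character filters preprocess the raw text before it is tokenized, for example by stripping HTML tags or converting "&" to "and".

Note: when an analyzer has several character filters, they are applied in the order they are listed.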

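Tokenizer
The tokenizer breaks the string into a sequence of tokens, for example by splitting English text into words on whitespace. An analyzer has exactly one tokenizer.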

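Token Filters
Token filters further process the tokens emitted by the tokenizer, for example by changing case, removing stop words, normalizing singular and plural forms, or applying synonyms.

Note: when an analyzer has several token filters, they are applied in the order they are listed.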
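The three stages described above can be sketched in a few lines of Python. This is only an illustration of the data flow, not how Elasticsearch is implemented; the function names (`analyze`, `amp_to_and`, `whitespace_tokenizer`, `lowercase`, `remove_stopwords`) are hypothetical stand-ins.

```python
from typing import Callable, List

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def analyze(text: str,
            char_filters: List[CharFilter],
            tokenizer: Tokenizer,
            token_filters: List[TokenFilter]) -> List[str]:
    """Run text through char filters, one tokenizer, then token filters."""
    for char_filter in char_filters:      # applied in list order
        text = char_filter(text)
    tokens = tokenizer(text)              # exactly one tokenizer
    for token_filter in token_filters:    # applied in list order
        tokens = token_filter(tokens)
    return tokens


# Hypothetical stand-ins for real Elasticsearch components.
def amp_to_and(text: str) -> str:
    return text.replace("&", "and")


def whitespace_tokenizer(text: str) -> List[str]:
    return text.split()


def lowercase(tokens: List[str]) -> List[str]:
    return [t.lower() for t in tokens]


def remove_stopwords(tokens: List[str]) -> List[str]:
    return [t for t in tokens if t not in {"the", "a", "an"}]


sample = "The Cat & the Dog"
# lowercase first, then stop words: both "The" and "the" are removed
print(analyze(sample, [amp_to_and], whitespace_tokenizer,
              [lowercase, remove_stopwords]))
# stop words first, then lowercase: the capitalized "The" survives
print(analyze(sample, [amp_to_and], whitespace_tokenizer,
              [remove_stopwords, lowercase]))
```

The two calls differ only in the order of the token filters, which is why the order in which filters are listed matters.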

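Using the _analyze API
The _analyze API lets you check what an analyzer produces and, with explain enabled, see what each stage of the analysis contributes.

# text: the text to analyze
# explain: when true, include the details of each analysis stage
# char_filter: character filters
# tokenizer: tokenizer
# filter: token filters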

GET _analyze
{
  "char_filter" : ["html_strip"],
  "tokenizer": "standard",
  "filter":  [ "lowercase"],
  "text": "<p><em>No <b>dreams</b>, why bother <b>Beijing</b> !</em></p>",
  "explain" : true
}

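Defining multiple custom analyzers
Create an index with multiple custom analyzers
A single index can define several analyzers at once. Here the index defines two custom character filters, two tokenizers, and two token filters, and combines them into two analyzers.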

PUT my_index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "analysis": {
      "char_filter": { // custom character filters
        "my_charfilter1": {
          "type": "mapping",
          "mappings": ["& => and"]
        },
        "my_charfilter2": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      },
      "tokenizer": { // custom tokenizers
        "my_tokenizer1": {
          "type": "pattern",
          "pattern": "\\s+"
        },
        "my_tokenizer2": {
          "type": "pattern",
          "pattern": "_"
        }
      },
      "filter": { // custom token filters
        "my_tokenfilter1": {
          "type": "stop",
          "stopwords": ["the", "a","an"]
        },
        "my_tokenfilter2": {
          "type": "stop",
          "stopwords": ["info", "debug"]
        }
      },
      "analyzer": { // custom analyzers
         "my_analyzer1":{  // analyzer my_analyzer1
           "char_filter": ["html_strip", "my_charfilter1","my_charfilter2"],
           "tokenizer":"my_tokenizer1",
           "filter": ["lowercase", "my_tokenfilter1"]
         },
         "my_analyzer2":{  // analyzer my_analyzer2
           "char_filter": ["html_strip"],
           "tokenizer":"my_tokenizer2",
           "filter": ["my_tokenfilter2"]
         }
      }
    }
  }
}
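The two custom character filters are the least obvious part of these settings, so here is a rough Python equivalent. It is a sketch under assumptions: Python's `re` module stands in for the Java regex engine Elasticsearch uses, and the function names simply mirror the filter names in the settings.

```python
import re


def my_charfilter1(text: str) -> str:
    # "mapping" char filter with "& => and": every "&" becomes "and"
    return text.replace("&", "and")


def my_charfilter2(text: str) -> str:
    # "pattern_replace" char filter: a hyphen that sits between digits
    # becomes an underscore. The lookahead (?=\d) checks the next digit
    # without consuming it, so chains like 1-1-1 are fully rewritten.
    # Java's "$1_" replacement corresponds to r"\1_" in Python.
    return re.sub(r"(\d+)-(?=\d)", r"\1_", text)


for sample in ["1-1-1", "2-a", "a-b", "Tom & jerry"]:
    # apply the filters in the order my_analyzer1 lists them
    print(repr(sample), "->", repr(my_charfilter2(my_charfilter1(sample))))
```

Only hyphens with digits on both sides are rewritten; "2-a" and "a-b" pass through unchanged.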

Verify the analyzers of index my_index
Verify the output of analyzer my_analyzer1
GET /my_index/_analyze
{
  "text": "<b>Tom </b> & <b>jerry</b> in the room number 1-1-1",
  "analyzer": "my_analyzer1"//,
  //"explain": true
}

# Response
{
  "tokens": [
    {
      "token": "tom",
      "start_offset": 3,
      "end_offset": 6,
      "type": "word",
      "position": 0
    },
    {
      "token": "and",
      "start_offset": 12,
      "end_offset": 13,
      "type": "word",
      "position": 1
    },
    {
      "token": "jerry",
      "start_offset": 17,
      "end_offset": 26,
      "type": "word",
      "position": 2
    },
    {
      "token": "in",
      "start_offset": 27,
      "end_offset": 29,
      "type": "word",
      "position": 3
    },
    {
      "token": "room",
      "start_offset": 34,
      "end_offset": 38,
      "type": "word",
      "position": 5
    },
    {
      "token": "number",
      "start_offset": 39,
      "end_offset": 45,
      "type": "word",
      "position": 6
    },
    {
      "token": "1_1_1",
      "start_offset": 46,
      "end_offset": 51,
      "type": "word",
      "position": 7
    }
  ]
}
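Note that the positions in this response jump from 3 to 5: the stop word "the" is removed, but the position it occupied is kept. The following Python sketch reproduces the tokens and positions under simplifying assumptions: a naive regex stands in for `html_strip`, offsets are not tracked, and `my_analyzer1_sketch` is a hypothetical name.

```python
import re

STOPWORDS = {"the", "a", "an"}  # same list as my_tokenfilter1


def my_analyzer1_sketch(text: str):
    """Approximate my_analyzer1; returns (token, position) pairs."""
    # char filters, in the configured order
    text = re.sub(r"<[^>]+>", "", text)            # rough html_strip
    text = text.replace("&", "and")                # my_charfilter1
    text = re.sub(r"(\d+)-(?=\d)", r"\1_", text)   # my_charfilter2
    # tokenizer: split on runs of whitespace (my_tokenizer1)
    raw_tokens = [t for t in re.split(r"\s+", text) if t]
    # token filters: lowercase, then stop words (my_tokenfilter1)
    result = []
    for position, token in enumerate(raw_tokens):
        token = token.lower()
        if token in STOPWORDS:
            continue  # token dropped, but its position stays consumed
        result.append((token, position))
    return result


text = "<b>Tom </b> & <b>jerry</b> in the room number 1-1-1"
for token, position in my_analyzer1_sketch(text):
    print(position, token)
```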

Verify the output of analyzer my_analyzer2
GET /my_index/_analyze
{
  "text": "<b>debug_192.168.113.1_971213863506812928</b>",
  "analyzer": "my_analyzer2"//,
  //"explain": true
}

# Response
{
  "tokens": [
    {
      "token": "192.168.113.1",
      "start_offset": 9,
      "end_offset": 22,
      "type": "word",
      "position": 1
    },
    {
      "token": "971213863506812928",
      "start_offset": 23,
      "end_offset": 45,
      "type": "word",
      "position": 2
    }
  ]
}
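The same result can be approximated in Python. As before, this is a sketch with a naive regex in place of `html_strip`, and `my_analyzer2_sketch` is a hypothetical name. Because my_analyzer2 has no lowercase filter, and the stop filter is case-sensitive unless `ignore_case` is set, the sketch compares tokens exactly as written.

```python
import re

STOPWORDS = {"info", "debug"}  # same list as my_tokenfilter2


def my_analyzer2_sketch(text: str):
    """Approximate my_analyzer2; returns (token, position) pairs."""
    text = re.sub(r"<[^>]+>", "", text)              # rough html_strip
    raw_tokens = [t for t in text.split("_") if t]   # my_tokenizer2
    return [(token, position)
            for position, token in enumerate(raw_tokens)
            if token not in STOPWORDS]               # my_tokenfilter2


print(my_analyzer2_sketch("<b>debug_192.168.113.1_971213863506812928</b>"))
# An upper-case "DEBUG" is not in the stop list, so it is kept
print(my_analyzer2_sketch("<b>DEBUG_192.168.113.1</b>"))
```

Since the tokenizer splits only on underscores, the dots inside the IP address are left alone and the whole address becomes a single token.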

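Add a mapping and set a different analyzer for each field
Note: the requests in this and the following sections use a mapping type (my_type). Mapping types were deprecated in Elasticsearch 7.0 and removed in 8.0; on typeless versions the equivalent endpoints are PUT my_index/_mapping and PUT my_index/_doc/1.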
PUT my_index/_mapping/my_type
{
  "properties": {
    "my_field1": {
      "type": "text",
      "analyzer": "my_analyzer1",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    },
    "my_field2": {
      "type": "text",
      "analyzer": "my_analyzer2",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    }
  }
}

Index a document
PUT my_index/my_type/1
{
  "my_field1":"<b>Tom </b> & <b>jerry</b> in the room number 1-1-1",
  "my_field2":"<b>debug_192.168.113.1_971213863506812928</b>"
}

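Full-text search with a match query
At query time, Elasticsearch analyzes the query text with the analyzer configured for the field being searched, then looks up the resulting tokens in the index.

# Find documents whose my_field2 contains the IP 192.168.113.1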
GET my_index/_search
{
  "query": {
    "match": {
      "my_field2": "192.168.113.1"
    }
  }
}

# Response
{
  "took": 22,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "my_field1": "<b>Tom </b> & <b>jerry</b> in the room number 1-1-1",
          "my_field2": "<b>debug_192.168.113.1_971213863506812928</b>"
        }
      }
    ]
  }
}
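Why this query finds the document can be illustrated with a toy inverted index in Python. This is a deliberately simplified sketch: there is no scoring, a naive regex stands in for `html_strip`, and the names `analyze_field2`, `index_document`, and `match` are hypothetical. The point is that the same analyzer runs on the indexed value and on the query text, and a hit requires at least one identical token.

```python
import re
from collections import defaultdict

STOPWORDS = {"info", "debug"}


def analyze_field2(text: str):
    """Rough stand-in for my_analyzer2 (tokens only)."""
    text = re.sub(r"<[^>]+>", "", text)
    return [t for t in text.split("_") if t and t not in STOPWORDS]


# term -> set of document ids containing that term
inverted_index = defaultdict(set)


def index_document(doc_id: str, field_value: str) -> None:
    for term in analyze_field2(field_value):
        inverted_index[term].add(doc_id)


def match(query_text: str):
    """Return ids of documents sharing any analyzed query term (OR)."""
    hits = set()
    for term in analyze_field2(query_text):
        hits |= inverted_index.get(term, set())
    return hits


index_document("1", "<b>debug_192.168.113.1_971213863506812928</b>")
print(match("192.168.113.1"))  # the whole IP is one indexed term
print(match("192.168"))        # a partial IP is a different term
```

In this sketch the partial IP produces no hit, because my_analyzer2 never splits on dots and so "192.168" is not an indexed term.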

Original article: https://www.cnblogs.com/a-du/p/10455502.html

Date: 2024-12-22 05:00:33
