Using the extension dictionary and extension stop-word dictionary of the Elasticsearch IK analysis plugin

This article is adapted from https://blog.csdn.net/caideb/article/details/81632154

cnblogs renders the formatting much more nicely, so I am re-sharing it here as a separate post.

-----------------------------------------------------------------------------------------------

Words in the extension dictionary are singled out as tokens, while words in the extension stop-word dictionary are filtered out of the results.

1. Tokenization without the extension dictionary and stop-word dictionary

1) The ik analyzer

# curl -i -X GET -H 'Content-Type: application/json' -d '{"analyzer":"ik","text":"自古刀扇过背刺"}' http://192.168.0.110:9200/_analyze?pretty
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Content-Length: 725
{
  "tokens" : [ {
    "token" : "自古",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "CN_WORD",
    "position" : 0
  }, {
    "token" : "刀",
    "start_offset" : 2,
    "end_offset" : 3,
    "type" : "CN_WORD",
    "position" : 1
  }, {
    "token" : "扇",
    "start_offset" : 3,
    "end_offset" : 4,
    "type" : "CN_WORD",
    "position" : 2
  }, {
    "token" : "过",
    "start_offset" : 4,
    "end_offset" : 5,
    "type" : "CN_CHAR",
    "position" : 3
  }, {
    "token" : "背",
    "start_offset" : 5,
    "end_offset" : 6,
    "type" : "CN_WORD",
    "position" : 4
  }, {
    "token" : "刺",
    "start_offset" : 6,
    "end_offset" : 7,
    "type" : "CN_CHAR",
    "position" : 5
  } ]
}

2) The ik_smart analyzer

# curl -i -X GET -H 'Content-Type: application/json' -d '{"analyzer":"ik_smart","text":"自古刀扇过背刺"}' http://192.168.0.110:9200/_analyze?pretty
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Content-Length: 725
{
  "tokens" : [ {
    "token" : "自古",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "CN_WORD",
    "position" : 0
  }, {
    "token" : "刀",
    "start_offset" : 2,
    "end_offset" : 3,
    "type" : "CN_WORD",
    "position" : 1
  }, {
    "token" : "扇",
    "start_offset" : 3,
    "end_offset" : 4,
    "type" : "CN_WORD",
    "position" : 2
  }, {
    "token" : "过",
    "start_offset" : 4,
    "end_offset" : 5,
    "type" : "CN_CHAR",
    "position" : 3
  }, {
    "token" : "背",
    "start_offset" : 5,
    "end_offset" : 6,
    "type" : "CN_WORD",
    "position" : 4
  }, {
    "token" : "刺",
    "start_offset" : 6,
    "end_offset" : 7,
    "type" : "CN_CHAR",
    "position" : 5
  } ]
}

3) The ik_max_word analyzer

# curl -i -X GET -H 'Content-Type: application/json' -d '{"analyzer":"ik_max_word","text":"自古刀扇过背刺"}' http://192.168.0.110:9200/_analyze?pretty
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Content-Length: 725
{
  "tokens" : [ {
    "token" : "自古",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "CN_WORD",
    "position" : 0
  }, {
    "token" : "刀",
    "start_offset" : 2,
    "end_offset" : 3,
    "type" : "CN_WORD",
    "position" : 1
  }, {
    "token" : "扇",
    "start_offset" : 3,
    "end_offset" : 4,
    "type" : "CN_WORD",
    "position" : 2
  }, {
    "token" : "过",
    "start_offset" : 4,
    "end_offset" : 5,
    "type" : "CN_CHAR",
    "position" : 3
  }, {
    "token" : "背",
    "start_offset" : 5,
    "end_offset" : 6,
    "type" : "CN_WORD",
    "position" : 4
  }, {
    "token" : "刺",
    "start_offset" : 6,
    "end_offset" : 7,
    "type" : "CN_CHAR",
    "position" : 5
  } ]
}

Without the custom dictionaries, all three analyzers fall back to IK's bundled dictionary: 刀扇 and 背刺 are broken into single characters, while 自古 and 过 remain in the output.

2. Tokenization with the custom dictionaries added

Extension dictionary: defines extra terms that the tokenizer should emit as words.

Stop-word dictionary: defines terms to filter; every word or string listed in it is removed from the analysis output.

test.dic (one entry per line):

刀扇
背刺

teststop.dic:

自古
过

Both files are then registered in /analysis-ik/config/IKAnalyzer.cfg.xml.
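The original post does not reproduce the configuration file, so the following is a minimal sketch of a typical IKAnalyzer.cfg.xml, assuming the two .dic files sit in a custom/ subdirectory of the plugin's config directory:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- extension dictionary: extra words to recognize as tokens -->
    <entry key="ext_dict">custom/test.dic</entry>
    <!-- extension stop-word dictionary: words to filter out -->
    <entry key="ext_stopwords">custom/teststop.dic</entry>
</properties>

Restart Elasticsearch after editing the configuration; changes to local .dic files are only read at startup (the plugin's hot-reload mechanism applies to the remote_ext_dict and remote_ext_stopwords entries, not to local files).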

1) The ik analyzer

# curl -i -X GET -H 'Content-Type: application/json' -d '{"analyzer":"ik","text":"自古刀扇过背刺"}' http://192.168.0.110:9200/_analyze?pretty
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Content-Length: 728
{
  "tokens" : [ {
    "token" : "刀扇",
    "start_offset" : 2,
    "end_offset" : 4,
    "type" : "CN_WORD",
    "position" : 0
  }, {
    "token" : "刀",
    "start_offset" : 2,
    "end_offset" : 3,
    "type" : "CN_WORD",
    "position" : 1
  }, {
    "token" : "扇",
    "start_offset" : 3,
    "end_offset" : 4,
    "type" : "CN_WORD",
    "position" : 2
  }, {
    "token" : "背刺",
    "start_offset" : 5,
    "end_offset" : 7,
    "type" : "CN_WORD",
    "position" : 3
  }, {
    "token" : "背",
    "start_offset" : 5,
    "end_offset" : 6,
    "type" : "CN_WORD",
    "position" : 4
  }, {
    "token" : "刺",
    "start_offset" : 6,
    "end_offset" : 7,
    "type" : "CN_CHAR",
    "position" : 5
  } ]
}

2) The ik_smart analyzer

# curl -i -X GET -H 'Content-Type: application/json' -d '{"analyzer":"ik_smart","text":"自古刀扇过背刺"}' http://192.168.0.110:9200/_analyze?pretty
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Content-Length: 260
{
  "tokens" : [ {
    "token" : "刀扇",
    "start_offset" : 2,
    "end_offset" : 4,
    "type" : "CN_WORD",
    "position" : 0
  }, {
    "token" : "背刺",
    "start_offset" : 5,
    "end_offset" : 7,
    "type" : "CN_WORD",
    "position" : 1
  } ]
}

3) The ik_max_word analyzer

# curl -i -X GET -H 'Content-Type: application/json' -d '{"analyzer":"ik_max_word","text":"自古刀扇过背刺"}' http://192.168.0.110:9200/_analyze?pretty
HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Content-Length: 728
{
  "tokens" : [ {
    "token" : "刀扇",
    "start_offset" : 2,
    "end_offset" : 4,
    "type" : "CN_WORD",
    "position" : 0
  }, {
    "token" : "刀",
    "start_offset" : 2,
    "end_offset" : 3,
    "type" : "CN_WORD",
    "position" : 1
  }, {
    "token" : "扇",
    "start_offset" : 3,
    "end_offset" : 4,
    "type" : "CN_WORD",
    "position" : 2
  }, {
    "token" : "背刺",
    "start_offset" : 5,
    "end_offset" : 7,
    "type" : "CN_WORD",
    "position" : 3
  }, {
    "token" : "背",
    "start_offset" : 5,
    "end_offset" : 6,
    "type" : "CN_WORD",
    "position" : 4
  }, {
    "token" : "刺",
    "start_offset" : 6,
    "end_offset" : 7,
    "type" : "CN_CHAR",
    "position" : 5
  } ]
}
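With the dictionaries in place, 自古 and 过 are filtered out, and 刀扇 / 背刺 are recognized as words. To apply the analyzers beyond _analyze testing, they can be set on a field mapping. The request below is a hypothetical sketch (the index name test_index and the field title are illustrative, and the body assumes Elasticsearch 7.x or later; older versions also require a mapping type):

# curl -i -X PUT -H 'Content-Type: application/json' -d '{
    "mappings": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_smart"
        }
      }
    }
  }' http://192.168.0.110:9200/test_index?pretty

Pairing ik_max_word at index time with ik_smart at search time is a common choice: the index keeps the fine-grained tokens, while queries are tokenized coarsely for better precision.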

Original article: https://www.cnblogs.com/geektcp/p/12263101.html
