First, let's test tokenization, especially Chinese word segmentation. This has always been a pain point for traditional databases such as MySQL and SQL Server.
Open a browser, go to http://localhost:5601, click the Dev Tools item, and enter the following in the Console:
POST _analyze { "analyzer": "standard", "text":"Hello World ElasticSearch" }
The response is displayed on the right:
{ "tokens": [ { "token": "hello", "start_offset": 0, "end_offset": 5, "type": "<ALPHANUM>", "position": 0 }, { "token": "world", "start_offset": 6, "end_offset": 11, "type": "<ALPHANUM>", "position": 1 }, { "token": "elasticsearch", "start_offset": 12, "end_offset": 25, "type": "<ALPHANUM>", "position": 2 } ] }
Everything looks good so far. Now let's add some Chinese and see what happens.
POST _analyze { "analyzer": "standard", "text":"ElasticSearch是一个很不错的全文检索软件。" }
The result is:
{ "tokens": [ { "token": "elasticsearch", "start_offset": 0, "end_offset": 13, "type": "<ALPHANUM>", "position": 0 }, { "token": "是", "start_offset": 13, "end_offset": 14, "type": "<IDEOGRAPHIC>", "position": 1 }, { "token": "一", "start_offset": 14, "end_offset": 15, "type": "<IDEOGRAPHIC>", "position": 2 }, { "token": "个", "start_offset": 15, "end_offset": 16, "type": "<IDEOGRAPHIC>", "position": 3 }, { "token": "很", "start_offset": 16, "end_offset": 17, "type": "<IDEOGRAPHIC>", "position": 4 }, { "token": "不", "start_offset": 17, "end_offset": 18, "type": "<IDEOGRAPHIC>", "position": 5 }, { "token": "错", "start_offset": 18, "end_offset": 19, "type": "<IDEOGRAPHIC>", "position": 6 }, { "token": "的", "start_offset": 19, "end_offset": 20, "type": "<IDEOGRAPHIC>", "position": 7 }, { "token": "全", "start_offset": 20, "end_offset": 21, "type": "<IDEOGRAPHIC>", "position": 8 }, { "token": "文", "start_offset": 21, "end_offset": 22, "type": "<IDEOGRAPHIC>", "position": 9 }, { "token": "检", "start_offset": 22, "end_offset": 23, "type": "<IDEOGRAPHIC>", "position": 10 }, { "token": "索", "start_offset": 23, "end_offset": 24, "type": "<IDEOGRAPHIC>", "position": 11 }, { "token": "软", "start_offset": 24, "end_offset": 25, "type": "<IDEOGRAPHIC>", "position": 12 }, { "token": "件", "start_offset": 25, "end_offset": 26, "type": "<IDEOGRAPHIC>", "position": 13 } ] }
This is clearly unacceptable: every Chinese character gets split off on its own, which is basically unusable. A quick Google search turned up an analyzer option called chinese, but testing it gives the same result. Most search results point to the smartcn or IK Analyzer plugin, and some articles and books recommend IK Analyzer, but those are all based on older versions of ES. I checked the IK Analyzer repository on GitHub and it looked abandoned, so let's go with the officially recommended smartcn instead. Downloading and installing it works the same as for any other plugin, and I still recommend installing from an offline package (a sketch of both install methods follows). After installation, the ES service has to be restarted for the plugin to take effect.
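For reference, the install typically looks like this. A minimal sketch using the standard elasticsearch-plugin command; the offline file path and version number are placeholders to replace with your actual download:

# online install (fetches the plugin matching your ES version)
bin/elasticsearch-plugin install analysis-smartcn

# offline install from a previously downloaded zip (path and version are placeholders)
bin/elasticsearch-plugin install file:///path/to/analysis-smartcn-6.x.x.zip

With smartcn installed and the ES service restarted, let's run the analysis again: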
POST _analyze { "analyzer": "smartcn", "text":"ElasticSearch是一个很不错的全文检索软件。" }
{ "tokens": [ { "token": "elasticsearch", "start_offset": 0, "end_offset": 13, "type": "word", "position": 0 }, { "token": "是", "start_offset": 13, "end_offset": 14, "type": "word", "position": 1 }, { "token": "一个", "start_offset": 14, "end_offset": 16, "type": "word", "position": 2 }, { "token": "很", "start_offset": 16, "end_offset": 17, "type": "word", "position": 3 }, { "token": "不错", "start_offset": 17, "end_offset": 19, "type": "word", "position": 4 }, { "token": "的", "start_offset": 19, "end_offset": 20, "type": "word", "position": 5 }, { "token": "全文", "start_offset": 20, "end_offset": 22, "type": "word", "position": 6 }, { "token": "检索", "start_offset": 22, "end_offset": 24, "type": "word", "position": 7 }, { "token": "软件", "start_offset": 24, "end_offset": 26, "type": "word", "position": 8 } ] }
Now that looks much more reasonable. :)
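Of course, _analyze only tests the tokenizer; to affect real searches, the analyzer has to be set on a field mapping. A minimal sketch, assuming ES 7 or later (typeless mappings); the index name my_docs and the field content are made up for this example:

PUT my_docs
{
  "mappings": {
    "properties": {
      "content": { "type": "text", "analyzer": "smartcn" }
    }
  }
}

POST my_docs/_search
{
  "query": {
    "match": { "content": "全文检索" }
  }
}

A match query runs the query string through the field's analyzer, so 全文检索 is segmented into 全文 and 检索 and matched as words rather than as isolated characters.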