Lucene Series: (6) Analyzers

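1. What is an analyzer

An analyzer applies an algorithm that splits Chinese or English text into terms, so that the terms can be matched against the keywords a user enters when searching.

2. Why an analyzer is needed

What a user types into the search box is usually a single keyword taken from a longer passage, so it rarely matches the content of the original table exactly. A search engine still has to return the relevant content, and it relies on an analyzer to match the original content as fully as possible.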

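3. How an analyzer works

(1) Split the text into terms according to the analyzer's rules

(2) Remove stop words and forbidden words

(3) If the text contains English, convert the letters to lowercase, so that search is case-insensitive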
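
The three steps above can be sketched in plain Java without any Lucene dependency. This is only an illustration of the idea, not how Lucene implements it: the class name ToyAnalyzer and its stop-word list are hypothetical, and the sketch lowercases each term before the stop-word check so that the check itself is case-insensitive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class ToyAnalyzer {
    // Hypothetical stop-word list, for illustration only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "it", "of"));

    public static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Split on every run of characters that are neither letters nor digits.
        for (String token : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (token.isEmpty()) {
                continue; // a leading separator produces one empty token
            }
            // Lowercase so that matching does not depend on case.
            String term = token.toLowerCase(Locale.ROOT);
            // Drop stop words.
            if (!STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [mozilla, firefox, web, browser]
        System.out.println(analyze("Mozilla Firefox is a Web Browser"));
    }
}
```

Splitting on separators is enough for English, but Chinese has no spaces between words, which is why the choice of analyzer matters so much for Chinese text.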

4. Testing common analyzers

This test requires IKAnalyzer3.2.0Stable.jar on the classpath.

package com.rk.lucene.b_analyzer;

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.cn.ChineseAnalyzer;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.ru.RussianAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

import com.rk.lucene.utils.LuceneUtils;

/**
 * Tests the tokenization results of Lucene's built-in analyzers and a third-party analyzer
 */
public class TestAnalyzer {
	private static void testAnalyzer(Analyzer analyzer, String text) throws Exception{
		System.out.println("Current analyzer: " + analyzer.getClass());
		TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(text));
		// addAttribute returns the attribute instance, so fetch it once before the loop
		TermAttribute termAttribute = tokenStream.addAttribute(TermAttribute.class);
		while(tokenStream.incrementToken()){
			// After each incrementToken() call the attribute holds the current term
			System.out.print(termAttribute.term() + " |");
		}
		System.out.println();
	}

	public static void main(String[] args) throws Exception {
		// Lucene's built-in analyzers
		System.out.println("------------------------Lucene built-in analyzers");
		testAnalyzer(new StandardAnalyzer(LuceneUtils.getVersion()), "Mozilla Firefox，中文俗称“火狐”，是一个自由及开放源代码的网页浏览器it呀");
		testAnalyzer(new FrenchAnalyzer(LuceneUtils.getVersion()), "Mozilla Firefox，中文俗称“火狐”，是一个自由及开放源代码的网页浏览器it呀");
		testAnalyzer(new FrenchAnalyzer(LuceneUtils.getVersion()), "Mozilla Firefox，中文俗称“火狐”，是一个自由及开放源代码的网页浏览器how are you呀");
		testAnalyzer(new RussianAnalyzer(LuceneUtils.getVersion()), "Mozilla Firefox，中文俗称“火狐”，是一个自由及开放源代码的网页浏览器it呀");
		testAnalyzer(new CJKAnalyzer(LuceneUtils.getVersion()), "Mozilla Firefox，中文俗称“火狐”，是一个自由及开放源代码的网页浏览器it呀");
		testAnalyzer(new ChineseAnalyzer(), "Mozilla Firefox，中文俗称“火狐”，是一个自由及开放源代码的网页浏览器it呀");
		// The IKAnalyzer analyzer
		System.out.println("------------------------IKAnalyzer");
		testAnalyzer(new IKAnalyzer(), "Mozilla Firefox，中文俗称“火狐”，是一个自由及开放源代码的网页浏览器it呀");
		testAnalyzer(new IKAnalyzer(), "上海自来水来自海上");
	}
}

Output:

------------------------Lucene built-in analyzers
Current analyzer: class org.apache.lucene.analysis.standard.StandardAnalyzer
mozilla |firefox |中 |文 |俗 |称 |火 |狐 |是 |一 |个 |自 |由 |及 |开 |放 |源 |代 |码 |的 |网 |页 |浏 |览 |器 |呀 |
Current analyzer: class org.apache.lucene.analysis.fr.FrenchAnalyzer
mozill |firefox |中 |文 |俗 |称 |火 |狐 |是 |一 |个 |自 |由 |及 |开 |放 |源 |代 |码 |的 |网 |页 |浏 |览 |器 |it |呀 |
Current analyzer: class org.apache.lucene.analysis.fr.FrenchAnalyzer
mozill |firefox |中 |文 |俗 |称 |火 |狐 |是 |一 |个 |自 |由 |及 |开 |放 |源 |代 |码 |的 |网 |页 |浏 |览 |器 |how |are |you |呀 |
Current analyzer: class org.apache.lucene.analysis.ru.RussianAnalyzer
mozilla |firefox |中文俗称 |火狐 |是一个自由及开放源代码的网页浏览器it呀 |
Current analyzer: class org.apache.lucene.analysis.cjk.CJKAnalyzer
mozilla |firefox |中文 |文俗 |俗称 |火狐 |是一 |一个 |个自 |自由 |由及 |及开 |开放 |放源 |源代 |代码 |码的 |的网 |网页 |页浏 |浏览 |览器 |呀 |
Current analyzer: class org.apache.lucene.analysis.cn.ChineseAnalyzer
mozilla |firefox |中 |文 |俗 |称 |火 |狐 |是 |一 |个 |自 |由 |及 |开 |放 |源 |代 |码 |的 |网 |页 |浏 |览 |器 |呀 |
------------------------IKAnalyzer
Current analyzer: class org.wltea.analyzer.lucene.IKAnalyzer
mozilla |firefox |中文 |俗称 |火狐 |一个 |一 |个 |自由 |开放源代码 |开放 |源代码 |代码 |网页 |浏览器 |浏览 |呀 |
Current analyzer: class org.wltea.analyzer.lucene.IKAnalyzer
上海 |自来水 |自来 |来自 |海上 |
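
The CJKAnalyzer line in the output above (中文 |文俗 |俗称 ...) is consistent with splitting each run of Chinese characters into overlapping two-character terms. A minimal plain-Java sketch of that idea follows. The class BigramSketch is hypothetical and is not Lucene's implementation; it only reproduces the pattern visible in the output.

```java
import java.util.ArrayList;
import java.util.List;

public class BigramSketch {
    /** Returns every pair of adjacent characters in a run of CJK text. */
    public static List<String> bigrams(String run) {
        List<String> terms = new ArrayList<>();
        if (run.length() == 1) {
            // A lone character is kept as a term of its own.
            terms.add(run);
            return terms;
        }
        // Slide a two-character window over the run, one character at a time.
        for (int i = 0; i + 1 < run.length(); i++) {
            terms.add(run.substring(i, i + 2));
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [中文, 文俗, 俗称]
        System.out.println(bigrams("中文俗称"));
    }
}
```

Overlapping pairs need no dictionary, but they also produce terms that are not real words, such as 文俗. The dictionary-based IKAnalyzer output avoids these.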

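5. Using the third-party IKAnalyzer (the first choice for Chinese)

Requirement: filter out "说", "的" and "呀" from the examples above, and treat "传智播客" as a single keyword

(1) Import the IKAnalyzer core jar, IKAnalyzer3.2.0Stable.jar

(2) Copy IKAnalyzer.cfg.xml, ext_stopword.dic and mydict.dic into the src directory of the MyEclipse project, then edit them. When editing the dictionary files, leave the first line blank.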

TestIK.java

package com.rk.lucene.c_ik;

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class TestIK {

	private static void testAnalyzer(Analyzer analyzer, String text) throws Exception{
		System.out.println("Current analyzer: " + analyzer.getClass());
		TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(text));
		// addAttribute returns the attribute instance, so fetch it once before the loop
		TermAttribute termAttribute = tokenStream.addAttribute(TermAttribute.class);
		while(tokenStream.incrementToken()){
			// After each incrementToken() call the attribute holds the current term
			System.out.print(termAttribute.term() + " |");
		}
		System.out.println();
	}

	public static void main(String[] args) throws Exception {
		testAnalyzer(new IKAnalyzer(), "Mozilla Firefox，中文俗称“火狐”，是一个自由及开放源代码的网页浏览器it呀");
		testAnalyzer(new IKAnalyzer(), "上海自来水来自海上");
		testAnalyzer(new IKAnalyzer(), "传智播客教育科技有限公司是一家专门致力于高素质软件开发人才培养的IT公司");
		testAnalyzer(new IKAnalyzer(), "how are you");
	}
}

Output:

Current analyzer: class org.wltea.analyzer.lucene.IKAnalyzer
mozilla |firefox |中文 |俗称 |火狐 |一个 |一 |个 |自由 |开放源代码 |开放 |源代码 |代码 |网页 |浏览器 |浏览 |呀 |
Current analyzer: class org.wltea.analyzer.lucene.IKAnalyzer
上海 |自来水 |自来 |来自 |海上 |
Current analyzer: class org.wltea.analyzer.lucene.IKAnalyzer
传智播客 |教育科 |教育 |科技 |有限公司 |有限 |公司 |一家 |一 |家专 |家 |专门 |致力于 |致力 |高素质 |素质 |软件开发 |软件 |开发 |发人 |人才培养 |人才 |培养 |公司 |
Current analyzer: class org.wltea.analyzer.lucene.IKAnalyzer
how |you |

IKAnalyzer.cfg.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">  
<properties>  
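	<comment>IK Analyzer extension configuration</comment>
	<!-- Users can configure their own extension dictionaries here -->
	<entry key="ext_dict">/mydict.dic;</entry> 

	<!-- Users can configure their own extension stop-word dictionaries here -->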
	<entry key="ext_stopwords">/ext_stopword.dic</entry> 
</properties>
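
The configuration file above uses the standard Java XML properties format (note the properties.dtd DOCTYPE), so its entries can be inspected with java.util.Properties.loadFromXML. The sketch below is not IK's own loading code. The class IkConfigSketch is hypothetical, and splitting the value on ";" is an assumption based on the trailing semicolon in the ext_dict entry.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class IkConfigSketch {
    // An inline copy of the configuration, so the sketch runs without any files.
    static final String SAMPLE =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
            + "<properties>\n"
            + "  <comment>IK Analyzer extension configuration</comment>\n"
            + "  <entry key=\"ext_dict\">/mydict.dic;</entry>\n"
            + "  <entry key=\"ext_stopwords\">/ext_stopword.dic</entry>\n"
            + "</properties>\n";

    /** Parses an XML properties document held in a string. */
    public static Properties load(String xml) {
        Properties props = new Properties();
        try (InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))) {
            props.loadFromXML(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Splits a semicolon-separated entry into its non-empty paths. */
    public static List<String> paths(Properties props, String key) {
        List<String> result = new ArrayList<>();
        String value = props.getProperty(key);
        if (value == null) {
            return result;
        }
        for (String path : value.split(";")) {
            String trimmed = path.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Properties props = load(SAMPLE);
        System.out.println(paths(props, "ext_dict"));      // prints [/mydict.dic]
        System.out.println(paths(props, "ext_stopwords")); // prints [/ext_stopword.dic]
    }
}
```

In a real project the file sits on the classpath, and an input stream for it could be obtained with IkConfigSketch.class.getResourceAsStream("/IKAnalyzer.cfg.xml") in place of the inline string.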

mydict.dic

传智播客

ext_stopword.dic

也
了
仍
从
以
使
则
却
又
及
对
就
并
很
或
把
是
的
着
给
而
被
让
在
还
比
等
当
与
于
但
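
To see why an extension dictionary entry and a stop-word list change the result, here is a minimal plain-Java sketch of dictionary-based segmentation. It uses forward maximum matching, which is a simpler technique than the one IKAnalyzer actually uses (the IK output above contains overlapping terms, which this sketch never emits). The class DictSegmentSketch and the tiny dictionaries are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DictSegmentSketch {
    /** Forward maximum matching: at each position take the longest dictionary word. */
    public static List<String> segment(String text, Set<String> dict, Set<String> stopWords) {
        int maxLen = 1;
        for (String word : dict) {
            maxLen = Math.max(maxLen, word.length());
        }
        List<String> terms = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxLen);
            // Shrink the candidate until it is a dictionary word or a single character.
            while (end > i + 1 && !dict.contains(text.substring(i, end))) {
                end--;
            }
            String term = text.substring(i, end);
            // Stop words are recognized but never emitted.
            if (!stopWords.contains(term)) {
                terms.add(term);
            }
            i = end;
        }
        return terms;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("的"));
        Set<String> base = new HashSet<>(Arrays.asList("教育", "公司"));
        Set<String> extended = new HashSet<>(base);
        extended.add("传智播客"); // plays the role of the mydict.dic entry

        // prints [传, 智, 播, 客, 教育, 公司]
        System.out.println(segment("传智播客的教育公司", base, stopWords));
        // prints [传智播客, 教育, 公司]
        System.out.println(segment("传智播客的教育公司", extended, stopWords));
    }
}
```

Without the extra entry the name falls apart into single characters. With it, the name survives as one term, and in both runs the stop word 的 is dropped.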

Time: 2024-12-19 03:20:56
