昨天了解了suggest包中的spell相关的内容,主要是拼写检查和相似度查询提示;
今天准备了解下关于联想词的内容,lucene的联想词是在org.apache.lucene.search.suggest包下边,提供了自动补全或者联想提示功能的支持;
InputIterator说明
InputIterator是一个支持枚举term,weight,payload三元组的供suggester使用的接口,目前仅支持AnalyzingSuggester,FuzzySuggester
andAnalyzingInfixSuggester
三种suggester支持payloads;
InputIterator的实现类有以下几种:
BufferedInputIterator:对二进制类型的输入进行轮询;
DocumentInputIterator:从索引中被store的field中轮询;
FileIterator:从文件中每次读出单行的数据轮询,以\t进行间隔(且\t的个数最多为2个);
HighFrequencyIterator:从索引中被store的field轮询,忽略长度小于设定值的文本;
InputIteratorWrapper:遍历BytesRefIterator并且返回的内容不包含payload且weight均为1;
SortedInputIterator:二进制类型的输入轮询且按照指定的comparator算法进行排序;
InputIterator提供的方法如下:
weight():此方法设置某个term的权重,设置的越高suggest的优先级越高;
payload():每个suggestion对应的元数据的二进制表示,我们在传输对象的时候需要转换对象或对象的某个属性为BytesRef类型,相应的suggester调用lookup的时候会返回payloads信息;
hasPayload():判断iterator是否有payloads;
contexts():获取某个term的contexts,用来过滤suggest的内容,如果suggest的列表为空,返回null
hasContexts():获取iterator是否有contexts;
Suggester查询工具Lookup类说明
此类提供了字符串的联想查询功能
Lookup类提供了一个CharSequenceComparator,此comparator主要是用来对CharSequence进行排序,按字符顺序排序;
内置LookupResult,用于返回suggest的结果,同时也是按照CharSequenceComparator进行key的排序;
内置了LookupPriorityQueue,用以存储LookupResult;
LookUp提供的方法
build(Dictionary dict) : 从指定directory进行build;
load(InputStream input) : 将InputStream转成DataInput并执行load(DataInput)方法;
store(OutputStream output) : 将OutputStream转成DataOutput并执行store(DataOutput)方法;
getCount() : 获取lookup的build的项的数量;
build(InputIterator inputIterator) : 根据指定的InputIterator构建Lookup对象;
lookup(CharSequence key, boolean onlyMorePopular, int num) :根据key查询可能的结果返回值为List<LookupResult>;
Lookup的相关实现如下:
编写自己的suggest模块
注意:在suggest的时候我们需要导入lucene-misc-5.1.0.jar否则系统会提示类SortedMergePolicy没有找到;
首先我们定义自己的实体类:
package com.lucene.suggest; import java.io.Serializable; public class Product implements Serializable { private static final long serialVersionUID = 1L; private String name; private String image; private String[] regions; private int numberSold; public Product(String name, String image, String[] regions, int numberSold) { this.name = name; this.image = image; this.regions = regions; this.numberSold = numberSold; } public String getName() { return name; } public void setName(String name) { this.name = name; } public String getImage() { return image; } public void setImage(String image) { this.image = image; } public String[] getRegions() { return regions; } public void setRegions(String[] regions) { this.regions = regions; } public int getNumberSold() { return numberSold; } public void setNumberSold(int numberSold) { this.numberSold = numberSold; } }
然后定义InputIterator这里定义消费者是List<Object>,并对list进行遍历放入payload中:
package com.lucene.suggest; import java.io.ByteArrayOutputStream; import java.io.IOException; import java.io.ObjectOutputStream; import java.io.UnsupportedEncodingException; import java.util.Comparator; import java.util.HashSet; import java.util.Iterator; import java.util.Set; import org.apache.lucene.search.suggest.InputIterator; import org.apache.lucene.util.BytesRef; public class ProductIterator implements InputIterator { private Iterator<Product> productIterator; private Product currentProduct; ProductIterator(Iterator<Product> productIterator) { this.productIterator = productIterator; } public boolean hasContexts() { return true; } /** * 是否有设置payload信息 */ public boolean hasPayloads() { return true; } public Comparator<BytesRef> getComparator() { return null; } public BytesRef next() { if (productIterator.hasNext()) { currentProduct = productIterator.next(); try { return new BytesRef(currentProduct.getName().getBytes("UTF8")); } catch (UnsupportedEncodingException e) { throw new RuntimeException("Couldn't convert to UTF-8",e); } } else { return null; } } public BytesRef payload() { try { ByteArrayOutputStream bos = new ByteArrayOutputStream(); ObjectOutputStream out = new ObjectOutputStream(bos); out.writeObject(currentProduct); out.close(); return new BytesRef(bos.toByteArray()); } catch (IOException e) { throw new RuntimeException("Well that's unfortunate."); } } public Set<BytesRef> contexts() { try { Set<BytesRef> regions = new HashSet<BytesRef>(); for (String region : currentProduct.getRegions()) { regions.add(new BytesRef(region.getBytes("UTF8"))); } return regions; } catch (UnsupportedEncodingException e) { throw new RuntimeException("Couldn't convert to UTF-8"); } } public long weight() { return currentProduct.getNumberSold(); } }
编写测试类
package com.lucene.suggest; import java.io.ByteArrayInputStream; import java.io.IOException; import java.io.ObjectInputStream; import java.nio.file.Paths; import java.util.ArrayList; import java.util.HashSet; import java.util.List; import org.apache.lucene.analysis.standard.StandardAnalyzer; import org.apache.lucene.search.suggest.Lookup.LookupResult; import org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester; import org.apache.lucene.store.Directory; import org.apache.lucene.store.FSDirectory; import org.apache.lucene.util.BytesRef; public class SuggestProducts { private static void lookup(AnalyzingInfixSuggester suggester, String name, String region) throws IOException { HashSet<BytesRef> contexts = new HashSet<BytesRef>(); contexts.add(new BytesRef(region.getBytes("UTF8"))); List<LookupResult> results = suggester.lookup(name, contexts, 2, true, false); System.out.println("-- \"" + name + "\" (" + region + "):"); for (LookupResult result : results) { System.out.println(result.key); BytesRef bytesRef = result.payload; ObjectInputStream is = new ObjectInputStream(new ByteArrayInputStream(bytesRef.bytes)); Product product = null; try { product = (Product)is.readObject(); } catch (ClassNotFoundException e) { // TODO Auto-generated catch block e.printStackTrace(); } System.out.println("product-Name:" + product.getName()); System.out.println("product-regions:" + product.getRegions()); System.out.println("product-image:" + product.getImage()); System.out.println("product-numberSold:" + product.getNumberSold()); } System.out.println(); } public static void main(String[] args) { try { Directory indexDir = FSDirectory.open(Paths.get("suggestPath", new String[0])); StandardAnalyzer analyzer = new StandardAnalyzer(); AnalyzingInfixSuggester suggester = new AnalyzingInfixSuggester(indexDir, analyzer); ArrayList<Product> products = new ArrayList<Product>(); products.add(new Product("Electric Guitar", "http://images.example/electric-guitar.jpg", new String[] { "US", "CA" }, 100)); products.add(new Product("Electric Train", "http://images.example/train.jpg", new String[] { "US", "CA" }, 100)); products.add(new Product("Acoustic Guitar", "http://images.example/acoustic-guitar.jpg", new String[] { "US", "ZA" }, 80)); products.add(new Product("Guarana Soda", "http://images.example/soda.jpg", new String[] { "ZA", "IE" }, 130)); suggester.build(new ProductIterator(products.iterator())); lookup(suggester, "Gu", "US"); lookup(suggester, "Gu", "ZA"); lookup(suggester, "Gui", "CA"); lookup(suggester, "Electric guit", "US"); suggester.refresh(); } catch (IOException e) { System.err.println("Error!"); } } }
相关代码会在明天放出
一步一步跟我学习lucene是对近期做lucene索引的总结,大家有问题的话联系本人的Q-Q: 891922381,同时本人新建Q-Q群:106570134(lucene,solr,netty,hadoop),如蒙加入,不胜感激,大家共同探讨,本人争取每日一博,希望大家持续关注,会带给大家惊喜的