[solr] - spell check

solr提供了一个spell check,又叫suggestions,可以用于查询输入的自动完成功能auto-complete。

参考文献:

https://cwiki.apache.org/confluence/display/solr/Spell+Checking

http://www.cnblogs.com/ibook360/archive/2011/11/30/2269077.html

方法:



修改core的solrconfig.xml

加入这段到<config />内

    <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
      <lst name="spellchecker">
        <str name="name">wordbreak</str>
        <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
        <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
        <str name="field">content</str>
        <str name="combineWords">true</str>
        <str name="breakWords">true</str>
        <int name="maxChanges">10</int>
      </lst>
    </searchComponent>
    <requestHandler name="/spellcheck" class="org.apache.solr.handler.component.SearchHandler">
      <lst name="defaults">
        <str name="spellcheck">true</str>
        <str name="spellcheck.dictionary">wordbreak</str>
        <str name="spellcheck.count">20</str>
      </lst>
      <arr name="last-components">
        <str>spellcheck</str>
      </arr>
    </requestHandler>

schema.xml配置:

<?xml version="1.0" ?>
<schema name="my core" version="1.1">

    <fieldtype name="string"  class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0"/>
    <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
    <fieldtype name="binary" class="solr.BinaryField"/>
    <fieldType name="text_cn" class="solr.TextField">
        <analyzer type="index" class="org.wltea.analyzer.lucene.IKAnalyzer" />
        <analyzer type="query" class="org.wltea.analyzer.lucene.IKAnalyzer" />
        <analyzer>
            <tokenizer class="solr.KeywordTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldType>

    <!-- general -->
    <field name="id" type="long" indexed="true" stored="true" multiValued="false" required="true"/>
    <field name="subject" type="text_cn" indexed="true" stored="true" />
    <field name="content" type="text_cn" indexed="true" stored="true" />
    <field name="category_id" type="long" indexed="true" stored="true" />
    <field name="category_name" type="text_cn" indexed="true" stored="true" />
    <field name="last_update_time" type="tdate" indexed="true" stored="true" />
    <field name="_version_" type="long" indexed="true" stored="true"/>

     <!-- field to use to determine and enforce document uniqueness. -->
     <uniqueKey>id</uniqueKey>

     <!-- field for the QueryParser to use when an explicit fieldname is absent -->
     <defaultSearchField>subject</defaultSearchField>

     <!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
     <solrQueryParser defaultOperator="OR"/>
</schema>

关键在于这句:

        <analyzer>
            <tokenizer class="solr.KeywordTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>

意思是词组搜索

设置完xml,重启tomcat,在浏览器中运行:

http://localhost:8899/solr/mycore/spellcheck?spellcheck.build=true

运行结果:

然后在浏览器中运行:

http://localhost:8899/solr/mycore/spellcheck?q=中央&rows=0

运行结果:

Java代码:



Java bean:

package com.my.entity;

import java.util.Date;

import org.apache.solr.client.solrj.beans.Field;

public class Item {
    @Field
    private long id;
    @Field
    private String subject;
    @Field
    private String content;
    @Field("category_id")
    private long categoryId;
    @Field("category_name")
    private String categoryName;
    @Field("last_update_time")
    private Date lastUpdateTime;

    public long getId() {
        return id;
    }
    public void setId(long id) {
        this.id = id;
    }
    public String getSubject() {
        return subject;
    }
    public void setSubject(String subject) {
        this.subject = subject;
    }
    public String getContent() {
        return content;
    }
    public void setContent(String content) {
        this.content = content;
    }
    public long getCategoryId() {
        return categoryId;
    }
    public void setCategoryId(long categoryId) {
        this.categoryId = categoryId;
    }
    public String getCategoryName() {
        return categoryName;
    }
    public void setCategoryName(String categoryName) {
        this.categoryName = categoryName;
    }
    public Date getLastUpdateTime() {
        return lastUpdateTime;
    }
    public void setLastUpdateTime(Date lastUpdateTime) {
        this.lastUpdateTime = lastUpdateTime;
    }
}

测试代码:

package com.my.solr;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Map;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.impl.XMLResponseParser;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.SpellCheckResponse;
import org.apache.solr.client.solrj.response.SpellCheckResponse.Collation;
import org.apache.solr.client.solrj.response.SpellCheckResponse.Correction;
import org.apache.solr.client.solrj.response.SpellCheckResponse.Suggestion;

import com.my.entity.Item;

public class TestSolr {

    public static void main(String[] args) throws IOException, SolrServerException {
        String url = "http://localhost:8899/solr/mycore";
        HttpSolrServer core = new HttpSolrServer(url);
        core.setMaxRetries(1);
        core.setConnectionTimeout(5000);
        core.setParser(new XMLResponseParser()); // binary parser is used by default
        core.setSoTimeout(1000); // socket read timeout
        core.setDefaultMaxConnectionsPerHost(100);
        core.setMaxTotalConnections(100);
        core.setFollowRedirects(false); // defaults to false
        core.setAllowCompression(true);

        // ------------------------------------------------------
        // remove all data
        // ------------------------------------------------------
        core.deleteByQuery("*:*");
        List<Item> items = new ArrayList<Item>();
        items.add(makeItem(1, "cpu", "this is intel cpu", 1, "cpu-intel"));
        items.add(makeItem(2, "cpu AMD", "this is AMD cpu", 2, "cpu-AMD"));
        items.add(makeItem(3, "cpu intel", "this is intel-I7 cpu", 1, "cpu-intel"));
        items.add(makeItem(4, "cpu AMD", "this is AMD 5000x cpu", 2, "cpu-AMD"));
        items.add(makeItem(5, "cpu intel I6", "this is intel-I6 cpu", 1, "cpu-intel-I6"));
        items.add(makeItem(6, "处理器", "中央处理器英特儿", 1, "cpu-intel"));
        items.add(makeItem(7, "处理器AMD", "中央处理器AMD", 2, "cpu-AMD"));
        items.add(makeItem(8, "中央处理器", "中央处理器Intel", 1, "cpu-intel"));
        items.add(makeItem(9, "中央空调格力", "格力中央空调", 3, "air"));
        items.add(makeItem(10, "中央空调海尔", "海尔中央空调", 3, "air"));
        items.add(makeItem(11, "中央空调美的", "美的中央空调", 3, "air"));
        core.addBeans(items);
        // commit
        core.commit();

        // ------------------------------------------------------
        // search
        // ------------------------------------------------------
        SolrQuery query = new SolrQuery();
        String token = "中央";
        query.set("qt", "/spellcheck");
        query.set("q", token);
        query.set("spellcheck", "on");
        query.set("spellcheck.build", "true");
        query.set("spellcheck.onlyMorePopular", "true");

        query.set("spellcheck.count", "100");
        query.set("spellcheck.alternativeTermCount", "4");
        query.set("spellcheck.onlyMorePopular", "true");

        query.set("spellcheck.extendedResults", "true");
        query.set("spellcheck.maxResultsForSuggest", "5");

        query.set("spellcheck.collate", "true");
        query.set("spellcheck.collateExtendedResults", "true");
        query.set("spellcheck.maxCollationTries", "5");
        query.set("spellcheck.maxCollations", "3");

        QueryResponse response = null;

        try {
            response = core.query(query);
            System.out.println("查询耗时:" + response.getQTime());
        } catch (SolrServerException e) {
            System.err.println(e.getMessage());
            e.printStackTrace();
        } catch (Exception e) {
            System.err.println(e.getMessage());
            e.printStackTrace();
        } finally {
            core.shutdown();
        }

        SpellCheckResponse spellCheckResponse = response.getSpellCheckResponse();
        if (spellCheckResponse != null) {
            List<Suggestion> suggestionList = spellCheckResponse.getSuggestions();
            for (Suggestion suggestion : suggestionList) {
                System.out.println("Suggestions NumFound: " + suggestion.getNumFound());
                System.out.println("Token: " + suggestion.getToken());
                System.out.print("Suggested: ");
                List<String> suggestedWordList = suggestion.getAlternatives();
                for (String word : suggestedWordList) {
                    System.out.println(word + ", ");
                }
                System.out.println();
            }
            System.out.println();
            Map<String, Suggestion> suggestedMap = spellCheckResponse.getSuggestionMap();
            for (Map.Entry<String, Suggestion> entry : suggestedMap.entrySet()) {
                System.out.println("suggestionName: " + entry.getKey());
                Suggestion suggestion = entry.getValue();
                System.out.println("NumFound: " + suggestion.getNumFound());
                System.out.println("Token: " + suggestion.getToken());
                System.out.print("suggested: ");

                List<String> suggestedList = suggestion.getAlternatives();
                for (String suggestedWord : suggestedList) {
                    System.out.print(suggestedWord + ", ");
                }
                System.out.println("\n\n");
            }

            Suggestion suggestion = spellCheckResponse.getSuggestion(token);
            System.out.println("NumFound: " + suggestion.getNumFound());
            System.out.println("Token: " + suggestion.getToken());
            System.out.print("suggested: ");
            List<String> suggestedList = suggestion.getAlternatives();
            for (String suggestedWord : suggestedList) {
                System.out.print(suggestedWord + ", ");
            }
            System.out.println("\n\n");

            System.out.println("The First suggested word for solr is : " + spellCheckResponse.getFirstSuggestion(token));
            System.out.println("\n\n");

            List<Collation> collatedList = spellCheckResponse.getCollatedResults();
            if (collatedList != null) {
                for (Collation collation : collatedList) {
                    System.out.println("collated query String: " + collation.getCollationQueryString());
                    System.out.println("collation Num: " + collation.getNumberOfHits());
                    List<Correction> correctionList = collation.getMisspellingsAndCorrections();
                    for (Correction correction : correctionList) {
                        System.out.println("original: " + correction.getOriginal());
                        System.out.println("correction: " + correction.getCorrection());
                    }
                    System.out.println();
                }
            }
            System.out.println();
            System.out.println("The Collated word: " + spellCheckResponse.getCollatedResult());
            System.out.println();
        }

        System.out.println("查询耗时:" + response.getQTime());
    }

    private static Item makeItem(long id, String subject, String content, long categoryId, String categoryName) {
        Item item = new Item();
        item.setId(id);
        item.setSubject(subject);
        item.setContent(content);
        item.setLastUpdateTime(new Date());
        item.setCategoryId(categoryId);
        item.setCategoryName(categoryName);
        return item;
    }
}

测试结果:

这种方式可以使用于对现在数据内容的查询拼写检查。

时间: 2025-01-07 03:40:32

[solr] - spell check的相关文章

POJ 1035 Spell Check 字符串处理

被这样的题目忽悠了,一开始以为使用Trie会大大加速程序的,没想到,一不小心居然使用Trie会超时. 最后反复试验,加点优化,终于使用Trie是可以过的,不过时间大概难高于1500ms,一不小心就会超时. 看来这是一道专门卡Trie的题目,只好放弃不使用Trie了. 也得出点经验,如果字符串很多,如本题有1万个字符串的,那么还是不要使用Trie吧,否则遍历一次这样的Trie是十分耗时的,2s以上. 于是使用暴力搜索法了,这里的技巧是可以使用优化技术: 1 剪枝:这里直接分组搜索,分组是按照字符串

solr拼写检查配置

拼写检查功能,能在搜索时,提供一个较好用户体验,所以,主流的搜索引擎都有这个功能. 那么什么是拼写检查,其实很好理解,就是你输入的搜索词,可能是你输错了,也有可能在它的检索库里面根本不存在这个词,但是这时候它能给你返回,相似或相近的结果来帮助你校正.举个例子,假如你在百度里面输入在在线电瓶,可能它的索引库里面就没有,但是它有可能返回在线电影,在线电视,在线观看等等一些词,这些,就用到拼写检查的功能了. solr是一个基于lucene开发接口实现的成熟的搜索系统,通过不同的控件(Component

Importing/Indexing database (MySQL or SQL Server) in Solr using Data Import Handler--转载

原文地址:https://gist.github.com/maxivak/3e3ee1fca32f3949f052 Install Solr download and install Solr from http://lucene.apache.org/solr/. you can access Solr admin from your browser: http://localhost:8983/solr/ use the port number used in installation. M

solr入门之solr的拼写检查功能的应用级别尝试

今天主要是收集了些拼写检查方面的资料和 尝试使用一下拼写检查的功能--=遇到了不少问题 拼写检查的四种配置目前我只算是成功了半个吧 --------------------------------- 拼写检查功能,能在搜索时,提供一个较好用户体验,所以,主流的搜索引擎都有这个功能.在这之前,笔者先简单的说一下什么是拼写检查,其实很好理解,就是你输入的搜索词,可能是你输错了,也有可能在它的检索库里面根本不存在这个词,但是这时候它能给你返回,相似或相近的结果来帮助你校正. 举个例子,假如你在百度里面

集成Nutch/Hbase/Solr构建搜索引擎

1.下载相关软件,并解压 版本号如下: (1)apache-nutch-2.2.1 (2) hbase-0.90.4 (3)solr-4.9.0 并解压至/usr/search 2.Nutch的配置 (1)vi /usr/search/apache-nutch-2.2.1/conf/nutch-site.xml <property> <name>storage.data.store.class</name> <value>org.apache.gora.hb

solr4.3 solrconfig.xml配置文件

<?xml version="1.0" encoding="UTF-8" ?> <config>    <!--表示solr底层使用的是lucene版本-->   <luceneMatchVersion>LUCENE_43</luceneMatchVersion>      <!-- 表示solr引用包的位置,当dir对应的目录不存在时候,会忽略此属性-->   <lib dir=&quo

MarkDown案例

Welcome to MarkdownPad 2 MarkdownPad is a full-featured Markdown editor for Windows. Built exclusively for Markdown Enjoy first-class Markdown support with easy access to Markdown syntax and convenient keyboard shortcuts. Give them a try: Bold (Ctrl+

计算机专业英语 (五)

第一部分:基本单词 microprocessor n 微处理器 server n 服务器 brand n 商标,牌子 technique n 技术,技巧,方法 engine n 引擎,发动机 fabricate vt 制作,制造 chip n 芯片 subtract v 减 discrete adj 离散的,分离的,不连续的 transister n 晶体管 code n 代码,编码 v 编码 processor n 处理机,处理器 release n 版本,发布 v 发布,发行 micron

Go语言(golang)开源项目大全

转http://www.open-open.com/lib/view/open1396063913278.html内容目录Astronomy构建工具缓存云计算命令行选项解析器命令行工具压缩配置文件解析器控制台用户界面加密数据处理数据结构数据库和存储开发工具分布式/网格计算文档编辑器Encodings and Character SetsGamesGISGo ImplementationsGraphics and AudioGUIs and Widget ToolkitsHardwareLangu