1.6.7 Detecting Languages During Indexing

1. Detecting Languages During Indexing

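At index time, Solr can use the langid UpdateRequestProcessor to identify the language of a document and then map its text to language-specific fields. Solr supports two implementations of this feature: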

Tika's language detection feature: http://tika.apache.org/0.10/detection.html
LangDetect language detection: http://code.google.com/p/language-detection/

A comparison of the two is available at http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html. In general, LangDetect supports more languages and offers higher performance.

See http://wiki.apache.org/solr/LanguageDetection for more information about the langid UpdateRequestProcessor.

1.1 Configuring Language Detection

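You can configure the langid UpdateRequestProcessor in solrconfig.xml. Both implementations take the same parameters. At a minimum, you must specify the fields to use for language identification and the field that receives the resulting language code.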
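As a rough sketch of where such a processor sits, the fragment below wraps the Tika implementation in an update chain and makes that chain the default for the /update handler. The chain name langid and the handler wiring are assumptions made for illustration; adapt them to your own solrconfig.xml.

```xml
<!-- Hypothetical chain name "langid"; any name works as long as the handler refers to it. -->
<updateRequestProcessorChain name="langid">
    <processor
        class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
        <lst name="defaults">
            <str name="langid.fl">title,subject,text,keywords</str>
            <str name="langid.langField">language_s</str>
        </lst>
    </processor>
    <!-- Language detection runs first; the last processor performs the actual indexing. -->
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

<!-- Assumed wiring: send every /update request through the chain above. -->
<requestHandler name="/update" class="solr.UpdateRequestHandler">
    <lst name="defaults">
        <str name="update.chain">langid</str>
    </lst>
</requestHandler>
```

Alternatively, the chain can be selected per request by passing update.chain=langid as a request parameter instead of setting it as a handler default.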

1.2 Configuring Tika Language Detection

Here is a minimal Tika langid UpdateRequestProcessor configuration in solrconfig.xml.

<processor
    class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
        <str name="langid.fl">title,subject,text,keywords</str>
        <str name="langid.langField">language_s</str>
    </lst>
</processor>

1.3 Configuring LangDetect Language Detection

Here is a minimal LangDetect langid configuration in solrconfig.xml.

<processor
    class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
        <str name="langid.fl">title,subject,text,keywords</str>
        <str name="langid.langField">language_s</str>
    </lst>
</processor>

1.4 langid Parameters

As mentioned above, both implementations of the langid UpdateRequestProcessor take the same parameters:

Parameter	Type	Default	Required	Description
langid	Boolean	true	no	Enables or disables language detection.
langid.fl	string	none	yes	A comma- or space-delimited list of fields to use for language detection.
langid.langField	string	none	yes	Specifies the field for the returned language code.
langid.langsField	multivalued string	none	no	Specifies the field for a list of returned language codes. If you use langid.map.individual, each detected language is added to this field.
langid.overwrite	Boolean	false	no	Specifies whether the content of the langField and langsField fields is overwritten if they already contain values.
langid.lcmap	string	none	no	A space-separated list of colon-delimited language code mappings to apply to the detected languages. For example, you might use this to map Chinese, Japanese, and Korean to a common cjk code, and map both American and British English to a single en code, by using langid.lcmap=ja:cjk zh:cjk ko:cjk en_GB:en en_US:en. This affects the values put into both the langField and langsField fields.
langid.threshold	float	0.5	no	Specifies a threshold value between 0 and 1 that the language identification score must reach before langid accepts it. With longer text fields, a high threshold such as 0.8 will give good results. For shorter text fields, you may need to lower the threshold for language identification, though you will be risking somewhat lower quality results. We recommend experimenting with your data to tune your results.
langid.whitelist	string	none	no	Specifies a list of allowed language identification codes. Use this in combination with langid.map to ensure that you only index documents into fields that are in your schema.
langid.map	Boolean	false	no	Enables field name mapping. If true, Solr will map field names for all fields listed in langid.fl.
langid.map.fl	string	none	no	A comma-separated list of fields for langid.map that is different than the fields specified in langid.fl.
langid.map.keepOrig	Boolean	false	no	If true, Solr will copy the field during the field name mapping process, leaving the original field in place.
langid.map.individual	Boolean	false	no	If true, Solr will detect and map languages for each field individually.
langid.map.individual.fl	string	none	no	A comma-separated list of fields for use with langid.map.individual that is different than the fields specified in langid.fl.
langid.fallbackFields	string	none	no	If no language is detected that meets the langid.threshold score, or if the detected language is not on the langid.whitelist, this field specifies language codes to be used as fallback values. If no appropriate fallback languages are found, Solr will use the language code specified in langid.fallback.
langid.fallback	string	none	no	Specifies a language code to use if no language is detected or specified in langid.fallbackFields.
langid.map.lcmap	string	determined by langid.lcmap	no	A space-separated list of colon-delimited language code mappings to use when mapping field names. For example, you might use this to make Chinese, Japanese, and Korean language fields use a common _cjk suffix, and map both American and British English fields to a single _en suffix, by using langid.map.lcmap=ja:cjk zh:cjk ko:cjk en_GB:en en_US:en.
langid.map.pattern	Java regular expression	none	no	By default, fields are mapped as <field>_<language>. To change this pattern, you can specify a Java regular expression in this parameter.
langid.map.replace	Java replace	none	no	By default, fields are mapped as <field>_<language>. To change this pattern, you can specify a Java replace in this parameter.
langid.enforceSchema	Boolean	true	no	If false, the langid processor does not validate field names against your schema. This may be useful if you plan to rename or delete fields later in the UpdateChain.
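
To illustrate how several of these parameters combine, here is a sketch of a LangDetect configuration that enables field name mapping. The field names, the language list, and the threshold are assumptions chosen for the example, not recommended values.

```xml
<processor
    class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
        <!-- Fields used for detection, and the field that receives the language code. -->
        <str name="langid.fl">title,text</str>
        <str name="langid.langField">language_s</str>
        <!-- Require a fairly confident detection (assumes reasonably long text fields). -->
        <str name="langid.threshold">0.8</str>
        <!-- Accept only languages that have matching fields in the (assumed) schema. -->
        <str name="langid.whitelist">en,de,fr</str>
        <!-- Used when nothing passes the threshold or the whitelist. -->
        <str name="langid.fallback">en</str>
        <!-- Map title and text to language-suffixed field names, keeping the originals. -->
        <str name="langid.map">true</str>
        <str name="langid.map.keepOrig">true</str>
    </lst>
</processor>
```

With this sketch, a document detected as German would be expected to end up with language_s set to de and with title_de and text_de alongside the original title and text. Because langid.enforceSchema defaults to true, the schema would presumably need to define those target fields, for example through dynamic fields such as *_de.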

Time: 2024-10-12 05:31:02
