Integrating Nutch, HBase, and Solr to Build a Search Engine

1. Download the required software and unpack it

The versions used are:

(1) apache-nutch-2.2.1

(2) hbase-0.90.4

(3) solr-4.9.0

Unpack all of them into /usr/search.
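
For reference, a download-and-unpack sketch. The Apache archive URLs and tarball names below are assumptions; substitute a current mirror if they have moved. Note that Nutch must be the source distribution, because it is built with ant in step 2.

mkdir -p /usr/search && cd /usr/search
wget http://archive.apache.org/dist/nutch/2.2.1/apache-nutch-2.2.1-src.tar.gz
wget http://archive.apache.org/dist/hbase/hbase-0.90.4/hbase-0.90.4.tar.gz
wget http://archive.apache.org/dist/lucene/solr/4.9.0/solr-4.9.0.tgz
tar -zxf apache-nutch-2.2.1-src.tar.gz    # unpacks to apache-nutch-2.2.1
tar -zxf hbase-0.90.4.tar.gz              # unpacks to hbase-0.90.4
tar -zxf solr-4.9.0.tgz                   # unpacks to solr-4.9.0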

2. Configuring Nutch

(1) vi /usr/search/apache-nutch-2.2.1/conf/nutch-site.xml and add the following property:

<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>

(2) vi /usr/search/apache-nutch-2.2.1/ivy/ivy.xml

By default this line is commented out; remove the comment markers so that it takes effect:

    <dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />

(3) vi /usr/search/apache-nutch-2.2.1/conf/gora.properties

Add the following line:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

The three steps above configure Nutch to use HBase for storage.
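
A quick way to double-check the three changes (paths as above):

# each grep should print the line touched in the corresponding step;
# for ivy.xml, confirm the gora-hbase line is no longer wrapped in a comment
grep -n "HBaseStore" /usr/search/apache-nutch-2.2.1/conf/nutch-site.xml
grep -n "gora-hbase" /usr/search/apache-nutch-2.2.1/ivy/ivy.xml
grep -n "gora.datastore.default" /usr/search/apache-nutch-2.2.1/conf/gora.properties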

The steps below are the ones actually required to build a basic Nutch.

(4) Build the runtime

cd /usr/search/apache-nutch-2.2.1/

ant runtime
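
If the build succeeds, ant creates the runtime directories used in the rest of this guide:

ls /usr/search/apache-nutch-2.2.1/runtime
# expected: deploy  local
#   local  - scripts and configuration for standalone (local) mode, used below
#   deploy - job jar and scripts for running on a Hadoop cluster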

(5) Verify the Nutch installation

cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/

./nutch

Usage: nutch COMMAND

where COMMAND is one of:

inject         inject new urls into the database

hostinject     creates or updates an existing host table from a text file

generate       generate new batches to fetch from crawl db

fetch          fetch URLs marked during generate

parse          parse URLs marked during fetch

updatedb       update web table after parsing

updatehostdb   update host table after parsing

readdb         read/dump records from page database

readhostdb     display entries from the hostDB

elasticindex   run the elasticsearch indexer

solrindex      run the solr indexer on parsed batches

solrdedup      remove duplicates from solr

parsechecker   check the parser for a given url

indexchecker   check the indexing filters for a given url

plugin         load a plugin and run one of its classes main()

nutchserver    run a (local) Nutch server on a user defined port

junit          runs the given JUnit test

or

CLASSNAME      run the class named CLASSNAME

Most commands print help when invoked w/o parameters.

(6) vi /usr/search/apache-nutch-2.2.1/runtime/local/conf/nutch-site.xml and add the agent name for the crawl job:

<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
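
The ant build copies conf/ into runtime/local/conf, so after this edit the file should hold both the storage class from step (1) and the agent name. A minimal sketch of the resulting nutch-site.xml:

<?xml version="1.0"?>
<configuration>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
</property>
<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
</configuration>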

(7) Create seed.txt

cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/

vi seed.txt

http://nutch.apache.org/

(8) Modify the URL filter

vi /usr/search/apache-nutch-2.2.1/conf/regex-urlfilter.txt

# accept anything else

+.

Change it to:

# accept anything else

+^http://([a-z0-9]*\.)*nutch.apache.org/
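
The filter uses Java regular expressions, but for a simple pattern like this grep -E behaves the same way, so the new rule can be sanity-checked from the shell (the test URLs are just examples):

# prints the URL, i.e. it would be accepted:
echo "http://nutch.apache.org/downloads.html" | grep -E "^http://([a-z0-9]*\.)*nutch.apache.org/"
# prints nothing, i.e. it would be rejected:
echo "http://www.example.com/" | grep -E "^http://([a-z0-9]*\.)*nutch.apache.org/"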

3. Configuring HBase

(1) vi /usr/search/hbase-0.90.4/conf/hbase-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hbase.rootdir</name>
<value><Your path></value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value><Your path></value>
</property>
</configuration>

Note: this step is optional. If you skip it, the defaults from hbase-default.xml (/usr/search/hbase-0.90.4/src/main/resources/hbase-default.xml) are used.

The default is:

  <property>
    <name>hbase.rootdir</name>
    <value>file:///tmp/hbase-${user.name}/hbase</value>
    <description>The directory shared by region servers and into
    which HBase persists.  The URL should be 'fully-qualified'
    to include the filesystem scheme.  For example, to specify the
    HDFS directory '/hbase' where the HDFS instance's namenode is
    running at namenode.example.org on port 9000, set this value to:
    hdfs://namenode.example.org:9000/hbase.  By default HBase writes
    into /tmp.  Change this configuration else all data will be lost
    on machine restart.
    </description>
  </property>

In other words, by default the data is written under /tmp, so it may be lost when the machine restarts.
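
For example, to keep the data under /usr/search instead of /tmp, the optional hbase-site.xml from step (1) could be filled in as follows (the two data paths are only an assumption; any local directory works):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hbase.rootdir</name>
<value>file:///usr/search/data/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/usr/search/data/zookeeper</value>
</property>
</configuration>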

4. Configuring Solr

(1) Overwrite Solr's schema.xml with the one that ships with Nutch.

cp /usr/search/apache-nutch-2.2.1/conf/schema.xml /usr/search/solr-4.9.0/example/solr/collection1/conf/

(2) With Solr 3.6 the configuration would be complete at this point, but Solr 4.9 needs the following additional changes.

Edit the schema.xml file that was just copied over:

Delete: <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />

Add: <field name="_version_" type="long" indexed="true" stored="true"/>
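
The new _version_ field sits alongside the existing <field> definitions inside the <fields> element of the copied schema.xml; a minimal sketch (the surrounding Nutch fields stay as they are):

<fields>
<!-- ... existing Nutch field definitions, unchanged ... -->
<field name="_version_" type="long" indexed="true" stored="true"/>
</fields>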

5. Starting the crawl

(1) Start HBase

cd /usr/search/hbase-0.90.4/bin/

./start-hbase.sh
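
An optional check that HBase came up (a sketch; run from the same bin directory):

jps | grep HMaster                 # the standalone HMaster process should be listed
echo "status" | ./hbase shell      # should report one running server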

(2) Start Solr

cd /usr/search/solr-4.9.0/example/

java -jar start.jar
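
An optional check that Solr is up (a sketch; run it from another terminal, since start.jar stays in the foreground):

curl "http://localhost:8983/solr/admin/cores?action=STATUS"
# the response should list the collection1 core that solrindex writes into later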

(3) Start Nutch and kick off the crawl

cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/

./crawl seed.txt TestCrawl http://localhost:8983/solr 2

That's it; the crawl job is now running.
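
The crawl script's arguments are the seed file (or directory), a crawl ID that identifies this crawl's data in HBase, the Solr URL to index into, and the number of crawl rounds. The script is just a wrapper around the jobs in the usage listing above; roughly the same crawl can be driven job by job, as in the sketch below (argument forms are from Nutch 2.x, so check ./nutch <command> for the exact options of your build):

./nutch inject seed.txt                               # load the seed URLs into the webpage table
for round in 1 2; do                                  # two rounds, matching the crawl command above
  ./nutch generate -topN 50                           # mark a batch of URLs for fetching (the -topN limit here is arbitrary)
  ./nutch fetch -all                                  # fetch everything marked
  ./nutch parse -all                                  # parse the fetched pages
  ./nutch updatedb                                    # fold new outlinks back into the webpage table
done
./nutch solrindex http://localhost:8983/solr -all     # index the parsed pages into Solr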

6. The basic Nutch crawl workflow

Crawling the web was covered above; you can add more URLs to the seed.txt file and crawl them the same way.

When a user invokes a crawling command in Apache Nutch 1.x, Nutch generates a CrawlDB, which is simply a directory containing details about the crawl. In Apache Nutch 2.x there is no CrawlDB; instead, Nutch keeps all crawling data directly in the database. In our case that database is Apache HBase, so all crawling data goes into HBase. The following describes how each crawling step works.

A crawling cycle has four steps, each implemented as a Hadoop MapReduce job:

- GeneratorJob

- FetcherJob

- ParserJob (optionally done while fetching, using 'fetch.parse')

- DbUpdaterJob

Additionally, the following processes need to be understood:

- InjectorJob

- Invertlinks

- Indexing with Apache Solr

First of all, the job of the Injector is to populate the initial rows of the web table: the InjectorJob takes the URLs we provide and inserts them into the crawl database.

The GeneratorJob then operates on these injected URLs. The table used for input and output by these jobs is called webpage, and every row in it is a URL (web page). The row key is the URL stored with its host components reversed, so that URLs from the same TLD and domain are kept together and form a group. In most NoSQL stores row keys are sorted, which gives an advantage: with row-key filtering, scanning a subset is faster than scanning the entire table. The following are examples of such row keys (a short sketch of the mapping follows the examples):

- org.apache.www:http/

- org.apache.gora:http/
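
A rough illustration of the reversed-host key format with a throwaway awk one-liner (this is not Nutch code, just a sketch of the mapping):

echo "http://gora.apache.org/" | awk -F/ '{
  sub(/:$/, "", $1)                          # scheme, e.g. "http"
  n = split($3, h, ".")                      # host components
  key = h[n]
  for (i = n - 1; i >= 1; i--) key = key "." h[i]
  printf "%s:%s/\n", key, $1                 # prints: org.apache.gora:http/
}'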

Let's define each step in depth so that we can understand crawling step by step.

Apache Nutch (1.x) has three main data structures: crawlDB, linkdb, and a set of segments. crawlDB is the directory that holds information about every URL known to Nutch; if a URL has been fetched, crawlDB also records when it was fetched. The link database, or linkdb, contains all the links pointing to each URL, including the source URL and the anchor text of each link. A segment is a set of URLs fetched as a unit, and its directory contains the following subdirectories:

- A crawl_generate directory names the set of URLs to be fetched

- A crawl_fetch directory contains the status of fetching each URL

- A content directory contains the content retrieved from every URL

Now let's look at each crawling job in detail.
