Integrating Nutch, HBase, and Solr to Build a Search Engine

1. Download the required software and unpack it

The versions used are:

(1) apache-nutch-2.2.1

(2) hbase-0.90.4

(3) solr-4.9.0

Unpack all of them into /usr/search.
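
For reference, a download-and-unpack sketch. The Apache archive URLs and tarball names below are assumptions; substitute a current mirror if they have moved. Note that Nutch must be the source distribution, because it is built with ant in step 2.

mkdir -p /usr/search && cd /usr/search
wget http://archive.apache.org/dist/nutch/2.2.1/apache-nutch-2.2.1-src.tar.gz
wget http://archive.apache.org/dist/hbase/hbase-0.90.4/hbase-0.90.4.tar.gz
wget http://archive.apache.org/dist/lucene/solr/4.9.0/solr-4.9.0.tgz
tar -zxf apache-nutch-2.2.1-src.tar.gz    # unpacks to apache-nutch-2.2.1
tar -zxf hbase-0.90.4.tar.gz              # unpacks to hbase-0.90.4
tar -zxf solr-4.9.0.tgz                   # unpacks to solr-4.9.0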

2. Configuring Nutch

(1) vi /usr/search/apache-nutch-2.2.1/conf/nutch-site.xml and add the following property:

<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>

(2) vi /usr/search/apache-nutch-2.2.1/ivy/ivy.xml

By default this line is commented out; remove the comment markers so that it takes effect:

    <dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />

(3) vi /usr/search/apache-nutch-2.2.1/conf/gora.properties

Add the following line:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

The three steps above configure Nutch to use HBase for storage.
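
A quick way to double-check the three changes (paths as above):

# each grep should print the line touched in the corresponding step;
# for ivy.xml, confirm the gora-hbase line is no longer wrapped in a comment
grep -n "HBaseStore" /usr/search/apache-nutch-2.2.1/conf/nutch-site.xml
grep -n "gora-hbase" /usr/search/apache-nutch-2.2.1/ivy/ivy.xml
grep -n "gora.datastore.default" /usr/search/apache-nutch-2.2.1/conf/gora.properties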

The steps below are the ones actually required to build a basic Nutch.

(4) Build the runtime

cd /usr/search/apache-nutch-2.2.1/

ant runtime
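
If the build succeeds, ant creates the runtime directories used in the rest of this guide:

ls /usr/search/apache-nutch-2.2.1/runtime
# expected: deploy  local
#   local  - scripts and configuration for standalone (local) mode, used below
#   deploy - job jar and scripts for running on a Hadoop cluster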

(5) Verify the Nutch installation

cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/

./nutch

Usage: nutch COMMAND

where COMMAND is one of:

inject         inject new urls into the database

hostinject     creates or updates an existing host table from a text file

generate       generate new batches to fetch from crawl db

fetch          fetch URLs marked during generate

parse          parse URLs marked during fetch

updatedb       update web table after parsing

updatehostdb   update host table after parsing

readdb         read/dump records from page database

readhostdb     display entries from the hostDB

elasticindex   run the elasticsearch indexer

solrindex      run the solr indexer on parsed batches

solrdedup      remove duplicates from solr

parsechecker   check the parser for a given url

indexchecker   check the indexing filters for a given url

plugin         load a plugin and run one of its classes main()

nutchserver    run a (local) Nutch server on a user defined port

junit          runs the given JUnit test

or

CLASSNAME      run the class named CLASSNAME

Most commands print help when invoked w/o parameters.

(6) vi /usr/search/apache-nutch-2.2.1/runtime/local/conf/nutch-site.xml and add the agent name for the crawl job:

<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
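
The ant build copies conf/ into runtime/local/conf, so after this edit the file should hold both the storage class from step (1) and the agent name. A minimal sketch of the resulting nutch-site.xml:

<?xml version="1.0"?>
<configuration>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
</property>
<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
</configuration>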

(7) Create seed.txt

cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/

vi seed.txt

http://nutch.apache.org/

(8) Modify the URL filter

vi /usr/search/apache-nutch-2.2.1/conf/regex-urlfilter.txt

# accept anything else

+.

Change it to:

# accept anything else

+^http://([a-z0-9]*\.)*nutch.apache.org/
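
The filter uses Java regular expressions, but for a simple pattern like this grep -E behaves the same way, so the new rule can be sanity-checked from the shell (the test URLs are just examples):

# prints the URL, i.e. it would be accepted:
echo "http://nutch.apache.org/downloads.html" | grep -E "^http://([a-z0-9]*\.)*nutch.apache.org/"
# prints nothing, i.e. it would be rejected:
echo "http://www.example.com/" | grep -E "^http://([a-z0-9]*\.)*nutch.apache.org/"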

3. Configuring HBase

(1) vi /usr/search/hbase-0.90.4/conf/hbase-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hbase.rootdir</name>
<value><Your path></value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value><Your path></value>
</property>
</configuration>

Note: this step is optional. If you skip it, the defaults from hbase-default.xml (/usr/search/hbase-0.90.4/src/main/resources/hbase-default.xml) are used.

The default is:

  <property>
    <name>hbase.rootdir</name>
    <value>file:///tmp/hbase-${user.name}/hbase</value>
    <description>The directory shared by region servers and into
    which HBase persists.  The URL should be 'fully-qualified'
    to include the filesystem scheme.  For example, to specify the
    HDFS directory '/hbase' where the HDFS instance's namenode is
    running at namenode.example.org on port 9000, set this value to:
    hdfs://namenode.example.org:9000/hbase.  By default HBase writes
    into /tmp.  Change this configuration else all data will be lost
    on machine restart.
    </description>
  </property>

In other words, by default the data is written under /tmp, so it may be lost when the machine restarts.
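
For example, to keep the data under /usr/search instead of /tmp, the optional hbase-site.xml from step (1) could be filled in as follows (the two data paths are only an assumption; any local directory works):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hbase.rootdir</name>
<value>file:///usr/search/data/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/usr/search/data/zookeeper</value>
</property>
</configuration>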

4. Configuring Solr

(1) Overwrite Solr's schema.xml with the one that ships with Nutch.

cp /usr/search/apache-nutch-2.2.1/conf/schema.xml /usr/search/solr-4.9.0/example/solr/collection1/conf/

(2) With Solr 3.6 the configuration would be complete at this point, but Solr 4.9 needs the following additional changes.

Edit the schema.xml file that was just copied over:

Delete: <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />

Add: <field name="_version_" type="long" indexed="true" stored="true"/>
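
The new _version_ field sits alongside the existing <field> definitions inside the <fields> element of the copied schema.xml; a minimal sketch (the surrounding Nutch fields stay as they are):

<fields>
<!-- ... existing Nutch field definitions, unchanged ... -->
<field name="_version_" type="long" indexed="true" stored="true"/>
</fields>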

5. Starting the crawl

(1) Start HBase

cd /usr/search/hbase-0.90.4/bin/

./start-hbase.sh
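
An optional check that HBase came up (a sketch; run from the same bin directory):

jps | grep HMaster                 # the standalone HMaster process should be listed
echo "status" | ./hbase shell      # should report one running server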

(2) Start Solr

cd /usr/search/solr-4.9.0/example/

java -jar start.jar
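
An optional check that Solr is up (a sketch; run it from another terminal, since start.jar stays in the foreground):

curl "http://localhost:8983/solr/admin/cores?action=STATUS"
# the response should list the collection1 core that solrindex writes into later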

(3) Start Nutch and kick off the crawl

cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/

./crawl seed.txt TestCrawl http://localhost:8983/solr 2

That's it; the crawl job is now running.
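
The crawl script's arguments are the seed file (or directory), a crawl ID that identifies this crawl's data in HBase, the Solr URL to index into, and the number of crawl rounds. The script is just a wrapper around the jobs in the usage listing above; roughly the same crawl can be driven job by job, as in the sketch below (argument forms are from Nutch 2.x, so check ./nutch <command> for the exact options of your build):

./nutch inject seed.txt                               # load the seed URLs into the webpage table
for round in 1 2; do                                  # two rounds, matching the crawl command above
  ./nutch generate -topN 50                           # mark a batch of URLs for fetching (the -topN limit here is arbitrary)
  ./nutch fetch -all                                  # fetch everything marked
  ./nutch parse -all                                  # parse the fetched pages
  ./nutch updatedb                                    # fold new outlinks back into the webpage table
done
./nutch solrindex http://localhost:8983/solr -all     # index the parsed pages into Solr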

6. The basic Nutch crawl workflow

Crawling the web was covered above; you can add more URLs to the seed.txt file and crawl them the same way.

When a user invokes a crawling command in Apache Nutch 1.x, Nutch generates a CrawlDB, which is simply a directory containing details about the crawl. In Apache Nutch 2.x there is no CrawlDB; instead, Nutch keeps all crawling data directly in the database. In our case that database is Apache HBase, so all crawling data goes into HBase. The following describes how each crawling step works.

A crawling cycle has four steps, each implemented as a Hadoop MapReduce job:

- GeneratorJob

- FetcherJob

- ParserJob (optionally done while fetching, using 'fetch.parse')

- DbUpdaterJob

Additionally, the following processes need to be understood:

- InjectorJob

- Invertlinks

- Indexing with Apache Solr

First of all, the job of the Injector is to populate the initial rows of the web table: the InjectorJob takes the URLs we provide and inserts them into the crawl database.

The GeneratorJob then operates on these injected URLs. The table used for input and output by these jobs is called webpage, and every row in it is a URL (web page). The row key is the URL stored with its host components reversed, so that URLs from the same TLD and domain are kept together and form a group. In most NoSQL stores row keys are sorted, which gives an advantage: with row-key filtering, scanning a subset is faster than scanning the entire table. The following are examples of such row keys (a short sketch of the mapping follows the examples):

- org.apache.www:http/

- org.apache.gora:http/
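
A rough illustration of the reversed-host key format with a throwaway awk one-liner (this is not Nutch code, just a sketch of the mapping):

echo "http://gora.apache.org/" | awk -F/ '{
  sub(/:$/, "", $1)                          # scheme, e.g. "http"
  n = split($3, h, ".")                      # host components
  key = h[n]
  for (i = n - 1; i >= 1; i--) key = key "." h[i]
  printf "%s:%s/\n", key, $1                 # prints: org.apache.gora:http/
}'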

Let's define each step in depth so that we can understand crawling step by step.

Apache Nutch (1.x) has three main data structures: crawlDB, linkdb, and a set of segments. crawlDB is the directory that holds information about every URL known to Nutch; if a URL has been fetched, crawlDB also records when it was fetched. The link database, or linkdb, contains all the links pointing to each URL, including the source URL and the anchor text of each link. A segment is a set of URLs fetched as a unit, and its directory contains the following subdirectories:

- A crawl_generate directory names the set of URLs to be fetched

- A crawl_fetch directory contains the status of fetching each URL

- A content directory contains the content retrieved from every URL

Now let's look at each crawling job in detail.
