(1)vi /usr/search/apache-nutch-2.2.1/conf/nutch-site.xml

<description>Default class for storing data</description>

(2)vi /usr/search/apache-nutch-2.2.1/ivy/ivy.xml


    <dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />

(3)vi /usr/search/apache-nutch-2.2.1/conf/gora.properties






cd /usr/search/apache-nutch-2.2.1/

ant runtime


cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/

./nutch

Usage: nutch COMMAND

where COMMAND is one of:

inject         inject new urls into the database

hostinject     creates or updates an existing host table from a text file

generate       generate new batches to fetch from crawl db

fetch          fetch URLs marked during generate

parse          parse URLs marked during fetch

updatedb       update web table after parsing

updatehostdb   update host table after parsing

readdb         read/dump records from page database

readhostdb     display entries from the hostDB

elasticindex   run the elasticsearch indexer

solrindex      run the solr indexer on parsed batches

solrdedup      remove duplicates from solr

parsechecker   check the parser for a given url

indexchecker   check the indexing filters for a given url

plugin         load a plugin and run one of its classes main()

nutchserver    run a (local) Nutch server on a user defined port

junit          runs the given JUnit test


CLASSNAME      run the class named CLASSNAME

Most commands print help when invoked w/o parameters.

(6)vi /usr/search/apache-nutch-2.2.1/runtime/local/conf/nutch-site.xml (add the search task)

<value>My Nutch Spider</value>


cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/

vi seed.txt


(8)Modify the web page filter

vi /usr/search/apache-nutch-2.2.1/conf/regex-urlfilter.txt

# accept anything else



(1)vi /usr/search/hbase-0.90.4/conf/hbase-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<value><Your path></value>
<value><Your path></value>



    <description>The directory shared by region servers and into
    which HBase persists.  The URL should be 'fully-qualified'
    to include the filesystem scheme.  For example, to specify the
    HDFS directory '/hbase' where the HDFS instance's namenode is
    running at namenode.example.org on port 9000, set this value to:
    hdfs://namenode.example.org:9000/hbase.  By default HBase writes
    into /tmp.  Change this configuration else all data will be lost
    on machine restart.
    </description>




cp /usr/search/apache-nutch-2.2.1/conf/schema.xml /usr/search/solr-4.9.0/example/solr/collection1/conf/



Delete: <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />

Add: <field name="_version_" type="long" indexed="true" stored="true"/>



cd /usr/search/hbase-0.90.4/bin/

./start-hbase.sh


cd /usr/search/solr-4.9.0/example/

java -jar start.jar


cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/

./crawl seed.txt TestCrawl http://localhost:8983/solr 2



Crawling the web is already explained above. You can add more URLs to the seed.txt file and crawl them in the same way.

When a user invokes a crawl command in Apache Nutch 1.x, Nutch generates a CrawlDB, which is simply a directory containing details about the crawl. In Apache Nutch 2.x, CrawlDB is not present. Instead, Nutch keeps all crawl data directly in the database. In our case we have used Apache HBase, so all crawl data goes into HBase. The following sections describe how each crawling function works.

A crawling cycle has four steps, each of which is implemented as a Hadoop MapReduce job:

• GeneratorJob

• FetcherJob

• ParserJob (optionally done while fetching, using 'fetcher.parse')

• DbUpdaterJob

Additionally, the following processes need to be understood:

• InjectorJob

• Invertlinks

• Indexing with Apache Solr
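
The steps listed above can be sketched as separate calls to the nutch script instead of a single ./crawl run. This is only a rough sketch: the subcommand names come from the ./nutch usage output shown earlier, but the flags (-crawlId, -topN, -all) and the value 50 are assumptions that should be checked against the help each command prints when invoked without parameters. The helper names run and crawl_once are hypothetical. By default the script only prints the commands (dry run).

```shell
#!/usr/bin/env bash
# Sketch of one crawl round as individual nutch calls.
# Assumption: flags such as -crawlId, -topN and -all match your Nutch build.

NUTCH="${NUTCH:-./nutch}"      # path to the nutch script (runtime/local/bin)
DRY_RUN="${DRY_RUN:-1}"        # 1 = only print commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [[ "$DRY_RUN" == "1" ]]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: inject the seeds, then run one generate/fetch/parse/update round
# and index the result into Solr.
crawl_once() {
  local seed="$1" crawl_id="$2" solr_url="$3"
  run "$NUTCH" inject "$seed" -crawlId "$crawl_id"              # InjectorJob
  run "$NUTCH" generate -topN 50 -crawlId "$crawl_id"           # GeneratorJob
  run "$NUTCH" fetch -all -crawlId "$crawl_id"                  # FetcherJob
  run "$NUTCH" parse -all -crawlId "$crawl_id"                  # ParserJob
  run "$NUTCH" updatedb -crawlId "$crawl_id"                    # DbUpdaterJob
  run "$NUTCH" solrindex "$solr_url" -all -crawlId "$crawl_id"  # indexing with Solr
}

crawl_once seed.txt TestCrawl http://localhost:8983/solr
```

Setting DRY_RUN=0 would execute the commands for real; repeating everything after the inject step corresponds to additional crawl rounds.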

First of all, the job of the Injector is to populate the initial rows of the web table. The InjectorJob initializes crawlDB with the URLs that we provide: we run the InjectorJob with a set of seed URLs, which are then inserted into crawlDB.

The GeneratorJob then takes these injected URLs and selects the ones to fetch in the next batch. The table used as input and output for these jobs is called webpage, in which every row is a URL (web page). The row key is the URL with its host components reversed, so that URLs from the same TLD and domain are stored together and form a group. In most NoSQL stores, row keys are kept sorted, which gives an advantage: with rowkey filtering, scanning a subset is faster than scanning the entire table. The following are examples of row keys:

• org.apache.www:http/

• org.apache.gora:http/
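
To make the reversed-host row key concrete, here is a minimal bash sketch that turns a URL into a key of the form shown above. It is not Nutch's own code: the function name reverse_url is hypothetical, and the handling of ports (appended after the scheme) is an assumption about the key layout.

```shell
#!/usr/bin/env bash
# Sketch: build a reversed-host row key such as org.apache.www:http/ from a URL.
# Assumption: key layout is <reversed host>:<scheme>[:<port>]<path>.

reverse_url() {
  local url="$1"
  local scheme="${url%%://*}"        # e.g. http
  local rest="${url#*://}"           # e.g. www.apache.org/index.html
  local hostport="${rest%%/*}"       # e.g. www.apache.org or host:8080
  local path="/"
  if [[ "$rest" == */* ]]; then
    path="/${rest#*/}"               # everything after the first slash
  fi
  local host="${hostport%%:*}"
  local port=""
  if [[ "$hostport" == *:* ]]; then
    port=":${hostport##*:}"
  fi

  # Reverse the dot-separated host components: www.apache.org -> org.apache.www
  local reversed="" part
  local IFS='.'
  for part in $host; do
    if [[ -z "$reversed" ]]; then
      reversed="$part"
    else
      reversed="$part.$reversed"
    fi
  done

  printf '%s:%s%s%s\n' "$reversed" "$scheme" "$port" "$path"
}

reverse_url "http://www.apache.org/"
reverse_url "http://gora.apache.org/"
```

Because both keys start with org.apache, they sort next to each other, which is what lets a scan over one domain touch only a contiguous range of rows.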

Let's define each step in depth so that we can understand crawling step-by-step.

In the 1.x layout, Apache Nutch uses three main directories: crawlDB, linkdb, and a set of segments. crawlDB is the directory that contains information about every URL known to Apache Nutch; if a URL has been fetched, crawlDB records when it was fetched. The link database, or linkdb, contains all the known links to each URL, including the source URL and the anchor text of the link. A segment is a set of URLs that is fetched as a unit. Each segment directory contains the following subdirectories:

• crawl_generate names a set of URLs to be fetched

• crawl_fetch contains the status of fetching each URL

• content contains the raw content retrieved from each URL

Now let's understand each job of crawling in detail.






