



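Although this tutorial covers Nutch 1.x, the Nutch 2.x tutorial on the official site only explains how to configure a few new features. The basics for Nutch 2.x are still the ones covered in this tutorial.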


Apache Nutch is an open source Web crawler written in Java. With it, we can find Web page hyperlinks in an automated manner, reduce a lot of maintenance work (for example, checking for broken links), and create a copy of all the visited pages for searching over. That’s where Apache Solr comes in. Solr is an open source full-text search framework; with Solr we can search the pages that Nutch has visited. Luckily, integration between Nutch and Solr is pretty straightforward, as explained below.

Apache Nutch supports Solr out of the box, greatly simplifying Nutch-Solr integration. It also removes the legacy dependence upon both Apache Tomcat for running the old Nutch Web Application and upon Apache Lucene for indexing. Just download a binary release from the Apache Nutch downloads page.












Apache Ant



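2. Apache Ant is essential. The entire Nutch build is driven by a configuration file called build.xml, and that file can only be run with Ant. The official Nutch source does not ship Eclipse project files, so Eclipse cannot compile Nutch directly. Apache Ant can convert the official source into an Eclipse project, but that approach is not ideal.

3. Before following the tutorial below, you must install Linux (or Unix, or Cygwin), a JDK, and Apache Ant; otherwise the steps below will not work. Installing them may take several hours, but it is required.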

1. Install Nutch


Option 1: Setup Nutch from a binary distribution

  • Download a binary package from the Apache Nutch downloads page.
  • Unzip your binary Nutch package. There should be a folder apache-nutch-1.X/.
  • cd apache-nutch-1.X/

From now on, we are going to use ${NUTCH_RUNTIME_HOME} to refer to the current directory (apache-nutch-1.X/).






Option 2: Set up Nutch from a source distribution

Advanced users may also use the source distribution:

  • Download a source package
  • Unzip
  • cd apache-nutch-1.X/
  • Run ant in this folder (cf. RunNutchInEclipse)
  • Now there is a directory runtime/local which contains a ready to use Nutch installation.

When the source distribution is used, ${NUTCH_RUNTIME_HOME} refers to apache-nutch-1.X/runtime/local/. Note that

  • config files should be modified in apache-nutch-1.X/runtime/local/conf/
  • ant clean will remove this directory (keep copies of modified config files)
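Because ant clean removes runtime/local, it can help to copy the modified configuration aside first. The following is a minimal sketch of that step, not part of Nutch itself: backup_nutch_conf is a hypothetical helper name, and the backup location is an assumption you should adapt.

```shell
#!/bin/sh
# Sketch: copy Nutch's runtime configuration aside before running `ant clean`.
# backup_nutch_conf is a hypothetical helper, not something Nutch provides.
backup_nutch_conf() {
  source_root="$1"   # the unpacked source folder, e.g. apache-nutch-1.X
  backup_dir="$2"    # any directory outside runtime/ (assumed location)
  conf_dir="$source_root/runtime/local/conf"

  if [ ! -d "$conf_dir" ]; then
    echo "no conf directory at $conf_dir" >&2
    return 1
  fi

  # "conf/." copies the directory contents, including hidden files.
  mkdir -p "$backup_dir" && cp -R "$conf_dir/." "$backup_dir/"
}

# Usage, from the directory that contains apache-nutch-1.X/:
#   backup_nutch_conf apache-nutch-1.X "$HOME/nutch-conf-backup" \
#     && (cd apache-nutch-1.X && ant clean)
```

After rebuilding with ant, the saved files can be copied back into apache-nutch-1.X/runtime/local/conf/.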









2. Verify your Nutch installation

  • run "bin/nutch" - You can confirm a correct installation if you see something similar to the following:
Usage: nutch COMMAND where command is one of:
crawl             one-step crawler for intranets (DEPRECATED)
readdb            read / dump crawl db
mergedb           merge crawldb-s, with optional filtering
readlinkdb        read / dump link db
inject            inject new urls into the database
generate          generate new segments to fetch from crawl db
freegen           generate new segments to fetch from text files
fetch             fetch a segment's pages

Some troubleshooting tips:

  • Run the following command if you are seeing "Permission denied":
chmod +x bin/nutch
  • Set up JAVA_HOME if you are seeing "JAVA_HOME not set". On Mac, you can run the following command or add it to ~/.bashrc:
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home

On Debian or Ubuntu, you can run the following command or add it to ~/.bashrc:

export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
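Before re-running bin/nutch, it can be useful to confirm that the exported value actually points at a Java installation. The sketch below is one way to check this; check_java_home is a hypothetical helper name, and the assumption that a usable installation has an executable at $JAVA_HOME/bin/java may not hold for every layout.

```shell
#!/bin/sh
# Sketch: sanity-check JAVA_HOME before launching bin/nutch.
# check_java_home is a hypothetical helper, not part of Nutch.
check_java_home() {
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is not set" >&2
    return 1
  fi
  # Assumption: a usable installation exposes an executable bin/java.
  if [ ! -x "$JAVA_HOME/bin/java" ]; then
    echo "no executable java under $JAVA_HOME/bin" >&2
    return 2
  fi
  echo "JAVA_HOME looks usable: $JAVA_HOME"
}

# Usage:
#   check_java_home && bin/nutch
```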


3. Crawl your first website

Nutch requires two configuration changes before a website can be crawled:

  1. Customize your crawl properties, where at a minimum, you provide a name for your crawler for external servers to recognize
  2. Set a seed list of URLs to crawl





3.1 Customize your crawl properties

  • Default crawl properties can be viewed and edited within conf/nutch-default.xml; most of these can be used without modification.
  • The file conf/nutch-site.xml serves as a place to add your own custom crawl properties that override those in conf/nutch-default.xml. The only required modification for this file is to override the value field of the agent-name property.

    • i.e. add your agent name in the value field of that property in conf/nutch-site.xml, for example:
 <value>My Nutch Spider</value>
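For context, a complete override file might be generated as in the sketch below. It assumes the agent-name property referred to above is http.agent.name, as defined in Nutch's default configuration; write_nutch_site is a hypothetical helper name. Note that the sketch overwrites any existing nutch-site.xml and does not escape XML special characters in the agent name.

```shell
#!/bin/sh
# Sketch: write a minimal nutch-site.xml that only sets the crawler's agent name.
# Assumption: the property to override is http.agent.name.
# write_nutch_site is a hypothetical helper; it OVERWRITES the target file.
write_nutch_site() {
  conf_dir="$1"     # e.g. ${NUTCH_RUNTIME_HOME}/conf
  agent_name="$2"   # plain text; XML special characters are not escaped here

  cat > "$conf_dir/nutch-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>$agent_name</value>
  </property>
</configuration>
EOF
}

# Usage, from ${NUTCH_RUNTIME_HOME}:
#   write_nutch_site conf "My Nutch Spider"
```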


3.2 Create a URL seed list

  • A URL seed list contains a list of websites, one per line, which Nutch will look to crawl
  • The file conf/regex-urlfilter.txt provides regular expressions that allow Nutch to filter and narrow the types of web resources to crawl and download

Create a URL seed list

  • mkdir -p urls
  • cd urls
  • touch seed.txt to create a text file seed.txt under urls/, then add one URL per line for each site you want Nutch to crawl.
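The three steps above can also be done in one go. The sketch below is only an illustration: create_seed_list is a hypothetical helper name, and http://example.com/ is a placeholder URL, not the site used in the original tutorial.

```shell
#!/bin/sh
# Sketch: create a seed directory and write one URL per line into seed.txt.
# create_seed_list is a hypothetical helper, not part of Nutch.
create_seed_list() {
  seed_dir="$1"
  shift                              # remaining arguments are the seed URLs
  mkdir -p "$seed_dir" || return 1
  : > "$seed_dir/seed.txt"           # start from an empty file
  for url in "$@"; do
    printf '%s\n' "$url" >> "$seed_dir/seed.txt"
  done
}

# Usage, from ${NUTCH_RUNTIME_HOME} (placeholder URL):
#   create_seed_list urls "http://example.com/"
```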

(Optional) Configure Regular Expression Filters

Edit the file conf/regex-urlfilter.txt and replace

# accept anything else

with a regular expression matching the domain you wish to crawl. Such a rule limits the crawl to that domain and includes any URL within it.

NOTE: If you do not specify any domains to include in regex-urlfilter.txt, every domain that your seed URLs link to will be crawled as well.
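As an illustration of such a rule, the sketch below rewrites the catch-all rule in a filter file so that only one domain is accepted. Several things here are assumptions rather than facts from this tutorial: the catch-all rule under the "# accept anything else" comment is taken to be a line reading exactly "+.", example.com is a placeholder domain, the generated pattern is just one plausible form, and restrict_to_domain is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: replace the catch-all accept rule with a single-domain accept rule.
# Assumption: the catch-all rule is a line that reads exactly "+.".
# restrict_to_domain is a hypothetical helper, not part of Nutch.
restrict_to_domain() {
  filter_file="$1"   # e.g. conf/regex-urlfilter.txt
  domain="$2"        # e.g. example.com (placeholder)

  # Escape dots so they match literally inside the regular expression.
  escaped_domain=$(printf '%s' "$domain" | sed 's/[.]/\\./g')
  rule="+^https?://([a-z0-9-]*\\.)*${escaped_domain}/"

  # Pass the rule through the environment so awk does not reinterpret backslashes.
  RULE="$rule" awk '$0 == "+." { print ENVIRON["RULE"]; next } { print }' \
    "$filter_file" > "$filter_file.tmp" &&
    mv "$filter_file.tmp" "$filter_file"
}

# Usage, from ${NUTCH_RUNTIME_HOME} (keep a copy of the file first):
#   restrict_to_domain conf/regex-urlfilter.txt example.com
```

With the placeholder domain, the rewritten line accepts URLs on example.com and its subdomains; every other line of the file is left unchanged.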


3.3 Using the Crawl Command

The crawl command is deprecated. Please see section 3.5 on how to use the crawl script that is intended to replace the crawl command.

Now we are ready to initiate a crawl. Use the following parameters:

  • -dir dir names the directory to put the crawl in.
  • -threads threads determines the number of threads that will fetch in parallel.
  • -depth depth indicates the link depth from the root page that should be crawled.
  • -topN N determines the maximum number of pages that will be retrieved at each level up to the depth.
  • Run the following command:
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
  • Now you should be able to see the crawl output directories created under crawl/.

NOTE: If you have a Solr core already set up and wish to index to it, you are required to add the -solr <solrUrl> parameter to your crawl command, e.g.

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5

If not, please skip ahead to the section on how to set up your Solr instance and index your crawl data.

Typically one starts testing one's configuration by crawling at shallow depths, sharply limiting the number of pages fetched at each level (-topN), and watching the output to check that desired pages are fetched and undesirable pages are not. Once one is confident of the configuration, an appropriate depth for a full crawl is around 10. The number of pages per level (-topN) for a full crawl can be from tens of thousands to millions, depending on your resources.
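One way to follow that advice is to wrap the command so the exact invocation can be inspected before anything is fetched. The sketch below uses only the parameters described in this section; nutch_crawl and the DRY_RUN variable are hypothetical names introduced for illustration, and the wrapper assumes it is run from ${NUTCH_RUNTIME_HOME}. Since the crawl command is deprecated, treat this as a testing aid rather than a recommended workflow.

```shell
#!/bin/sh
# Sketch: wrap the (deprecated) crawl command for shallow test runs.
# nutch_crawl and DRY_RUN are hypothetical names, not part of Nutch.
nutch_crawl() {
  seed_dir="$1"    # directory containing seed.txt, e.g. urls
  crawl_dir="$2"   # value for -dir
  depth="$3"       # value for -depth
  top_n="$4"       # value for -topN

  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Only print the command that would be run.
    echo "bin/nutch crawl $seed_dir -dir $crawl_dir -depth $depth -topN $top_n"
    return 0
  fi
  bin/nutch crawl "$seed_dir" -dir "$crawl_dir" -depth "$depth" -topN "$top_n"
}

# Usage, from ${NUTCH_RUNTIME_HOME}:
#   DRY_RUN=1; nutch_crawl urls crawl 3 5    # inspect the command first
#   DRY_RUN=0; nutch_crawl urls crawl 3 5    # shallow test crawl
```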




