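This article is incomplete. It is not yet known whether pages can be downloaded step by step with the nutch commands.
1 Basic operations: building the environment
(1) Download the software package and extract it to /usr/search/apache-nutch-2.2.1/
(2) Build the runtime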
cd /usr/search/apache-nutch-2.2.1/
ant runtime
(3) Verify that Nutch is installed
[[email protected] apache-nutch-2.2.1]# cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/
[[email protected] bin]# ./nutch
Usage: nutch COMMAND
where COMMAND is one of:
inject inject new urls into the database
hostinject creates or updates an existing host table from a text file
generate generate new batches to fetch from crawl db
fetch fetch URLs marked during generate
parse parse URLs marked during fetch
updatedb update web table after parsing
updatehostdb update host table after parsing
readdb read/dump records from page database
readhostdb display entries from the hostDB
elasticindex run the elasticsearch indexer
solrindex run the solr indexer on parsed batches
solrdedup remove duplicates from solr
parsechecker check the parser for a given url
indexchecker check the indexing filters for a given url
plugin load a plugin and run one of its classes main()
nutchserver run a (local) Nutch server on a user defined port
junit runs the given JUnit test
or
CLASSNAME run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
(4) vi /usr/search/apache-nutch-2.2.1/runtime/local/conf/nutch-site.xml and add the crawl agent setting
<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>
(5) Create seed.txt
cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/
vi seed.txt
http://nutch.apache.org/
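The seed file can also be created without an editor. A minimal sketch, assuming the commands are run from runtime/local and that the seed directory is named urls, to match the inject command used in section 2. The step above creates seed.txt under bin/ instead, so the directory name here is an assumption:

```shell
# Assumption: current directory is runtime/local; "urls" is the directory
# that is later passed to "./bin/nutch inject urls".
mkdir -p urls
printf '%s\n' 'http://nutch.apache.org/' > urls/seed.txt

# Show the result: one seed URL per line.
cat urls/seed.txt
```

Each additional site to crawl would go on its own line in the same file.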
(6) Modify the URL filter
vi /usr/search/apache-nutch-2.2.1/conf/regex-urlfilter.txt
Change
# accept anything else
+.
to
# accept anything else
+^http://([a-z0-9]*\.)*nutch.apache.org/
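To get a feel for what the new rule accepts, the pattern can be tried against a few URLs. This is only a rough sketch: grep -E is used as a stand-in for the Java regular expressions that the filter actually uses, and the second test URL is a made-up example. The leading + of the rule is the accept marker and is not part of the pattern.

```shell
# Pattern copied from the rule above, without the leading "+".
pattern='^http://([a-z0-9]*\.)*nutch.apache.org/'

# Hypothetical test URLs; only the first one is from this article.
for url in 'http://nutch.apache.org/' 'http://www.example.com/'; do
  if printf '%s\n' "$url" | grep -Eq "$pattern"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

A URL that matches no rule in regex-urlfilter.txt is dropped, so after this change only URLs under nutch.apache.org remain in the crawl.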
When a user runs a crawl command in Apache Nutch 1.x, Nutch generates the
CrawlDB, which is simply a directory that holds the details of the crawl.
In Apache Nutch 2.x there is no CrawlDB. Instead, Nutch keeps all the
crawl data directly in a database. In our case we have used Apache HBase,
so all the crawl data goes into Apache HBase.
2 InjectorJob
[[email protected] local]# ./bin/nutch inject urls
InjectorJob: starting at 2014-07-07 14:15:21
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 2
Injector: finished at 2014-07-07 14:15:24, elapsed: 00:00:03
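Note that the log above reports MemStore as the Gora storage class, while the paragraph before this section describes HBase. MemStore is Gora's in-memory store, so, as far as I understand, data written by one nutch command is not available to the next one, which may be part of why the later steps find nothing to fetch. The sketch below only extracts the storage class from that log line. The sample line is copied from the log above, and the wording of the messages it prints is my own:

```shell
# Log line copied from the inject output above.
log_line='InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.'

# Pull out the fully qualified class name.
store=$(printf '%s\n' "$log_line" |
  sed -n 's/^InjectorJob: Using class \(.*\) as the Gora storage class\.$/\1/p')

case "$store" in
  *MemStore) echo "in-memory store: $store" ;;
  *)         echo "other store: $store" ;;
esac
```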
3 GeneratorJob
[[email protected] local]# ./bin/nutch generate
Usage: GeneratorJob [-topN N] [-crawlId id] [-noFilter] [-noNorm] [-adddays numDays]
-topN <N> - number of top URLs to be selected, default is Long.MAX_VALUE
-crawlId <id> - the id to prefix the schemas to operate on,
(default: storage.crawl.id)");
-noFilter - do not activate the filter plugin to filter the url, default is true
-noNorm - do not activate the normalizer plugin to normalize the url, default is true
-adddays - Adds numDays to the current time to facilitate crawling urls already
fetched sooner then db.fetch.interval.default. Default value is 0.
-batchId - the batch id
----------------------
Please set the params.
[[email protected] local]# ./bin/nutch generate -topN 3
GeneratorJob: starting at 2014-07-07 14:22:55
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: normalizing: true
GeneratorJob: topN: 3
GeneratorJob: finished at 2014-07-07 14:22:58, time elapsed: 00:00:03
GeneratorJob: generated batch id: 1404714175-1017128204
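The batch id printed on the last line can be handed to the later jobs in place of -all. A small sketch of extracting it, assuming the log line keeps the format shown above. The sample line is copied from that log, and in practice it would come from the output of ./bin/nutch generate:

```shell
# Sample line copied from the generate log above.
log_line='GeneratorJob: generated batch id: 1404714175-1017128204'

# Strip the fixed prefix and keep only the id.
batch_id=$(printf '%s\n' "$log_line" |
  sed -n 's/^GeneratorJob: generated batch id: //p')
echo "$batch_id"

# The id could then be passed on, for example:
#   ./bin/nutch fetch "$batch_id"
#   ./bin/nutch parse "$batch_id"
```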
4 FetcherJob
The FetcherJob fetches the URLs generated by the GeneratorJob, using the
GeneratorJob's output as its input. The following command is used for the
FetcherJob:
[[email protected] local]# bin/nutch fetch -all
FetcherJob: starting
FetcherJob: batchId: -all
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread0, activeThreads=0
-finishing thread FetcherThread1, activeThreads=0
-finishing thread FetcherThread2, activeThreads=0
-finishing thread FetcherThread3, activeThreads=0
-finishing thread FetcherThread4, activeThreads=0
-finishing thread FetcherThread5, activeThreads=0
-finishing thread FetcherThread6, activeThreads=0
-finishing thread FetcherThread7, activeThreads=1
-finishing thread FetcherThread8, activeThreads=0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread9, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: done
Here I have passed the -all parameter, which means that this job will fetch
all the URLs that are generated by the GeneratorJob. You can use different
input parameters according to your needs.
5 ParserJob
After the FetcherJob, the ParserJob parses the URLs that were fetched by the
FetcherJob. The following command is used for the ParserJob:
[[email protected] local]# bin/nutch parse -all
ParserJob: starting
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: batchId: -all
ParserJob: success
[[email protected] local]#
I have used the -all parameter, which will parse all the URLs fetched by the
FetcherJob. You can use different input parameters according to your needs.
6 DbUpdaterJob
[[email protected] local]# ./bin/nutch updatedb
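Taken together, sections 2 to 6 form one crawl round. The sketch below strings the commands from this article into a single script. The DRY_RUN variable and the run and crawl_round helpers are hypothetical names introduced here, and the script assumes it is started from runtime/local with the seeds in ./urls. By default it only prints the commands. Whether the round actually downloads pages is, as noted at the top, not yet verified.

```shell
#!/bin/sh
# One crawl round: inject -> generate -> fetch -> parse -> updatedb.
# Assumption: run from runtime/local. DRY_RUN=1 (the default here) prints
# each command instead of executing it.
NUTCH=${NUTCH:-./bin/nutch}
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$NUTCH $*"
  else
    # Stop the round as soon as one job fails.
    "$NUTCH" "$@" || exit 1
  fi
}

crawl_round() {
  run inject urls
  run generate -topN 3
  run fetch -all
  run parse -all
  run updatedb
}

crawl_round
```

Setting DRY_RUN=0 runs the commands for real. Repeating the generate, fetch, parse and updatedb steps would give further rounds.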
[Incomplete] Downloading web pages step by step with nutch commands