[Nutch Basics Tutorial 7] Nutch's two run modes: local and deploy

After running ant runtime on the Nutch source code, a runtime directory is created, containing two subdirectories: deploy and local.

[jediael@jediael runtime]$ ls

deploy  local

These two directories correspond to Nutch's two run modes: deploy mode and local mode.
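
For reference, a typical way to produce these directories from a Nutch 2.x source checkout looks like this (the source directory name is illustrative):

$ cd apache-nutch-2.2.1
$ ant runtime
$ ls runtime
deploy  local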

The following uses inject as an example to demonstrate both run modes.

I. Local mode

1. Basic usage:

$ bin/nutch inject
Usage: InjectorJob <url_dir> [-crawlId <id>]

Usage 1: without specifying an id

liaoliuqingdeMacBook-Air:local liaoliuqing$ bin/nutch inject urls
InjectorJob: starting at 2014-12-20 22:32:01
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1

Injector: finished at 2014-12-20 22:32:15, elapsed: 00:00:14

Usage 2: specifying an id

$ bin/nutch inject urls -crawlId 2
InjectorJob: starting at 2014-12-20 22:34:01
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1

Injector: finished at 2014-12-20 22:34:15, elapsed: 00:00:14

2. Data changes in the database

The command above creates a new table in HBase named ${id}_webpage; if no id is specified, the table is named webpage.

The contents of the files in the urls directory are then written into this table as the crawl seeds.

hbase(main):003:0> scan 'webpage'
ROW                   COLUMN+CELL
 com.163.www:http/    column=f:fi, timestamp=1419085934952, value=\x00'\x8D\x00
 com.163.www:http/    column=f:ts, timestamp=1419085934952, value=\x00\x00\x01Jh
                      \x1C\xBC7
 com.163.www:http/    column=mk:_injmrk_, timestamp=1419085934952, value=y
 com.163.www:http/    column=mk:dist, timestamp=1419085934952, value=0
 com.163.www:http/    column=mtdt:_csh_, timestamp=1419085934952, value=?\x80\x0
                      0\x00
 com.163.www:http/    column=s:s, timestamp=1419085934952, value=?\x80\x00\x00
1 row(s) in 0.6140 seconds
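
For reference, the row key com.163.www:http/ is simply the seed URL with its host name reversed; the urls directory in this run would therefore contain a plain-text seed file with one URL per line, roughly like this (the file name seed.txt is an assumption):

$ cat urls/seed.txt
http://www.163.com/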

Running the inject command again adds new URLs to the table.
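
For example, injecting a second seed directory and then checking the table size from the HBase shell (the directory name urls2 is only an illustration):

$ bin/nutch inject urls2
hbase(main):004:0> count 'webpage'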

3. Other commands

Running bin/nutch without arguments lists all available commands:

where COMMAND is one of:
 inject         inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 elasticindex   run the elasticsearch indexer
 solrindex      run the solr indexer on parsed batches
 solrdedup      remove duplicates from solr
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 junit          runs the given JUnit test
 or
 CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

Each step of a complete crawl cycle can be run individually with these commands and chained into a full workflow, as sketched below.
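
A minimal sketch of one hand-run iteration (option names follow the usage strings above; the exact batch options differ slightly between Nutch 2.x minor versions, and -all processes all pending batches where supported):

$ bin/nutch inject urls
$ bin/nutch generate -topN 10
$ bin/nutch fetch -all
$ bin/nutch parse -all
$ bin/nutch updatedb -all
$ bin/nutch solrindex http://localhost:8983/solr -all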

When the crawl command is used to run a crawl job, the basic sequence of steps is as follows:

(1)InjectorJob

Start of the first iteration

(2)GeneratorJob

(3)FetcherJob

(4)ParserJob

(5)DbUpdaterJob

(6)SolrIndexerJob

Start of the second iteration

(2)GeneratorJob

(3)FetcherJob

(4)ParserJob

(5)DbUpdaterJob

(6)SolrIndexerJob

Start of the third iteration

For details on how each step is executed, see http://blog.csdn.net/jediael_lu/article/details/38591067

4. Nutch also ships a crawl script that wraps these key steps, so you do not have to run the crawl workflow step by step.

[jediael@jediael local]$ bin/crawl
Missing seedDir : crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>

For example:

[root@jediael bin]# ./crawl seed.txt TestCrawl http://localhost:8983/solr 2

II. Deploy mode

1. Running with the hadoop command

Note: Hadoop and HBase must be started first.
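
For example, with a Hadoop 1.x cluster and HBase (assuming both bin directories are on the PATH):

$ start-all.sh
$ start-hbase.sh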

[jediael@jediael deploy]$ hadoop jar apache-nutch-2.2.1.job org.apache.nutch.crawl.InjectorJob file:///opt/jediael/apache-nutch-2.2.1/runtime/deploy/urls/
14/12/20 23:26:50 INFO crawl.InjectorJob: InjectorJob: starting at 2014-12-20 23:26:50
14/12/20 23:26:50 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: file:/opt/jediael/apache-nutch-2.2.1/runtime/deploy/urls
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.3.2-1031432, built on 11/05/2010 05:32 GMT
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:host.name=jediael
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.version=1.7.0_51
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.home=/usr/java/jdk1.7.0_51/jre
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.class.path=/opt/jediael/hadoop-1.2.1/libexec/../conf:/usr/java/jdk1.7.0_51/lib/tools.jar:/opt/jediael/hadoop-1.2.1/libexec/..:/opt/jediael/hadoop-1.2.1/libexec/../hadoop-core-1.2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/asm-3.2.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/aspectjrt-1.6.11.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/aspectjtools-1.6.11.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-beanutils-1.7.0.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-beanutils-core-1.8.0.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-cli-1.2.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-codec-1.4.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-collections-3.2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-configuration-1.6.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-daemon-1.0.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-digester-1.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-el-1.0.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-httpclient-3.0.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-io-2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-lang-2.4.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-logging-1.1.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-logging-api-1.0.4.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-math-2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/commons-net-3.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/core-3.1.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/hadoop-capacity-scheduler-1.2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/hadoop-fairscheduler-1.2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/hadoop-thriftfs-1.2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/hsqldb-1.8.0.10.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jackson-core-asl-1.8.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jackson-mapper-asl-1.8.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jasper-compiler-5.5.12.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jasper-runtime-5.5.12.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jdeb-0.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jersey-core-1.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jersey-json-1.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jersey-server-1.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jets3t-0.6.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jetty-6.1.26.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jetty-util-6.1.26.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jsch-0.1.42.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/junit-4.5.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/kfs-0.2.2.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/log4j-1.2.15.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/mockito-all-1.8.5.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/oro-2.0.8.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/servlet-api-2.5-20081211.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/slf4j-api-1.4.3.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/slf4j-log4j12-1.4.3.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/xmlenc-0.52.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jsp-2.1/jsp-2.1.jar:/opt/jediael/hadoop-1.2.1/libexec/../lib/jsp-2.1/jsp-api-2.1.jar
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.library.path=/opt/jediael/hadoop-1.2.1/libexec/../lib/native/Linux-amd64-64
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:os.name=Linux
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:os.arch=amd64
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:os.version=2.6.32-431.17.1.el6.x86_64
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:user.name=jediael
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:user.home=/home/jediael
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Client environment:user.dir=/opt/jediael/apache-nutch-2.2.1/runtime/deploy
14/12/20 23:26:52 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection
14/12/20 23:26:52 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181
14/12/20 23:26:52 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
14/12/20 23:26:52 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x14a5c24c9cf0657, negotiated timeout = 40000
14/12/20 23:26:52 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
14/12/20 23:26:55 INFO input.FileInputFormat: Total input paths to process : 1
14/12/20 23:26:55 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/12/20 23:26:55 WARN snappy.LoadSnappy: Snappy native library not loaded
14/12/20 23:26:56 INFO mapred.JobClient: Running job: job_201412202325_0002
14/12/20 23:26:57 INFO mapred.JobClient:  map 0% reduce 0%
14/12/20 23:27:15 INFO mapred.JobClient:  map 100% reduce 0%
14/12/20 23:27:17 INFO mapred.JobClient: Job complete: job_201412202325_0002
14/12/20 23:27:18 INFO mapred.JobClient: Counters: 20
14/12/20 23:27:18 INFO mapred.JobClient:   Job Counters
14/12/20 23:27:18 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=14058
14/12/20 23:27:18 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/12/20 23:27:18 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/12/20 23:27:18 INFO mapred.JobClient:     Rack-local map tasks=1
14/12/20 23:27:18 INFO mapred.JobClient:     Launched map tasks=1
14/12/20 23:27:18 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
14/12/20 23:27:18 INFO mapred.JobClient:   File Output Format Counters
14/12/20 23:27:18 INFO mapred.JobClient:     Bytes Written=0
14/12/20 23:27:18 INFO mapred.JobClient:   injector
14/12/20 23:27:18 INFO mapred.JobClient:     urls_injected=3
14/12/20 23:27:18 INFO mapred.JobClient:   FileSystemCounters
14/12/20 23:27:18 INFO mapred.JobClient:     FILE_BYTES_READ=149
14/12/20 23:27:18 INFO mapred.JobClient:     HDFS_BYTES_READ=130
14/12/20 23:27:18 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=78488
14/12/20 23:27:18 INFO mapred.JobClient:   File Input Format Counters
14/12/20 23:27:18 INFO mapred.JobClient:     Bytes Read=149
14/12/20 23:27:18 INFO mapred.JobClient:   Map-Reduce Framework
14/12/20 23:27:18 INFO mapred.JobClient:     Map input records=6
14/12/20 23:27:18 INFO mapred.JobClient:     Physical memory (bytes) snapshot=106311680
14/12/20 23:27:18 INFO mapred.JobClient:     Spilled Records=0
14/12/20 23:27:18 INFO mapred.JobClient:     CPU time spent (ms)=2420
14/12/20 23:27:18 INFO mapred.JobClient:     Total committed heap usage (bytes)=29753344
14/12/20 23:27:18 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=736796672
14/12/20 23:27:18 INFO mapred.JobClient:     Map output records=3
14/12/20 23:27:18 INFO mapred.JobClient:     SPLIT_RAW_BYTES=130
14/12/20 23:27:18 INFO crawl.InjectorJob: InjectorJob: total number of urls rejected by filters: 0
14/12/20 23:27:18 INFO crawl.InjectorJob: InjectorJob: total number of urls injected after normalization and filtering: 3
14/12/20 23:27:18 INFO crawl.InjectorJob: Injector: finished at 2014-12-20 23:27:18, elapsed: 00:00:27
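
As in local mode, -crawlId can be appended so that the seeds go into a per-crawl table, and the result can be checked from the HBase shell (the crawl id 2 and the table name 2_webpage simply follow the naming rule described above and are only an example):

$ hadoop jar apache-nutch-2.2.1.job org.apache.nutch.crawl.InjectorJob file:///opt/jediael/apache-nutch-2.2.1/runtime/deploy/urls/ -crawlId 2
hbase(main):001:0> count '2_webpage'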

III. Bonus: running Nutch from eclipse

This approach is essentially the same as the deploy mode.

Run InjectorJob from eclipse.
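
A minimal setup, assuming a standard eclipse Java application run configuration: set the main class to org.apache.nutch.crawl.InjectorJob, put the Nutch conf directory and dependencies on the classpath, and pass the seed directory as the program argument. This is roughly equivalent to:

java -cp <nutch-conf-and-dependency-classpath> org.apache.nutch.crawl.InjectorJob /Users/liaoliuqing/99_Project/2.x/urls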

Output in eclipse:

InjectorJob: starting at 2014-12-20 23:13:24
InjectorJob: Injecting urlDir: /Users/liaoliuqing/99_Project/2.x/urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1

Injector: finished at 2014-12-20 23:13:27, elapsed: 00:00:02