Nutch 2.2.1 Crawl Workflow

I. Overview of the Crawl Workflow

1. The Nutch crawl workflow
When a crawl is run with the crawl command, the basic steps are as follows (a simplified sketch of this loop appears after the list):
(1) InjectorJob
Start of the first iteration:
(2) GeneratorJob
(3) FetcherJob
(4) ParserJob
(5) DbUpdaterJob
(6) SolrIndexerJob
Start of the second iteration:
(2) GeneratorJob
(3) FetcherJob
(4) ParserJob
(5) DbUpdaterJob
(6) SolrIndexerJob
Start of the third iteration:
……
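The iterations above are what the crawl wrapper script drives. The following is a minimal sketch of that loop, built only from the bin/nutch sub-commands covered later in this article. It is a simplification: the real bin/crawl script passes the batch id produced by generate to fetch and parse instead of -all, and it handles many more options.

#!/bin/bash
# Simplified sketch of the Nutch 2.x crawl loop (illustration only).
SEEDDIR=$1     # directory containing seed.txt, e.g. urls/
CRAWL_ID=$2    # e.g. test_2
SOLR_URL=$3    # e.g. http://localhost:8983/solr/
LIMIT=$4       # number of iterations, e.g. 5

# InjectorJob runs once, before the iterations start.
bin/nutch inject "$SEEDDIR" -crawlId "$CRAWL_ID"

for ((i = 1; i <= LIMIT; i++)); do
  echo "Iteration $i of $LIMIT"
  bin/nutch generate -topN 50000 -crawlId "$CRAWL_ID"         # GeneratorJob
  bin/nutch fetch -all -crawlId "$CRAWL_ID"                   # FetcherJob
  bin/nutch parse -all -crawlId "$CRAWL_ID"                   # ParserJob
  bin/nutch updatedb -crawlId "$CRAWL_ID"                     # DbUpdaterJob
  bin/nutch solrindex "$SOLR_URL" -all -crawlId "$CRAWL_ID"   # SolrIndexerJob
done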

2. Crawl log

When crawling with the crawl command, the console log looks like this:

InjectorJob: starting at 2014-07-08 10:41:27
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 2
Injector: finished at 2014-07-08 10:41:32, elapsed: 00:00:05
Tue Jul 8 10:41:33 CST 2014 : Iteration 1 of 5
Generating batchId
Generating a new fetchlist
GeneratorJob: starting at 2014-07-08 10:41:34
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2014-07-08 10:41:39, time elapsed: 00:00:05
GeneratorJob: generated batch id: 1404787293-26339
Fetching :
FetcherJob: starting
FetcherJob: batchId: 1404787293-26339
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 50
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : 1404798101129
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 2 records. Hit by time limit :0
fetching http://www.csdn.net/ (queue crawl delay=5000ms)
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
fetching http://www.itpub.net/ (queue crawl delay=5000ms)
-finishing thread FetcherThread47, activeThreads=48
-finishing thread FetcherThread46, activeThreads=47
-finishing thread FetcherThread45, activeThreads=46
-finishing thread FetcherThread44, activeThreads=45
-finishing thread FetcherThread43, activeThreads=44
-finishing thread FetcherThread42, activeThreads=43
-finishing thread FetcherThread41, activeThreads=42
-finishing thread FetcherThread40, activeThreads=41
-finishing thread FetcherThread39, activeThreads=40
-finishing thread FetcherThread38, activeThreads=39
-finishing thread FetcherThread37, activeThreads=38
-finishing thread FetcherThread36, activeThreads=37
-finishing thread FetcherThread35, activeThreads=36
-finishing thread FetcherThread34, activeThreads=35
-finishing thread FetcherThread33, activeThreads=34
-finishing thread FetcherThread32, activeThreads=33
-finishing thread FetcherThread31, activeThreads=32
-finishing thread FetcherThread30, activeThreads=31
-finishing thread FetcherThread29, activeThreads=30
-finishing thread FetcherThread48, activeThreads=29
-finishing thread FetcherThread27, activeThreads=29
-finishing thread FetcherThread26, activeThreads=28
-finishing thread FetcherThread25, activeThreads=27
-finishing thread FetcherThread24, activeThreads=26
-finishing thread FetcherThread23, activeThreads=25
-finishing thread FetcherThread22, activeThreads=24
-finishing thread FetcherThread21, activeThreads=23
-finishing thread FetcherThread20, activeThreads=22
-finishing thread FetcherThread19, activeThreads=21
-finishing thread FetcherThread18, activeThreads=20
-finishing thread FetcherThread17, activeThreads=19
-finishing thread FetcherThread16, activeThreads=18
-finishing thread FetcherThread15, activeThreads=17
-finishing thread FetcherThread14, activeThreads=16
-finishing thread FetcherThread13, activeThreads=15
-finishing thread FetcherThread12, activeThreads=14
-finishing thread FetcherThread11, activeThreads=13
-finishing thread FetcherThread10, activeThreads=12
-finishing thread FetcherThread9, activeThreads=11
-finishing thread FetcherThread8, activeThreads=10
-finishing thread FetcherThread7, activeThreads=9
-finishing thread FetcherThread5, activeThreads=8
-finishing thread FetcherThread4, activeThreads=7
-finishing thread FetcherThread3, activeThreads=6
-finishing thread FetcherThread2, activeThreads=5
-finishing thread FetcherThread49, activeThreads=4
-finishing thread FetcherThread6, activeThreads=3
-finishing thread FetcherThread28, activeThreads=2
-finishing thread FetcherThread0, activeThreads=1
fetch of http://www.itpub.net/ failed with: java.io.IOException: unzipBestEffort returned null
-finishing thread FetcherThread1, activeThreads=0
0/0 spinwaiting/active, 2 pages, 1 errors, 0.4 0 pages/s, 93 93 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: done
Parsing :
ParserJob: starting
ParserJob: resuming:    false
ParserJob: forced reparse:      false
ParserJob: batchId:     1404787293-26339
Parsing http://www.csdn.net/
http://www.csdn.net/ skipped. Content of size 92777 was truncated to 59561
Parsing http://www.itpub.net/
ParserJob: success
CrawlDB update for csdnitpub
DbUpdaterJob: starting
DbUpdaterJob: done
Indexing csdnitpub on SOLR index -> http://ip:8983/solr/
SolrIndexerJob: starting
SolrIndexerJob: done.
SOLR dedup -> http://ip:8983/solr/
Tue Jul 8 10:42:18 CST 2014 : Iteration 2 of 5
Generating batchId
Generating a new fetchlist
GeneratorJob: starting at 2014-07-08 10:42:19
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: false
GeneratorJob: normalizing: false
GeneratorJob: topN: 50000
GeneratorJob: finished at 2014-07-08 10:42:25, time elapsed: 00:00:05
GeneratorJob: generated batch id: 1404787338-30453
Fetching :
FetcherJob: starting
FetcherJob: batchId: 1404787338-30453
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 50
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : 1404798146676
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 0 records. Hit by time limit :0

II. Running the Crawl Step by Step

Each of these steps can also be run as a separate command. (In Nutch 1.x they operate on a crawlDb, a linkDb, and a set of segments; in Nutch 2.x all crawl state lives in a single Gora-backed <crawlId>_webpage table instead.)
1. InjectorJob
This step initializes the crawl by injecting the URLs in seed.txt into the crawl queue.
(1) Basic command
# bin/nutch inject urls/
InjectorJob: starting at 2014-08-15 21:17:01
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 2
InjectorJob: total number of urls injected after normalization and filtering: 3
Injector: finished at 2014-08-15 21:17:06, elapsed: 00:00:05
The contents of urls/seed.txt are:

http://money.163.com/
http://www.hexun.com/
http://www.gw.com.cn/
(2) Inspecting the injected URLs
The step above creates a new table in HBase (named test_1_webpage here); the injected URLs and their metadata are written into it:

hbase(main):007:0> scan 'test_1_webpage'
ROW                    COLUMN+CELL
 cn.com.gw.www:http/   column=f:fi, timestamp=1408086716518, value=\x00'\x8D\x00
 cn.com.gw.www:http/   column=f:ts, timestamp=1408086716518, value=\x00\x00\x01G\xD8\x82\x1B"
 cn.com.gw.www:http/   column=mk:_injmrk_, timestamp=1408086716518, value=y
 cn.com.gw.www:http/   column=mk:dist, timestamp=1408086716518, value=0
 cn.com.gw.www:http/   column=mtdt:_csh_, timestamp=1408086716518, value=?\x80\x00\x00
 cn.com.gw.www:http/   column=s:s, timestamp=1408086716518, value=?\x80\x00\x00
 com.163.money:http/   column=f:fi, timestamp=1408086716518, value=\x00'\x8D\x00
 com.163.money:http/   column=f:ts, timestamp=1408086716518, value=\x00\x00\x01G\xD8\x82\x1B"
 com.163.money:http/   column=mk:_injmrk_, timestamp=1408086716518, value=y
 com.163.money:http/   column=mk:dist, timestamp=1408086716518, value=0
 com.163.money:http/   column=mtdt:_csh_, timestamp=1408086716518, value=?\x80\x00\x00
 com.163.money:http/   column=s:s, timestamp=1408086716518, value=?\x80\x00\x00
 com.hexun.www:http/   column=f:fi, timestamp=1408086716518, value=\x00'\x8D\x00
 com.hexun.www:http/   column=f:ts, timestamp=1408086716518, value=\x00\x00\x01G\xD8\x82\x1B"
 com.hexun.www:http/   column=mk:_injmrk_, timestamp=1408086716518, value=y
 com.hexun.www:http/   column=mk:dist, timestamp=1408086716518, value=0
 com.hexun.www:http/   column=mtdt:_csh_, timestamp=1408086716518, value=?\x80\x00\x00
 com.hexun.www:http/   column=s:s, timestamp=1408086716518, value=?\x80\x00\x00
3 row(s) in 0.1100 seconds

(3) About the <crawlId>_webpage table
For every crawl job, a table named <crawlId>_webpage is created; information on all URLs, fetched and unfetched, is stored in it.
An unfetched URL has only a few fields in its row; once a URL has been fetched, the fetched data, such as the page content, is added to the same row.
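To inspect a single URL's row, the HBase shell's get command can be used. Note the row key is the URL with its host reversed; the table name test_2_webpage below assumes a crawl started with -crawlId test_2:

hbase(main):001:0> get 'test_2_webpage', 'com.163.money:http/'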

2. GeneratorJob
(1) Basic command
# bin/nutch generate -crawlId test_2

GeneratorJob: starting at 2014-08-15 21:24:49
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: normalizing: true
GeneratorJob: finished at 2014-08-15 21:24:55, time elapsed: 00:00:05
GeneratorJob: generated batch id: 1408109089-403376773
(2) Command options
# bin/nutch generate
Usage: GeneratorJob [-topN N] [-crawlId id] [-noFilter] [-noNorm] [-adddays numDays]
    -topN <N>      - number of top URLs to be selected, default is Long.MAX_VALUE
    -crawlId <id>  - the id to prefix the schemas to operate on, (default: storage.crawl.id)
    -noFilter      - do not activate the filter plugin to filter the url, default is true
    -noNorm        - do not activate the normalizer plugin to normalize the url, default is true
    -adddays       - Adds numDays to the current time to facilitate crawling urls already fetched sooner then db.fetch.interval.default. Default value is 0.
    -batchId       - the batch id
----------------------
Please set the params.
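As an illustration of these options, the following (with an arbitrary topN value) would put at most the 1000 best-scoring due URLs into the new batch:

# bin/nutch generate -topN 1000 -crawlId test_2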
3. FetcherJob
(1) Basic command
# bin/nutch fetch -all -crawlId test_2
FetcherJob: starting
FetcherJob: fetching all
Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 3 records. Hit by time limit :0
fetching http://www.gw.com.cn/ (queue crawl delay=5000ms)
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
fetching http://www.hexun.com/ (queue crawl delay=5000ms)
-finishing thread FetcherThread2, activeThreads=8
-finishing thread FetcherThread7, activeThreads=7
-finishing thread FetcherThread6, activeThreads=6
-finishing thread FetcherThread5, activeThreads=5
-finishing thread FetcherThread4, activeThreads=4
-finishing thread FetcherThread3, activeThreads=3
fetching http://money.163.com/ (queue crawl delay=5000ms)
-finishing thread FetcherThread9, activeThreads=3
-finishing thread FetcherThread1, activeThreads=2
-finishing thread FetcherThread0, activeThreads=1
-finishing thread FetcherThread8, activeThreads=0
0/0 spinwaiting/active, 3 pages, 0 errors, 0.6 1 pages/s, 307 307 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: done
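The log above shows the default of 10 fetcher threads. Assuming the -threads option that FetcherJob accepts in Nutch 2.x, the thread pool can be widened like this:

# bin/nutch fetch -all -crawlId test_2 -threads 50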
4. ParserJob
(1) Basic command
# bin/nutch parse -all -crawlId test_2

ParserJob: starting
ParserJob: resuming:    false
ParserJob: forced reparse:      false
ParserJob: parsing all
Parsing http://www.gw.com.cn/
Parsing http://money.163.com/
Parsing http://www.hexun.com/
ParserJob: success
(2) Command options
# bin/nutch parse
Usage: ParserJob (<batchId> | -all) [-crawlId <id>] [-resume] [-force]
    <batchId>     - symbolic batch ID created by Generator
    -crawlId <id> - the id to prefix the schemas to operate on,
                    (default: storage.crawl.id)
    -all          - consider pages from all crawl jobs
    -resume       - resume a previous incomplete job
    -force        - force re-parsing even if a page is already parsed
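For example, to force a re-parse of only the batch generated earlier (reusing batch id 1408109089-403376773 from the GeneratorJob run above):

# bin/nutch parse 1408109089-403376773 -crawlId test_2 -force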

5. DbUpdaterJob
(1) Basic command
# bin/nutch updatedb
DbUpdaterJob: starting
DbUpdaterJob: done
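When the crawl was started with an explicit crawl id, the same id presumably needs to be passed here as well so that the matching <crawlId>_webpage table is updated; this -crawlId usage is assumed by analogy with the other jobs:

# bin/nutch updatedb -crawlId test_2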
6. SolrIndexerJob
(1) Basic command
# bin/nutch solrindex http://182.92.160.44:8583/solr/ -crawlId test_2
SolrIndexerJob: starting
SolrIndexerJob: done.

(2) Command options
# bin/nutch solrindex
Usage: SolrIndexerJob <solr url> (<batchId> | -all | -reindex) [-crawlId <id>]
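Per the usage line, one of <batchId>, -all or -reindex is required; for instance, to index every parsed page from this crawl:

# bin/nutch solrindex http://182.92.160.44:8583/solr/ -all -crawlId test_2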
