Nutch Study Notes, Part 2: A Brief Look at the Crawl Process

Environment: Ubuntu

Overview:

Nutch is an open-source search engine written in Java. It provides all the tools needed to run your own search engine, including full-text search and a web crawler.

Hadoop, Tika, and Gora all started life as parts of the Nutch project.

First, install Subversion and Ant (here Nutch is used by building it from source).

apt-get install ant
apt-get install subversion

[email protected]:~/data/nutch$ svn co https://svn.apache.org/repos/asf/nutch/tags/release-1.6/    
[email protected]:~/data/nutch$ cd release-1.6/    
[email protected]:~/data/nutch/release-1.6$ ant    
[email protected]:~/data/nutch/release-1.6$ cd runtime/

Note: the runtime directory contains two subdirectories, corresponding to Nutch's two run modes; deploy depends on Hadoop.    
[email protected]:~/data/nutch/release-1.6/runtime$ ls      
deploy  local

So what ties Nutch and Hadoop together?    
[email protected]:~/data/nutch/release-1.6/runtime$ ls deploy/      
apache-nutch-1.6.job  bin

The bin/nutch script. It uses the hadoop command to submit apache-nutch-1.6.job to Hadoop's JobTracker.
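Very roughly, the script decides between the two modes like this. The following is a simplified sketch of the idea, not the literal script; the variable names are placeholders, and only the crawl sub-command is shown:

# simplified idea of bin/nutch (placeholder variables, not the real script)
if [ -f "$NUTCH_HOME/apache-nutch-1.6.job" ]; then
    # deploy mode: hand the packaged job to Hadoop, which submits it to the JobTracker
    exec hadoop jar "$NUTCH_HOME/apache-nutch-1.6.job" org.apache.nutch.crawl.Crawl "$@"
else
    # local mode: run the same class directly in a local JVM with the local classpath
    exec java -cp "$NUTCH_CLASSPATH" org.apache.nutch.crawl.Crawl "$@"
fi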

[email protected]:~/data/nutch/release-1.6/runtime$ cd local/  
[email protected]:~/data/nutch/release-1.6/runtime/local$ mkdir urls    
[email protected]:~/data/nutch/release-1.6/runtime/local$ touch urls/url.txt    
[email protected]:~/data/nutch/release-1.6/runtime/local$ vi urls/url.txt    
Note: put the seed URL http://blog.tianya.cn into urls/url.txt.

[email protected]:~/data/nutch/release-1.6/runtime/local$ ./bin/nutch crawl    
Usage: Crawl <urlDir> -solr <solrURL> [-dir d] [-threads n] [-depth i] [-topN N]    
[email protected]:~/data/nutch/release-1.6/runtime/local$ nohup ./bin/nutch crawl urls -dir data -threads 100 -depth 3 &

Note: for a run summary, check nohup.out:
[email protected]:~/data/nutch/release-1.6/runtime/local$ cat nohup.out      
The detailed run log is written to logs/hadoop.log:

[email protected]:~/data/nutch/release-1.6/runtime/local$ ls logs/    
hadoop.log
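To follow a crawl while it is still running, it is handy to tail this log, for example:

tail -f logs/hadoop.log          # follow the crawl in real time
grep -i error logs/hadoop.log    # quickly check whether anything failed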

Looking at nohup.out, an exception shows up:  
[email protected]:~/data/nutch/release-1.6/runtime/local$ cat nohup.out    
solrUrl is not set, indexing will be skipped...    
crawl started in: data    
rootUrlDir = urls    
threads = 100    
depth = 3    
solrUrl=null    
Injector: starting at 2013-12-08 21:10:30    
Injector: crawlDb: data/crawldb    
Injector: urlDir: urls    
Injector: Converting injected urls to crawl db entries.    
solrUrl is not set, indexing will be skipped...    
crawl started in: data    
rootUrlDir = urls    
threads = 100    
depth = 3    
solrUrl=null    
Injector: starting at 2013-12-08 21:10:38    
Injector: crawlDb: data/crawldb    
Injector: urlDir: urls    
Injector: Converting injected urls to crawl db entries.    
Injector: total number of urls rejected by filters: 0    
Injector: total number of urls injected after normalization and filtering: 1    
Injector: Merging injected urls into crawl db.    
Injector: finished at 2013-12-08 21:10:53, elapsed: 00:00:14    
Generator: starting at 2013-12-08 21:10:53    
Generator: Selecting best-scoring urls due for fetch.    
Generator: filtering: true    
Generator: normalizing: true    
Generator: jobtracker is ‘local‘, generating exactly one partition.    
Generator: Partitioning selected urls for politeness.    
Generator: segment: data/segments/20131208211101    
Generator: finished at 2013-12-08 21:11:08, elapsed: 00:00:15    
Fetcher: No agents listed in ‘http.agent.name‘ property.    
Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in ‘http.agent.name‘ property.      
    at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1389)    
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1274)    
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)    
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)    
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

[Solution]    
[email protected]:~/data/nutch/release-1.6/runtime/local$ vi conf/nutch-site.xml    
Open conf/nutch-site.xml and add the "http.agent.name" property there (conf/nutch-default.xml holds the default configuration).    
<configuration>    
    <property>    
      <name>http.agent.name</name>    
      <value>Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0; WUID=11ec69f3ac129124d5a2480d127648e0; WTB=2938) Gecko/20100101 Firefox/20.0</value>    
      <description>HTTP ‘User-Agent‘ request header. MUST NOT be empty -    
      please set this to a single word uniquely related to your organization.

      NOTE: You should also check other related properties:

        http.robots.agents
        http.agent.description
        http.agent.url
        http.agent.email
        http.agent.version

      and set their values appropriately.

</description>  
    </property>    
</configuration>

(If you edit the config file in the source tree instead, i.e. release-1.6/conf/nutch-site.xml, you have to re-run ant after the change.)
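In that case the rebuild is just another ant run from the source root, for example:

cd ~/data/nutch/release-1.6
vi conf/nutch-site.xml     # edit the config in the source tree
ant                        # rebuild; runtime/local and runtime/deploy are regenerated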

[email protected]:~/data/nutch/release-1.6/runtime/local$ ls data/  
crawldb  linkdb  segments

Next time: how to inspect the crawled data in detail.

Summary: the key to getting started with Nutch is understanding the bin/nutch script.
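A quick way to get that overview is simply to read the script and see how it maps sub-commands to Java classes, for example:

less bin/nutch                 # read the dispatcher script
grep -n crawl bin/nutch        # find where the crawl sub-command is handled
# in 1.6 the crawl sub-command resolves to org.apache.nutch.crawl.Crawl,
# the same class that shows up in the stack traces above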

References:

http://yangshangchuan.iteye.com/category/275433

http://www.oschina.net/translate/nutch-tutorial  Nutch Tutorial

The previous notes (http://www.cnblogs.com/huligong1234/p/3464371.html) covered installing Nutch and getting a first crawl running.

There, the seed URL http://blog.tianya.cn was configured and the crawl was started with nohup ./bin/nutch crawl urls -dir data -threads 100 -depth 3 &

These notes walk through what that crawl actually does.

First, a quick summary of the commonly used crawl options:

Options:

  • -dir dir names the directory where the crawl data is stored.
  • -threads threads sets the number of threads fetching in parallel.
  • -depth depth is the link depth, counted from the root pages, to which the crawl should go.
  • -topN N caps the number of pages fetched at each depth level.

In our crawl command, nohup ./bin/nutch crawl urls -dir data -threads 100 -depth 3 &,

depth was set to 3, which limits the crawl depth to 3: the crawler is told that three rounds of "generate/fetch/update" are enough. Two questions need answering: what exactly is one "generate/fetch/update" round, and what happens inside each round?

Let's take it step by step, starting with the log:

[email protected]:~/data/nutch/release-1.6/runtime/local$ pwd
/home/hu/data/nutch/release-1.6/runtime/local      
[email protected]:~/data/nutch/release-1.6/runtime/local$ less nohup.out


………

Injector: starting at 2013-12-08 21:36:58            
Injector: crawlDb: data/crawldb            
Injector: urlDir: urls            
Injector: Converting injected urls to crawl db entries.            
Injector: total number of urls rejected by filters: 0            
Injector: total number of urls injected after normalization and filtering: 1            
Injector: Merging injected urls into crawl db.            
Injector: finished at 2013-12-08 21:37:15, elapsed: 00:00:17            
Generator: starting at 2013-12-08 21:37:15            
Generator: Selecting best-scoring urls due for fetch.            
Generator: filtering: true            
Generator: normalizing: true            
Generator: jobtracker is ‘local‘, generating exactly one partition.            
Generator: Partitioning selected urls for politeness.            
Generator: segment: data/segments/20131208213723            
Generator: finished at 2013-12-08 21:37:30, elapsed: 00:00:15            
Fetcher: Your ‘http.agent.name‘ value should be listed first in ‘http.robots.agents‘ property.            
Fetcher: starting at 2013-12-08 21:37:30            
Fetcher: segment: data/segments/20131208213723            
Using queue mode : byHost            
Fetcher: threads: 100            
Fetcher: time-out divisor: 2            
QueueFeeder finished: total 1 records + hit by time limit :0

……………

Fetcher: finished at 2013-12-08 21:37:37, elapsed: 00:00:07            
ParseSegment: starting at 2013-12-08 21:37:37            
ParseSegment: segment: data/segments/20131208213723            
Parsed (14ms):http://blog.tianya.cn/            
ParseSegment: finished at 2013-12-08 21:37:45, elapsed: 00:00:07            
CrawlDb update: starting at 2013-12-08 21:37:45            
CrawlDb update: db: data/crawldb            
CrawlDb update: segments: [data/segments/20131208213723]            
CrawlDb update: additions allowed: true            
CrawlDb update: URL normalizing: true            
CrawlDb update: URL filtering: true            
CrawlDb update: 404 purging: false            
CrawlDb update: Merging segment data into db.            
CrawlDb update: finished at 2013-12-08 21:37:58, elapsed: 00:00:13            
Generator: starting at 2013-12-08 21:37:58            
Generator: Selecting best-scoring urls due for fetch.            
Generator: filtering: true            
Generator: normalizing: true            
Generator: jobtracker is ‘local‘, generating exactly one partition.            
Generator: Partitioning selected urls for politeness.            
Generator: segment: data/segments/20131208213806            
Generator: finished at 2013-12-08 21:38:13, elapsed: 00:00:15            
Fetcher: Your ‘http.agent.name‘ value should be listed first in ‘http.robots.agents‘ property.            
Fetcher: starting at 2013-12-08 21:38:13            
Fetcher: segment: data/segments/20131208213806            
Using queue mode : byHost

The log above shows that the crawl starts with the Injector, which injects the initial URLs, i.e. stores the URLs from the seed text file into the crawldb.

The crawl proceeds as:    
Injector->    
                  Generator->Fetcher->ParseSegment->CrawlDb update  depth=1    
                  Generator->Fetcher->ParseSegment->CrawlDb update  depth=2    
                  Generator->Fetcher->ParseSegment->CrawlDb update->LinkDb  depth=3    
i.e. the Generator->Fetcher->ParseSegment->CrawlDb update sequence runs in a loop;    
the first round injects the seed URLs, and after that every round generates a fetch list, fetches the pages, parses them, and updates the crawldb (the URL database) with what it found.


To summarize (an equivalent step-by-step command sketch follows this list):    
1) Build the initial URL set    
2) Inject the URL set into the crawldb database ---inject    
3) Generate a fetch list from the crawldb ---generate    
4) Fetch the pages ---fetch    
5) Parse the fetched content ---parse segment    
6) Update the database with the pages that were fetched ---updatedb    
7) Repeat steps 3-6 until the configured depth is reached ---this loop is the "generate/fetch/update" cycle    
8) Update the linkdb database from the segments ---invertlinks    
9) Build the index ---index
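The crawl command drives all of these steps in one go, but Nutch also exposes them as individual sub-commands, so one round of the loop can be run by hand. A rough sketch (options are kept minimal, and the way the newest segment is picked below is only an illustration):

bin/nutch inject data/crawldb urls                        # 2) inject seed urls into the crawldb
bin/nutch generate data/crawldb data/segments -topN 1000  # 3) generate a fetch list (a new segment)
s=`ls -d data/segments/2* | tail -1`                      # pick the segment that was just generated
bin/nutch fetch $s                                        # 4) fetch the pages
bin/nutch parse $s                                        # 5) parse the fetched content
bin/nutch updatedb data/crawldb $s                        # 6) merge the results back into the crawldb
# repeat generate/fetch/parse/updatedb once per extra depth level, then:
bin/nutch invertlinks data/linkdb -dir data/segments      # 8) build the linkdb from all segments
# 9) indexing (bin/nutch solrindex ...) is skipped here, since no Solr instance is configured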

When the crawl finishes, three directories have been created (crawldb, linkdb, segments):

[email protected]:~/data/nutch/release-1.6/runtime/local$ ls ./data/        
crawldb  linkdb  segments        
[email protected]:~/data/nutch/release-1.6/runtime/local$ ls ./data/crawldb/        
current  old        
[email protected]:~/data/nutch/release-1.6/runtime/local$ ls ./data/crawldb/current/        
part-00000        
[email protected]:~/data/nutch/release-1.6/runtime/local$ ls ./data/crawldb/current/part-00000/        
data  index

[email protected]:~/data/nutch/release-1.6/runtime/local$ ls ./data/crawldb/current/part-00000/data      
./data/crawldb/current/part-00000/data      
[email protected]:~/data/nutch/release-1.6/runtime/local$ ls ./data/crawldb/current/part-00000/index      
./data/crawldb/current/part-00000/index

Nutch's data files:      
crawldb: the crawl database, which stores the URLs to be crawled.      
linkdb: the link database, which stores the known links for each URL, both source and target.      
segments: each batch of URLs fetched as a unit forms one segment.

crawldb

The crawldb holds URLs. In the first round the given seed URL http://blog.tianya.cn is injected; after each round the crawldb is updated with the URLs discovered during that round, so in the next round (depth=2) the crawler pulls a fresh set of URLs from the crawldb and starts a new fetch cycle.

The crawldb contains two folders: current and old. current is the current URL set and old is a backup of the previous one; whenever a new current is produced, the previous one is renamed to old.    
current and old have the same structure: each contains a part-00000 folder (only one in local mode), and inside part-00000 there are two files, data and index, one holding the data and the other the index.

Nutch also provides a command for inspecting the state of the crawldb folder (readdb):

[email protected]:~/data/nutch/release-1.6/runtime/local$ ./bin/nutch readdb      
Usage: CrawlDbReader <crawldb> (-stats | -dump <out_dir> | -topN <nnnn> <out_dir> [<min>] | -url <url>)    
    <crawldb>    directory name where crawldb is located    
    -stats [-sort]     print overall statistics to System.out    
        [-sort]    list status sorted by host    
    -dump <out_dir> [-format normal|csv|crawldb]    dump the whole db to a text file in <out_dir>    
        [-format csv]    dump in Csv format    
        [-format normal]    dump in standard format (default option)    
        [-format crawldb]    dump as CrawlDB    
        [-regex <expr>]    filter records with expression    
        [-status <status>]    filter records by CrawlDatum status    
    -url <url>    print information on <url> to System.out    
    -topN <nnnn> <out_dir> [<min>]    dump top <nnnn> urls sorted by score to <out_dir>    
        [<min>]    skip records with scores below this value.    
            This can significantly improve performance.    
[email protected]:~/data/nutch/release-1.6/runtime/local$ ./bin/nutch readdb ./data/crawldb -stats    
CrawlDb statistics start: ./data/crawldb    
Statistics for CrawlDb: ./data/crawldb    
TOTAL urls:    2520    
retry 0:    2520    
min score:    0.0    
avg score:    8.8253967E-4    
max score:    1.014    
status 1 (db_unfetched):    2346    
status 2 (db_fetched):    102    
status 3 (db_gone):    1    
status 4 (db_redir_temp):    67    
status 5 (db_redir_perm):    4    
CrawlDb statistics: done

Notes:    
The -stats option is a very handy way to get a quick overview of the crawl:

TOTAL urls: the number of URLs currently in the crawldb.    
db_unfetched: pages that are linked from fetched pages but have not been fetched themselves (either because they were rejected by the URL filters or because they fell outside the topN cut-off and were dropped by Nutch).    
db_gone: pages that returned a 404 or some other error judged to be permanent; this status prevents them from being fetched again.    
db_fetched: pages that have been fetched and indexed; if this is 0, something has definitely gone wrong.    
db_redir_temp and db_redir_perm: pages that were temporary and permanent redirects, respectively.

min score, avg score and max score are statistics from the scoring algorithm, which measures page importance; we will not go into it here.
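As the usage output above shows, -stats also takes a -sort flag that breaks the status counts down by host, for example:

./bin/nutch readdb ./data/crawldb -stats -sort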

In addition, readdb's -dump option can write the whole crawldb out to a text file for inspection:

[email protected]:~/data/nutch/release-1.6/runtime/local$ ./bin/nutch readdb ./data/crawldb -dump crawl_tianya_out      
CrawlDb dump: starting      
CrawlDb db: ./data/crawldb      
CrawlDb dump: done      
[email protected]:~/data/nutch/release-1.6/runtime/local$ ls ./crawl_tianya_out/      
part-00000      
[email protected]:~/data/nutch/release-1.6/runtime/local$ less ./crawl_tianya_out/part-00000


http://100w.tianya.cn/  Version: 7            
Status: 1 (db_unfetched)            
Fetch time: Sun Dec 08 21:42:34 CST 2013            
Modified time: Thu Jan 01 08:00:00 CST 1970            
Retries since fetch: 0            
Retry interval: 2592000 seconds (30 days)            
Score: 1.3559322E-5            
Signature: null            
Metadata:

http://aimin_001.blog.tianya.cn/        Version: 7            
Status: 4 (db_redir_temp)            
Fetch time: Tue Jan 07 21:38:13 CST 2014            
Modified time: Thu Jan 01 08:00:00 CST 1970            
Retries since fetch: 0            
Retry interval: 2592000 seconds (30 days)            
Score: 0.016949153            
Signature: null            
Metadata: Content-Type: text/html_pst_: temp_moved(13), lastModified=0: http://blog.tianya.cn/blogger/blog_main.asp?BlogID=134876

http://alice.tianya.cn/ Version: 7            
Status: 1 (db_unfetched)            
Fetch time: Sun Dec 08 21:42:34 CST 2013            
Modified time: Thu Jan 01 08:00:00 CST 1970            
Retries since fetch: 0            
Retry interval: 2592000 seconds (30 days)            
Score: 3.3898305E-6            
Signature: null            
Metadata:

http://anger.blog.tianya.cn/    Version: 7            
Status: 4 (db_redir_temp)            
Fetch time: Tue Jan 07 21:38:13 CST 2014            
Modified time: Thu Jan 01 08:00:00 CST 1970            
Retries since fetch: 0            
Retry interval: 2592000 seconds (30 days)            
Score: 0.016949153            
Signature: null            
Metadata: Content-Type: text/html_pst_: temp_moved(13), lastModified=0: http://blog.tianya.cn/blogger/blog_main.asp?BlogID=219280

………………

As the dump shows, each record stores the status, fetch time, modified time, retry interval, score, signature, metadata and so on for every URL.
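Because the full dump grows large quickly, the -status and -regex options listed in the readdb usage above can narrow it down, for example (the output directory names here are arbitrary, and the exact status spelling is worth checking against the -stats output):

# dump only records that were actually fetched
./bin/nutch readdb ./data/crawldb -dump crawl_tianya_fetched -status db_fetched
# dump only records whose url matches a regular expression
./bin/nutch readdb ./data/crawldb -dump crawl_tianya_blog -regex '.*blog\.tianya\.cn.*'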

A single URL can also be looked up with the -url option:

[email protected]:~/data/nutch/release-1.6/runtime/local$ ./bin/nutch readdb ./data/crawldb -url http://zzbj.tianya.cn/    
URL: http://zzbj.tianya.cn/    
Version: 7    
Status: 1 (db_unfetched)    
Fetch time: Sun Dec 08 21:42:34 CST 2013    
Modified time: Thu Jan 01 08:00:00 CST 1970    
Retries since fetch: 0    
Retry interval: 2592000 seconds (30 days)    
Score: 7.6175966E-6    
Signature: null    
Metadata:

segments

Each segment is a set of URLs that was fetched as one unit. A segment is a directory with the following subdirectories:

  • crawl_generate names the set of URLs that will be fetched;
  • crawl_fetch holds the status of fetching each URL;
  • content holds the raw content retrieved from each URL;
  • parse_text holds the parsed text of each URL;
  • parse_data holds the outlinks and metadata parsed from each URL's content;
  • crawl_parse holds the outlink URLs, used to update the crawldb.

A quick digression: checking the tail of nohup.out revealed another exception:

[email protected]:~/data/nutch/release-1.6/runtime/local$ tail -n 50 nohup.out            
Parsed (1ms):http://www.tianya.cn/52491364            
Parsed (0ms):http://www.tianya.cn/55086751            
Parsed (0ms):http://www.tianya.cn/73398397            
Parsed (0ms):http://www.tianya.cn/73792451            
Parsed (0ms):http://www.tianya.cn/74299859            
Parsed (0ms):http://www.tianya.cn/76154565            
Parsed (0ms):http://www.tianya.cn/81507846            
Parsed (0ms):http://www.tianya.cn/9887577            
Parsed (0ms):http://www.tianya.cn/mobile/            
Parsed (1ms):http://xinzhi.tianya.cn/            
Parsed (0ms):http://yuqing.tianya.cn/            
ParseSegment: finished at 2013-12-08 21:42:24, elapsed: 00:00:07            
CrawlDb update: starting at 2013-12-08 21:42:24            
CrawlDb update: db: data/crawldb            
CrawlDb update: segments: [data/segments/20131208213957]            
CrawlDb update: additions allowed: true            
CrawlDb update: URL normalizing: true            
CrawlDb update: URL filtering: true            
CrawlDb update: 404 purging: false            
CrawlDb update: Merging segment data into db.            
CrawlDb update: finished at 2013-12-08 21:42:37, elapsed: 00:00:13            
LinkDb: starting at 2013-12-08 21:42:37            
LinkDb: linkdb: data/linkdb            
LinkDb: URL normalize: true            
LinkDb: URL filter: true            
LinkDb: internal links will be ignored.            
LinkDb: adding segment: file:/home/hu/data/nutch/release-1.6/runtime/local/data/segments/20131208213957            
LinkDb: adding segment: file:/home/hu/data/nutch/release-1.6/runtime/local/data/segments/20131208211101            
LinkDb: adding segment: file:/home/hu/data/nutch/release-1.6/runtime/local/data/segments/20131208213723            
LinkDb: adding segment: file:/home/hu/data/nutch/release-1.6/runtime/local/data/segments/20131208213806            
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/hu/data/nutch/release-1.6/runtime/local/data/segments/20131208211101/parse_data            
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)            
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)            
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)            
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989)            
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981)            
    at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)            
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)            
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)            
    at java.security.AccessController.doPrivileged(Native Method)            
    at javax.security.auth.Subject.doAs(Subject.java:396)            
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)            
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)            
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)            
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)            
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:180)            
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:151)            
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)            
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)            
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)            
[email protected]:~/data/nutch/release-1.6/runtime/local$

As the log shows, this is a left-over from the http.agent.name problem in the previous notes: the run that failed on http.agent.name still created its (empty) segment directory. The fix is simple: delete the folder the error complains about.

[email protected]:~/data/nutch/release-1.6/runtime/local$ ls ./data/segments/      
20131208211101  20131208213723  20131208213806  20131208213957      
[email protected]:~/data/nutch/release-1.6/runtime/local$ rm -rf data/segments/20131208211101      
[email protected]:~/data/nutch/release-1.6/runtime/local$ ls ./data/segments/      
20131208213723  20131208213806  20131208213957      
[email protected]:~/data/nutch/release-1.6/runtime/local$

Because the crawl ran with depth 3, each round of the loop produces one segment, so three directories remain. Segments have a limited lifetime: once their pages are re-crawled, the old segments become obsolete. Segment folders are named after their creation time, which makes it easy to delete obsolete segments and reclaim disk space. Let's look at what each segment directory contains:

[email protected]:~/data/nutch/release-1.6/runtime/local$ ls data/segments/20131208213723/      
content  crawl_fetch  crawl_generate  crawl_parse  parse_data  parse_text

As you can see, a segment has the following subdirectories (mostly in binary format):

content: the content of each fetched page

crawl_fetch: the fetch status of each page    
crawl_generate: the list of URLs scheduled for fetching    
crawl_parse: the outlink URLs, used to update the crawldb    
parse_data: the outlinks and metadata of each page    
parse_text: the parsed text of each fetched page

When each of these is produced:

1. crawl_generate is written by the Generator;    
2. content and crawl_fetch are written by the Fetcher;    
3. crawl_parse, parse_data and parse_text are written during the parse segment step.

How to look inside these files, for example the raw page source stored in content, is covered later in this post.

linkdb

linkdb: the link database, which stores the known links for each URL, both source and target.

Because of the http.agent.name problem, nothing had been written into the linkdb, so I re-ran the crawl command. Below is the tail of nohup.out after that run finished:


。。。。。。

ParseSegment: finished at 2014-01-11 16:30:22, elapsed: 00:00:07            
CrawlDb update: starting at 2014-01-11 16:30:22            
CrawlDb update: db: data/crawldb            
CrawlDb update: segments: [data/segments/20140111162513]            
CrawlDb update: additions allowed: true            
CrawlDb update: URL normalizing: true            
CrawlDb update: URL filtering: true            
CrawlDb update: 404 purging: false            
CrawlDb update: Merging segment data into db.            
CrawlDb update: finished at 2014-01-11 16:30:35, elapsed: 00:00:13            
LinkDb: starting at 2014-01-11 16:30:35            
LinkDb: linkdb: data/linkdb            
LinkDb: URL normalize: true            
LinkDb: URL filter: true            
LinkDb: internal links will be ignored.            
LinkDb: adding segment: file:/home/hu/data/nutch/release-1.6/runtime/local/data/segments/20140111162237            
LinkDb: adding segment: file:/home/hu/data/nutch/release-1.6/runtime/local/data/segments/20140111162320            
LinkDb: adding segment: file:/home/hu/data/nutch/release-1.6/runtime/local/data/segments/20140111162513            
LinkDb: finished at 2014-01-11 16:30:48, elapsed: 00:00:13            
crawl finished: data

Now the contents of the linkdb can be inspected:

[email protected]:~/data/nutch/release-1.6/runtime/local$ bin/nutch readlinkdb ./data/linkdb -dump crawl_tianya_out_linkdb      
LinkDb dump: starting at 2014-01-11 16:39:42    
LinkDb dump: db: ./data/linkdb    
LinkDb dump: finished at 2014-01-11 16:39:49, elapsed: 00:00:07    
[email protected]:~/data/nutch/release-1.6/runtime/local$ ls ./crawl_tianya_out_linkdb/    
part-00000    
[email protected]:~/data/nutch/release-1.6/runtime/local$ head -n 10 ./crawl_tianya_out_linkdb/part-00000    
http://100w.tianya.cn/    Inlinks:    
fromUrl: http://star.tianya.cn/ anchor: [2012第一美差]    
fromUrl: http://star.tianya.cn/ anchor: 2013第一美差

http://aimin_001.blog.tianya.cn/    Inlinks:    
fromUrl: http://blog.tianya.cn/blog/mingbo anchor: 长沙艾敏    
fromUrl: http://blog.tianya.cn/ anchor: 长沙艾敏

http://alice.tianya.cn/    Inlinks:    
fromUrl: http://bj.tianya.cn/ anchor:    
[email protected]:~/data/nutch/release-1.6/runtime/local$

Some pages have several inlinks; the more inlinks a page has, the more important it is considered, and this feeds directly into its score. A site's front page, for example, typically has many inlinks.
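Since the dump prints one fromUrl: line per inlink, a quick way to rank pages by inlink count is to count those lines per target URL; a rough sketch:

awk '/Inlinks:/ {url=$1; next} /fromUrl:/ {count[url]++} END {for (u in count) print count[u], u}' \
    ./crawl_tianya_out_linkdb/part-00000 | sort -rn | head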

Other useful ways to inspect the data:

1. The timings of the crawl runs can be pulled out of the log as needed, for example:

[email protected]:~/data/nutch/release-1.6/runtime/local$ cat nohup.out | grep elapsed      
Injector: finished at 2013-12-08 21:10:53, elapsed: 00:00:14    
Generator: finished at 2013-12-08 21:11:08, elapsed: 00:00:15    
Injector: finished at 2013-12-08 21:37:15, elapsed: 00:00:17    
Generator: finished at 2013-12-08 21:37:30, elapsed: 00:00:15    
Fetcher: finished at 2013-12-08 21:37:37, elapsed: 00:00:07    
ParseSegment: finished at 2013-12-08 21:37:45, elapsed: 00:00:07    
CrawlDb update: finished at 2013-12-08 21:37:58, elapsed: 00:00:13    
Generator: finished at 2013-12-08 21:38:13, elapsed: 00:00:15    
Fetcher: finished at 2013-12-08 21:39:29, elapsed: 00:01:16    
ParseSegment: finished at 2013-12-08 21:39:36, elapsed: 00:00:07    
CrawlDb update: finished at 2013-12-08 21:39:49, elapsed: 00:00:13    
Generator: finished at 2013-12-08 21:40:04, elapsed: 00:00:15    
Fetcher: finished at 2013-12-08 21:42:17, elapsed: 00:02:13    
ParseSegment: finished at 2013-12-08 21:42:24, elapsed: 00:00:07    
CrawlDb update: finished at 2013-12-08 21:42:37, elapsed: 00:00:13    
Injector: finished at 2014-01-11 16:22:29, elapsed: 00:00:14    
Generator: finished at 2014-01-11 16:22:45, elapsed: 00:00:15    
Fetcher: finished at 2014-01-11 16:22:52, elapsed: 00:00:07    
ParseSegment: finished at 2014-01-11 16:22:59, elapsed: 00:00:07    
CrawlDb update: finished at 2014-01-11 16:23:12, elapsed: 00:00:13    
Generator: finished at 2014-01-11 16:23:27, elapsed: 00:00:15    
Fetcher: finished at 2014-01-11 16:24:48, elapsed: 00:01:21    
ParseSegment: finished at 2014-01-11 16:24:55, elapsed: 00:00:07    
CrawlDb update: finished at 2014-01-11 16:25:05, elapsed: 00:00:10    
Generator: finished at 2014-01-11 16:25:20, elapsed: 00:00:15    
Fetcher: finished at 2014-01-11 16:30:15, elapsed: 00:04:54    
ParseSegment: finished at 2014-01-11 16:30:22, elapsed: 00:00:07    
CrawlDb update: finished at 2014-01-11 16:30:35, elapsed: 00:00:13    
LinkDb: finished at 2014-01-11 16:30:48, elapsed: 00:00:13
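The elapsed values can also be summed to get a rough total running time. A small sketch that simply adds up the hh:mm:ss field of every elapsed: line in nohup.out (so it mixes all runs together):

grep elapsed nohup.out | awk -F'elapsed: ' '{split($2,t,":"); s+=t[1]*3600+t[2]*60+t[3]} END {printf "total: %d min %d sec\n", s/60, s%60}'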

2. Inspect the content directory under segments. As noted above, content holds the raw page source fetched during the crawl; it is stored in binary form and cannot be read directly, but Nutch provides a reader for it:

[email protected]:~/data/nutch/release-1.6/runtime/local$ bin/nutch readseg    
Usage: SegmentReader (-dump ... | -list ... | -get ...) [general options]

* General options:    
    -nocontent    ignore content directory    
    -nofetch    ignore crawl_fetch directory    
    -nogenerate    ignore crawl_generate directory    
    -noparse    ignore crawl_parse directory    
    -noparsedata    ignore parse_data directory    
    -noparsetext    ignore parse_text directory

* SegmentReader -dump <segment_dir> <output> [general options]    
  Dumps content of a <segment_dir> as a text file to <output>.

<segment_dir>    name of the segment directory.    
    <output>    name of the (non-existent) output directory.

* SegmentReader -list (<segment_dir1> ... | -dir <segments>) [general options]    
  List a synopsis of segments in specified directories, or all segments in    
  a directory <segments>, and print it on System.out

<segment_dir1> ...    list of segment directories to process    
    -dir <segments>        directory that contains multiple segments

* SegmentReader -get <segment_dir> <keyValue> [general options]    
  Get a specified record from a segment, and print it on System.out.

<segment_dir>    name of the segment directory.    
    <keyValue>    value of the key (url).    
        Note: put double-quotes around strings with spaces.

Viewing content:

content holds the raw content retrieved from each URL.    
[email protected]:~/data/nutch/release-1.6/runtime/local$ bin/nutch readseg -dump ./data/segments/20140111162237  ./data/crawl_tianya_seg_content -nofetch -nogenerate -noparse -noparsedata -noparsetext      
SegmentReader: dump segment: data/segments/20140111162237    
SegmentReader: done    
[email protected]:~/data/nutch/release-1.6/runtime/local$ ls ./data/crawl_tianya_seg_content/      
dump       .dump.crc 
[email protected]:~/data/nutch/release-1.6/runtime/local$ head -n 50 ./data/crawl_tianya_seg_content/dump

Recno:: 0    
URL:: http://blog.tianya.cn/

Content::    
Version: -1    
url: http://blog.tianya.cn/    
base: http://blog.tianya.cn/    
contentType: text/html    
metadata: Date=Sat, 11 Jan 2014 08:22:46 GMT Vary=Accept-Encoding Expires=Thu, 01 Nov 2012 10:00:00 GMT Content-Encoding=gzip nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20140111162237 Content-Type=text/html; charset=UTF-8 Connection=close Server=nginx Cache-Control=no-cache Pragma=no-cache    
Content:

<!DOCTYPE HTML>    
<html>    
<head>    
<meta charset="utf-8">    
<title>天涯博客_有见识的人都在此</title>    
<meta name="keywords" content="天涯,博客,天涯博客,天涯社区,天涯论坛,意见领袖" />    
<meta name="description" content="天涯博客是天涯社区开办的独立博客平台,这里可以表达网民立场,聚集意见领袖,众多草根精英以他们的观点影响社会的进程。天涯博客,有见识的人都在此!" />

<link href="http://static.tianyaui.com/global/ty/TY.css" rel="stylesheet" type="text/css" />    
<link href="http://static.tianyaui.com/global/blog/web/static/css/blog_56de4ad.css" rel="stylesheet" type="text/css" />    
<link rel="shortcut icon" href="http://static.tianyaui.com/favicon.ico" type="image/vnd.microsoft.icon" />    
<script type="text/javascript" charset="utf-8" src="http://static.tianyaui.com/global/ty/TY.js"></script>    
<!--[if lt IE 7]>    
  <script src="http://static.tianyaui.com/global/ty/util/image/DD_belatedPNG_0.0.8a.js?v=2013101509" type="text/javascript"></script>    
<![endif]-->    
</head>    
<body>    
<div id="huebox" >    
   
<script type="text/javascript" charset="utf-8">TY.loader("TY.ui.nav",function(){TY.ui.nav.init ({app_str:‘blog‘,topNavWidth: 1000,showBottomNav:false});});</script>

<div id="blogdoc" class="blogdoc blogindex">    
    <div id="hd"></div>    
    <div id="bd" class="layout-lmr clearfix">    
        <div id="left">    
           
           
<div class="sub-nav left-mod">    
    <ul class="text-list-2">    
        <li class="curr"><a class="ico-1" href="http://blog.tianya.cn/">博客首页</a></li>    
        <li class=""><a href="/blog/society">社会民生</a></li>    
        <li class=""><a href="/blog/international">国际观察</a></li>    
        <li class=""><a href="/blog/ent">娱乐</a></li>    
        <li class=""><a href="/blog/sports">体育</a></li>    
        <li class=""><a href="/blog/culture">文化</a></li>    
        <li class=""><a href="/blog/history">历史</a></li>    
        <li class=""><a href="/blog/life">生活</a></li>    
        <li class=""><a href="/blog/emotion">情感</a></li>    
[email protected]:~/data/nutch/release-1.6/runtime/local$

The other directories, such as crawl_fetch and parse_data, can be inspected in exactly the same way.

Viewing crawl_fetch:

crawl_fetch holds the fetch status of each URL.

[email protected]:~/data/nutch/release-1.6/runtime/local$ bin/nutch readseg -dump ./data/segments/20140111162237  ./data/crawl_tianya_seg_fetch -nocontent -nogenerate -noparse -noparsedata -noparsetext    
SegmentReader: dump segment: data/segments/20140111162237    
SegmentReader: done    
[email protected]:~/data/nutch/release-1.6/runtime/local$ head -n 50 ./data/crawl_tianya_seg_fetch/dump

Recno:: 0    
URL:: http://blog.tianya.cn/

CrawlDatum::    
Version: 7    
Status: 33 (fetch_success)    
Fetch time: Sat Jan 11 16:22:46 CST 2014    
Modified time: Thu Jan 01 08:00:00 CST 1970    
Retries since fetch: 0    
Retry interval: 2592000 seconds (30 days)    
Score: 1.0    
Signature: null    
Metadata: _ngt_: 1389428549880Content-Type: text/html_pst_: success(1), lastModified=0

[email protected]:~/data/nutch/release-1.6/runtime/local$

Viewing crawl_generate:

crawl_generate names the set of URLs that will be fetched.

[email protected]:~/data/nutch/release-1.6/runtime/local$ bin/nutch readseg -dump ./data/segments/20140111162237  ./data/crawl_tianya_seg_generate -nocontent -nofetch -noparse -noparsedata -noparsetext    
SegmentReader: dump segment: data/segments/20140111162237    
SegmentReader: done    
[email protected]:~/data/nutch/release-1.6/runtime/local$ head -n 50 ./data/crawl_tianya_seg_generate/dump

Recno:: 0    
URL:: http://blog.tianya.cn/

CrawlDatum::    
Version: 7    
Status: 1 (db_unfetched)    
Fetch time: Sat Jan 11 16:22:15 CST 2014    
Modified time: Thu Jan 01 08:00:00 CST 1970    
Retries since fetch: 0    
Retry interval: 2592000 seconds (30 days)    
Score: 1.0    
Signature: null    
Metadata: _ngt_: 1389428549880

Viewing crawl_parse:

crawl_parse holds the outlink URLs, used to update the crawldb.

[email protected]:~/data/nutch/release-1.6/runtime/local$ bin/nutch readseg -dump ./data/segments/20140111162237  ./data/crawl_tianya_seg_parse -nofetch -nogenerate -nocontent -noparsedata -noparsetext    
SegmentReader: dump segment: data/segments/20140111162237    
SegmentReader: done    
[email protected]:~/data/nutch/release-1.6/runtime/local$ head -n 50 ./data/crawl_tianya_seg_parse/dump

Recno:: 0    
URL:: http://aimin_001.blog.tianya.cn/

CrawlDatum::    
Version: 7    
Status: 67 (linked)    
Fetch time: Sat Jan 11 16:22:55 CST 2014    
Modified time: Thu Jan 01 08:00:00 CST 1970    
Retries since fetch: 0    
Retry interval: 2592000 seconds (30 days)    
Score: 0.016949153    
Signature: null    
Metadata:

Recno:: 1    
URL:: http://anger.blog.tianya.cn/

CrawlDatum::    
Version: 7    
Status: 67 (linked)    
Fetch time: Sat Jan 11 16:22:55 CST 2014    
Modified time: Thu Jan 01 08:00:00 CST 1970    
Retries since fetch: 0    
Retry interval: 2592000 seconds (30 days)    
Score: 0.016949153    
Signature: null    
Metadata:

。。。。

Viewing parse_data:

parse_data holds the outlinks and metadata parsed from each URL's content.

[email protected]:~/data/nutch/release-1.6/runtime/local$ bin/nutch readseg -dump ./data/segments/20140111162237  ./data/crawl_tianya_seg_data -nofetch -nogenerate -nocontent -noparse -noparsetext    
SegmentReader: dump segment: data/segments/20140111162237    
SegmentReader: done    
[email protected]:~/data/nutch/release-1.6/runtime/local$ head -n 50 ./data/crawl_tianya_seg_data/dump

Recno:: 0    
URL:: http://blog.tianya.cn/

ParseData::    
Version: 5    
Status: success(1,0)    
Title: 天涯博客_有见识的人都在此    
Outlinks: 59    
  outlink: toUrl: http://blog.tianya.cn/blog/society anchor: 社会民生    
  outlink: toUrl: http://blog.tianya.cn/blog/international anchor: 国际观察    
  outlink: toUrl: http://blog.tianya.cn/blog/ent anchor: 娱乐    
  outlink: toUrl: http://blog.tianya.cn/blog/sports anchor: 体育    
  outlink: toUrl: http://blog.tianya.cn/blog/culture anchor: 文化    
  outlink: toUrl: http://blog.tianya.cn/blog/history anchor: 历史    
  outlink: toUrl: http://blog.tianya.cn/blog/life anchor: 生活    
  outlink: toUrl: http://blog.tianya.cn/blog/emotion anchor: 情感    
  outlink: toUrl: http://blog.tianya.cn/blog/finance anchor: 财经    
  outlink: toUrl: http://blog.tianya.cn/blog/stock anchor: 股市    
  outlink: toUrl: http://blog.tianya.cn/blog/food anchor: 美食    
  outlink: toUrl: http://blog.tianya.cn/blog/travel anchor: 旅游    
  outlink: toUrl: http://blog.tianya.cn/blog/newPush anchor: 最新博文    
  outlink: toUrl: http://blog.tianya.cn/blog/mingbo anchor: 天涯名博    
  outlink: toUrl: http://blog.tianya.cn/blog/daren anchor: 博客达人    
  outlink: toUrl: http://www.tianya.cn/mobile anchor:    
  outlink: toUrl: http://bbs.tianya.cn/post-1018-1157-1.shtml anchor: 天涯“2013年度十大深度影响力博客”名单    
  outlink: toUrl: http://jingyibaobei.blog.tianya.cn/ anchor: 烟花少爷    
  outlink: toUrl: http://lljjasmine.blog.tianya.cn/ anchor: 寻梦的冰蝶

。。。。

Viewing parse_text:

parse_text holds the parsed text of each URL.

[email protected]:~/data/nutch/release-1.6/runtime/local$ bin/nutch readseg -dump ./data/segments/20140111162237  ./data/crawl_tianya_seg_text -nofetch -nogenerate -nocontent -noparse -noparsedata      
SegmentReader: dump segment: data/segments/20140111162237    
SegmentReader: done    
[email protected]:~/data/nutch/release-1.6/runtime/local$ head -n 50 ./data/crawl_tianya_seg_text/dump

Recno:: 0    
URL:: http://blog.tianya.cn/

ParseText::    
天涯博客_有见识的人都在此 博客首页 社会民生 国际观察 娱乐 体育 文化 历史 生活 情感 财经 股市 美食 旅游 最新博文 天涯名博 博客达人 博客总排行 01 等待温暖的小狐狸 44887595 02 潘文伟 34654676 03 travelisliving 30676532 04 股市掘金 28472831 05 crystalkitty 26283927 06 yuwenyufen 24880887 07 水莫然 24681174 08 李泽辉 22691445 09 钟巍巍 19226129 10 别境 17752691 11 微笑的说我很幸 15912882 12 尤宇 15530802 13 sundaes 14961321 14 郑渝川 14219498 15 黑花黄 13174656 博文排行 01 任志强戳穿“央视十宗罪”都 02 野云先生:钱眼里的文化(5 03 是美女博士征男友还是媒体博 04 黄牛永远走在时代的最前沿 05 “与女优度春宵”怎成员工年 06 如何看待对张艺谋罚款748万 07 女保姆酒后色诱我上床被妻撞 08 年过不惑的男人为何对婚姻也 09 明代变态官员囚多名尼姑做性 10 女人不肯承认的20个秘密 社会排行 国际排行 01 风青杨:章子怡“七亿陪睡案 02 潘金云和她的脑瘫孩子们。 03 人民大学前校长纪宝成腐败之 04 小学语文课本配图错误不是小 05 闲聊“北京地铁要涨价” 06 “高压”整治火患之后,还该 07 警惕父母误导孩子的十种不良 08 一代名伶红线女为什么如此红 09 黎明:应明令禁止官员技侦发 10 官二代富二代的好运气不能独 01 【环球热点】如此奢华—看了 02 阿基诺的民,阿基诺的心,阿 03 “中国向菲律宾捐款10万美元 04 美国法律界:对青少年犯罪的 05 一语中的:诺贝尔奖得主锐评 06 300万元保证金骗到武汉公司1 07 乱而取之的智慧 08 中国连宣泄愤怒都有人“代表 09 世界啊,请醒醒吧,都被美元 10 反腐利器呼之欲出,贪腐官员 娱乐排行 体育排行 01 2013网络票选新宅男女神榜单 02 从《千金归来》看中国电视剧 03 汪峰自称是好爸爸时大家都笑 04 黄圣依称杨子是靠山打了谁的 05 汪峰连锁型劣迹被爆遭六六嘲 06 舒淇深V礼服到肚脐令人窒息 07 张柏芝交老外新欢照曝光(图 08 吴奇隆公开恋情众网友送祝福 09 独家:赵本山爱女妞妞练功美 10 “帮汪峰上头条”背后的注意 01 道歉信和危机公关 02 【环球热点】鸟人(视频) 03 曼联宣布维迪奇已出院 04 哈登,别让假摔毁了形象。。。。。。

All of the subdirectories can also be dumped into a single file and browsed there:

[email protected]:~/data/nutch/release-1.6/runtime/local$ bin/nutch readseg -dump ./data/segments/20140111162237  ./data/segments/20140111162237_dump    
SegmentReader: dump segment: data/segments/20140111162237    
SegmentReader: done    
[email protected]:~/data/nutch/release-1.6/runtime/local$ less ./data/segments/20140111162237_dump/dump

3. Use the -list and -get options to show per-segment statistics:

[email protected]:~/data/nutch/release-1.6/runtime/local$ bin/nutch readseg -list -dir data/segments    
NAME GENERATED FETCHER START FETCHER END FETCHED PARSED    
20140111162237 1 2014-01-11T16:22:46 2014-01-11T16:22:46 1 1    
20140111162320 57 2014-01-11T16:23:27 2014-01-11T16:24:43 58 19    
20140111162513 135 2014-01-11T16:25:21 2014-01-11T16:30:09 140 102

[email protected]:~/data/nutch/release-1.6/runtime/local$ bin/nutch readseg -list  data/segments/20140111162320/      
NAME        GENERATED    FETCHER START        FETCHER END        FETCHED    PARSED    
20140111162320    57        2014-01-11T16:23:27    2014-01-11T16:24:43    58    19

[email protected]:~/data/nutch/release-1.6/runtime/local$ bin/nutch readseg -get  data/segments/20140111162513 http://100w.tianya.cn/    
SegmentReader: get ‘http://100w.tianya.cn/‘    
Crawl Parse::    
Version: 7    
Status: 67 (linked)    
Fetch time: Sat Jan 11 16:30:18 CST 2014    
Modified time: Thu Jan 01 08:00:00 CST 1970    
Retries since fetch: 0    
Retry interval: 2592000 seconds (30 days)    
Score: 1.8224896E-6    
Signature: null    
Metadata:

4. Use readdb with the -topN option to list URLs sorted by score.

(1) Here the conditions are: the top 10 entries, with a score greater than 1.

[email protected]:~/data/nutch/release-1.6/runtime/local$ ./bin/nutch readdb ./data/crawldb -topN 10 ./data/crawldb_topN 1    
CrawlDb topN: starting (topN=10, min=1.0)    
CrawlDb db: ./data/crawldb    
CrawlDb topN: collecting topN scores.    
CrawlDb topN: done    
[email protected]:~/data/nutch/release-1.6/runtime/local$ cat ./data/crawldb_topN/part-00000    
1.0140933    http://blog.tianya.cn/    
[email protected]:~/data/nutch/release-1.6/runtime/local$

(2) The top 10 entries, with no score threshold:

[email protected]:~/data/nutch/release-1.6/runtime/local$ ./bin/nutch readdb ./data/crawldb -topN 10 ./data/crawldb_topN_all_score    
CrawlDb topN: starting (topN=10, min=0.0)    
CrawlDb db: ./data/crawldb    
CrawlDb topN: collecting topN scores.    
CrawlDb topN: done    
[email protected]:~/data/nutch/release-1.6/runtime/local$ cat ./data/crawldb_topN_all_score/part-00000    
1.0140933    http://blog.tianya.cn/    
0.046008706    http://blog.tianya.cn/blog/society    
0.046008706    http://blog.tianya.cn/blog/international    
0.030586869    http://blog.tianya.cn/blog/mingbo    
0.030586869    http://blog.tianya.cn/blog/daren    
0.030330064    http://www.tianya.cn/mobile    
0.029951613    http://blog.tianya.cn/blog/culture    
0.029951613    http://blog.tianya.cn/blog/history    
0.029951613    http://blog.tianya.cn/blog/life    
0.029951613    http://blog.tianya.cn/blog/stock    
[email protected]:~/data/nutch/release-1.6/runtime/local$

---------------------------

That wraps up these notes on the Nutch crawl process.

References:

http://yangshangchuan.iteye.com/category/275433

http://www.oschina.net/translate/nutch-tutorial  Nutch Tutorial

http://wenku.baidu.com/view/866583e90975f46527d3e1f3.html  Nutch getting-started tutorial (PDF)
