Translated from:
http://wiki.apache.org/nutch/Crawl
Introduction
Note: the script does not use Nutch's crawl command (bin/nutch crawl or the "Crawl" class), so URL filtering does not rely on "conf/crawl-urlfilter.txt"; configure your URL filters in "regex-urlfilter.txt" instead.
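As a minimal sketch (the domain is only a placeholder), the relevant part of conf/regex-urlfilter.txt could look like this: URLs matching a '+' pattern are accepted, URLs matching a '-' pattern are rejected.

# accept URLs under example.com only (placeholder domain)
+^http://([a-z0-9]*\.)*example.com/
# reject everything else
-.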
Steps
The script consists of roughly 8 steps (a condensed sketch of the corresponding commands follows the list):
- Inject URLs
- Generate, Fetch, Parse, Update loop (generate a fetch list, fetch, parse the fetched pages, update the databases)
- Merge Segments
- Invert Links (build the link database for the fetched pages)
- Index
- Dedup (remove duplicate documents)
- Merge Indexes
- Load new indexes (restart Tomcat so it picks up the new index directory)
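For orientation, here is a condensed sketch of the bin/nutch commands behind these steps, copied from the script below and flattened to a single generate/fetch/update iteration (it assumes you run from the Nutch directory and keep the seed list in a 'urls' directory, as the script does):

bin/nutch inject crawl/crawldb urls                          # 1. inject seed URLs
bin/nutch generate crawl/crawldb crawl/segments -topN 15     # 2. generate a fetch list
segment=`ls -d crawl/segments/* | tail -1`                   #    newest segment
bin/nutch fetch $segment -threads 5                          #    fetch and parse it
bin/nutch updatedb crawl/crawldb $segment                    #    update the crawl DB
bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*    # 3. merge segments
bin/nutch invertlinks crawl/linkdb crawl/segments/*          # 4. invert links
bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb crawl/segments/*   # 5. index
bin/nutch dedup crawl/NEWindexes                             # 6. dedup
bin/nutch merge crawl/NEWindex crawl/NEWindexes              # 7. merge indexes
# 8. stop Tomcat, move crawl/NEWindex to crawl/index, start Tomcat again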
Modes of Execution
The script can be run in two modes:
- Normal Mode
- Safe Mode
Normal Mode
Run it as 'bin/runbot'. In this mode, the temporary directories generated during the crawl are deleted once the crawl is over.
Note: this means that if the crawl is interrupted for some reason and the crawl DB is left incomplete, there is no way to recover.
Safe Mode
Run it as 'bin/runbot safe' to enable safe mode. In this mode the directories in use are not deleted; the important temporary directories are kept as backups (the BACKUP* directories created by the script). If something goes wrong, these backups can be used to recover.
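As an illustration only (this is not part of the script), a failed safe-mode crawl could be rolled back by hand from those backups, assuming the crawl got far enough for the BACKUP directories to exist:

# hypothetical manual rollback after a failed safe-mode crawl
rm -rf crawl/segments && mv crawl/BACKUPsegments crawl/segments
rm -rf crawl/index    && mv crawl/BACKUPindex    crawl/index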
Normal Mode vs. Safe Mode
Unless you are certain that nothing will go wrong, we recommend running in safe mode.
Tinkering
Set 'depth', 'threads', 'adddays' and 'topN' to suit your needs. If you do not want to limit the crawl with 'topN', comment out or delete that line.
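For instance, a deeper and more parallel crawl might change the variables at the top of the script like this (the numbers are purely illustrative):

depth=5        # run five generate/fetch/update iterations
threads=20     # fetcher threads
adddays=0      # do not shift the fetch-due check forward
# topN=1000    # uncomment to cap each fetch list at 1000 URLs; leave it out to fetch all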
NUTCH_HOME
If you are not executing the script as 'bin/runbot' from your Nutch directory, set 'NUTCH_HOME' in the script to the path of your Nutch installation:
if [ -z "$NUTCH_HOME" ] then NUTCH_HOME=.
P.S. If 'NUTCH_HOME' is already set as an environment variable, you can skip this.
CATALINA_HOME
'CATALINA_HOME' points to the Tomcat installation directory. It must be set in the script or as an environment variable, in the same way as 'NUTCH_HOME':
if [ -z "$CATALINA_HOME" ] then CATALINA_HOME=/opt/apache-tomcat-6.0.10
Can it re-crawl?
The author has used this script to re-crawl several times, but please test whether it suits your setup before relying on it. If re-crawling does not work well for you, please let us know.
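If you want periodic re-crawls, one possible setup (not from the original page) is a cron entry like the following; the schedule, paths, and log file are just placeholders:

# re-crawl every night at 02:00 in safe mode and append the output to a log
0 2 * * * cd /opt/nutch && bin/runbot safe >> /var/log/runbot.log 2>&1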
Script
# runbot script to run the Nutch bot for crawling and re-crawling.
# Usage: bin/runbot [safe]
#        If executed in 'safe' mode, it doesn't delete the temporary
#        directories generated during crawl. This might be helpful for
#        analysis and recovery in case a crawl fails.
#
# Author: Susam Pal

depth=2
threads=5
adddays=5
topN=15 # Comment this statement if you don't want to set topN value

# Arguments for rm and mv
RMARGS="-rf"
MVARGS="--verbose"

# Parse arguments
if [ "$1" == "safe" ]         # which mode the script runs in
then
  safe=yes
fi

if [ -z "$NUTCH_HOME" ]       # check whether NUTCH_HOME is set
then
  NUTCH_HOME=.
  echo runbot: $0 could not find environment variable NUTCH_HOME
  echo runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script
else
  echo runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME
fi

if [ -z "$CATALINA_HOME" ]    # check whether the Tomcat path is set
then
  CATALINA_HOME=/opt/apache-tomcat-6.0.10
  echo runbot: $0 could not find environment variable CATALINA_HOME
  echo runbot: CATALINA_HOME=$CATALINA_HOME has been set by the script
else
  echo runbot: $0 found environment variable CATALINA_HOME=$CATALINA_HOME
fi

if [ -n "$topN" ]             # build the -topN argument
then
  topN="-topN $topN"
else
  topN=""
fi

steps=8
echo "----- Inject (Step 1 of $steps) -----"   # inject seed URLs
$NUTCH_HOME/bin/nutch inject crawl/crawldb urls

echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"   # crawl loop
for((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN -adddays $adddays
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
    break
  fi
  segment=`ls -d crawl/segments/* | tail -1`

  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "runbot: fetch $segment at depth `expr $i + 1` failed."
    echo "runbot: Deleting segment $segment."
    rm $RMARGS $segment
    continue
  fi

  $NUTCH_HOME/bin/nutch updatedb crawl/crawldb $segment
done

echo "----- Merge Segments (Step 3 of $steps) -----"   # merge segments
$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
if [ "$safe" != "yes" ]
then
  rm $RMARGS crawl/segments
else
  rm $RMARGS crawl/BACKUPsegments
  mv $MVARGS crawl/segments crawl/BACKUPsegments
fi

mv $MVARGS crawl/MERGEDsegments crawl/segments

echo "----- Invert Links (Step 4 of $steps) -----"     # build the link database
$NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*

echo "----- Index (Step 5 of $steps) -----"            # build the index
$NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb crawl/segments/*

echo "----- Dedup (Step 6 of $steps) -----"            # remove duplicates
$NUTCH_HOME/bin/nutch dedup crawl/NEWindexes

echo "----- Merge Indexes (Step 7 of $steps) -----"    # merge indexes
$NUTCH_HOME/bin/nutch merge crawl/NEWindex crawl/NEWindexes

echo "----- Loading New Index (Step 8 of $steps) -----"   # restart Tomcat so it reloads the index
${CATALINA_HOME}/bin/shutdown.sh

if [ "$safe" != "yes" ]
then
  rm $RMARGS crawl/NEWindexes
  rm $RMARGS crawl/index
else
  rm $RMARGS crawl/BACKUPindexes
  rm $RMARGS crawl/BACKUPindex
  mv $MVARGS crawl/NEWindexes crawl/BACKUPindexes
  mv $MVARGS crawl/index crawl/BACKUPindex
fi

mv $MVARGS crawl/NEWindex crawl/index

${CATALINA_HOME}/bin/startup.sh

echo "runbot: FINISHED: Crawl completed!"
echo ""