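The main function of Crawl contains this statement:
// initialize crawlDb
injector.inject(crawlDb, rootUrlDir);
Quoting [李阳]: the inject operation calls the Injector class in the crawl package, one of Nutch's core packages.
The main jobs of the inject operation:
1. Normalize and filter the URL set, discard invalid URLs, set each URL's status (UNFETCHED), and initialize its score by a given method;
2. Merge the URLs, eliminating duplicate URL entries;
3. Store each URL with its status and score in the crawldb database; where a URL duplicates one already in the database, the old entry is deleted and replaced with the new one.
Result of the inject operation: the contents of the crawldb database are updated, including the URLs and their statuses.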
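The three steps listed above can be sketched in plain Java without Hadoop. This is only an illustration of the description, not Nutch's code: the class InjectSketch, the Entry type (a stand-in for CrawlDatum), the http/https filter rule, and the in-memory map standing in for crawldb are all assumptions made for the example. The "new replaces old" merge follows the description quoted above; the reducer in a given Nutch version may resolve duplicates differently.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InjectSketch {

    /** Hypothetical stand-in for CrawlDatum: just a status and a score. */
    public static final class Entry {
        public final String status;
        public final float score;

        public Entry(String status, float score) {
            this.status = status;
            this.score = score;
        }
    }

    /** Step 1: normalize a raw URL; return null if it should be filtered out. */
    static String normalize(String raw) {
        try {
            URI uri = new URI(raw.trim()).normalize();
            String scheme = uri.getScheme();
            // Assumed filter rule: keep only absolute http/https URLs with a host.
            if (scheme == null || uri.getHost() == null) {
                return null;
            }
            if (!scheme.equalsIgnoreCase("http") && !scheme.equalsIgnoreCase("https")) {
                return null;
            }
            return uri.toString();
        } catch (URISyntaxException e) {
            return null; // syntactically invalid URL
        }
    }

    /** Runs the three steps and returns the updated database as a new map. */
    public static Map<String, Entry> inject(Map<String, Entry> crawlDb,
                                            List<String> seeds,
                                            float initialScore) {
        // Steps 1 and 2: a map keyed by URL collapses duplicate seeds.
        Map<String, Entry> injected = new LinkedHashMap<>();
        for (String raw : seeds) {
            String url = normalize(raw);
            if (url != null) {
                injected.put(url, new Entry("UNFETCHED", initialScore));
            }
        }
        // Step 3: merge into the existing database; an injected entry
        // replaces an existing entry for the same URL, as the text describes.
        Map<String, Entry> merged = new LinkedHashMap<>(crawlDb);
        merged.putAll(injected);
        return merged;
    }
}
```

Keying both maps by the normalized URL is what makes duplicate removal and the final merge fall out of ordinary map operations, which mirrors how the real job keys its records by URL.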
Let's look at the function that inject calls:
public void inject(Path crawlDb, Path urlDir) throws IOException {
  // Create a temporary directory with a random name
  Path tempDir = new Path(getConf().get("mapred.temp.dir", ".")
      + "/inject-temp-"
      + Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
  // map text input file to a <url,CrawlDatum> file
  // i.e. produce a file of <url,CrawlDatum> key-value pairs
  JobConf sortJob = new NutchJob(getConf());
  sortJob.setJobName("inject " + urlDir);
  FileInputFormat.addInputPath(sortJob, urlDir);
  sortJob.setMapperClass(InjectMapper.class);
  FileOutputFormat.setOutputPath(sortJob, tempDir);
  sortJob.setOutputFormat(SequenceFileOutputFormat.class);
  sortJob.setOutputKeyClass(Text.class);
  sortJob.setOutputValueClass(CrawlDatum.class);
  sortJob.setLong("injector.current.time", System.currentTimeMillis());
  JobClient.runJob(sortJob);
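This part uses Hadoop's API. The input directory is the URL directory specified by the user, and the output directory is the temporary directory created above. SequenceFileOutputFormat is explained in "Hadoop: The Definitive Guide" as follows: Imagine a logfile, where each log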