[Incomplete] Downloading web pages step by step with Nutch commands

This article is incomplete. Whether pages can really be downloaded step by step with Nutch in this way has not been verified.

1. Basic setup: building the environment

(1) Download the installation package and extract it to /usr/search/apache-nutch-2.2.1/
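The original does not show the download commands. A minimal sketch, assuming the 2.2.1 source tarball from the Apache archive (the URL is an assumption; use whichever mirror you prefer):

mkdir -p /usr/search && cd /usr/search

wget https://archive.apache.org/dist/nutch/2.2.1/apache-nutch-2.2.1-src.tar.gz

tar -zxvf apache-nutch-2.2.1-src.tar.gz   # extracts to /usr/search/apache-nutch-2.2.1/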

(2) Build the runtime

cd /usr/search/apache-nutch-2.2.1/

ant runtime
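If the build succeeds, the runtime directory should contain two flavors; this is the standard Nutch 2.x layout (stated here as an assumption, not output captured for this article):

ls /usr/search/apache-nutch-2.2.1/runtime

# deploy  local    <- the rest of this article uses the "local" (standalone) runtime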

(3) Verify that the Nutch installation works

[[email protected] apache-nutch-2.2.1]# cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/

[[email protected] bin]# ./nutch

Usage: nutch COMMAND

where COMMAND is one of:

inject         inject new urls into the database

hostinject     creates or updates an existing host table from a text file

generate       generate new batches to fetch from crawl db

fetch          fetch URLs marked during generate

parse          parse URLs marked during fetch

updatedb       update web table after parsing

updatehostdb   update host table after parsing

readdb         read/dump records from page database

readhostdb     display entries from the hostDB

elasticindex   run the elasticsearch indexer

solrindex      run the solr indexer on parsed batches

solrdedup      remove duplicates from solr

parsechecker   check the parser for a given url

indexchecker   check the indexing filters for a given url

plugin         load a plugin and run one of its classes main()

nutchserver    run a (local) Nutch server on a user defined port

junit          runs the given JUnit test

or

CLASSNAME      run the class named CLASSNAME

Most commands print help when invoked w/o parameters.

(4) Edit /usr/search/apache-nutch-2.2.1/runtime/local/conf/nutch-site.xml and add the crawler's agent name

<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
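Note that this property must sit inside a <configuration> element; a minimal complete nutch-site.xml would look roughly like this (sketch, the agent value is arbitrary):

<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
</configuration>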

(5) Create seed.txt

cd /usr/search/apache-nutch-2.2.1/runtime/local/bin/

vi seed.txt

http://nutch.apache.org/
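Note that the InjectorJob in section 2 below is run against a directory named urls rather than against seed.txt directly. A sketch that matches that invocation (the directory name is carried over from the inject command, not from this step):

cd /usr/search/apache-nutch-2.2.1/runtime/local

mkdir -p urls

echo "http://nutch.apache.org/" > urls/seed.txt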

(6) Edit the URL filter

vi /usr/search/apache-nutch-2.2.1/conf/regex-urlfilter.txt

# accept anything else

+.

Change it to:

# accept anything else

+^http://([a-z0-9]*\.)*nutch.apache.org/
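Strictly speaking, the unescaped dots in this pattern match any character; an optional, slightly stricter variant would be:

+^http://([a-z0-9]*\.)*nutch\.apache\.org/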

When a user invokes a crawling command in Apache Nutch 1.x, Nutch generates a CrawlDB, which is simply a directory containing details about the crawl. In Apache Nutch 2.x there is no CrawlDB; instead, Nutch keeps all the crawling data directly in a database. In our case we use Apache HBase, so all crawling data goes into Apache HBase.
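Note that the InjectorJob output in the next section reports org.apache.gora.memory.store.MemStore, i.e. the in-memory store, not HBase. To actually persist the crawl data in HBase, the Gora backend would typically be switched in nutch-site.xml and conf/gora.properties, roughly as follows (standard Nutch 2.x/Gora settings, stated here as an assumption rather than taken from the original article):

<!-- nutch-site.xml: select the HBase-backed Gora store -->
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
</property>

# conf/gora.properties: default Gora datastore
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore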

2. InjectorJob

[[email protected] local]# ./bin/nutch inject urls

InjectorJob: starting at 2014-07-07 14:15:21

InjectorJob: Injecting urlDir: urls

InjectorJob: Using class org.apache.gora.memory.store.MemStore as the Gora storage class.

InjectorJob: total number of urls rejected by filters: 0

InjectorJob: total number of urls injected after normalization and filtering: 2

Injector: finished at 2014-07-07 14:15:24, elapsed: 00:00:03
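To check what actually landed in the web table, the readdb command from the list above can be used. The options below follow the Nutch 2.x WebTableReader usage (quoted from memory; run ./bin/nutch readdb without arguments to confirm):

./bin/nutch readdb -stats          # per-status counts for the injected/fetched pages

./bin/nutch readdb -dump ./webdb   # dump page records into the ./webdb directory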

3. GeneratorJob

[[email protected] local]# ./bin/nutch generate

Usage: GeneratorJob [-topN N] [-crawlId id] [-noFilter] [-noNorm] [-adddays numDays]

-topN <N> - number of top URLs to be selected, default is Long.MAX_VALUE

-crawlId <id> - the id to prefix the schemas to operate on,

(default: storage.crawl.id)");

-noFilter - do not activate the filter plugin to filter the url, default is true

-noNorm - do not activate the normalizer plugin to normalize the url, default is true

-adddays - Adds numDays to the current time to facilitate crawling urls already

fetched sooner then db.fetch.interval.default. Default value is 0.

-batchId - the batch id

----------------------

Please set the params.

[[email protected] local]# ./bin/nutch generate -topN 3

GeneratorJob: starting at 2014-07-07 14:22:55

GeneratorJob: Selecting best-scoring urls due for fetch.

GeneratorJob: starting

GeneratorJob: filtering: true

GeneratorJob: normalizing: true

GeneratorJob: topN: 3

GeneratorJob: finished at 2014-07-07 14:22:58, time elapsed: 00:00:03

GeneratorJob: generated batch id: 1404714175-1017128204
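The generated batch id can be passed to the fetch/parse/updatedb steps instead of -all. A small shell sketch for capturing it (a hypothetical helper, not part of Nutch; the log line may go to stderr, hence the redirect):

batchId=$(./bin/nutch generate -topN 3 2>&1 | grep "generated batch id" | awk '{print $NF}')

echo $batchId                 # e.g. 1404714175-1017128204

./bin/nutch fetch $batchId    # fetch only that batch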

4. FetcherJob

The job of the fetcher is to fetch the URLs generated by the GeneratorJob, using the input the GeneratorJob produced. The following command is used for the FetcherJob:

[[email protected] local]# bin/nutch fetch -all

FetcherJob: starting

FetcherJob: batchId: -all

Fetcher: Your 'http.agent.name' value should be listed first in 'http.robots.agents' property.

FetcherJob: threads: 10

FetcherJob: parsing: false

FetcherJob: resuming: false

FetcherJob : timelimit set for : -1

Using queue mode : byHost

Fetcher: threads: 10

QueueFeeder finished: total 0 records. Hit by time limit :0

-finishing thread FetcherThread0, activeThreads=0

-finishing thread FetcherThread1, activeThreads=0

-finishing thread FetcherThread2, activeThreads=0

-finishing thread FetcherThread3, activeThreads=0

-finishing thread FetcherThread4, activeThreads=0

-finishing thread FetcherThread5, activeThreads=0

-finishing thread FetcherThread6, activeThreads=0

-finishing thread FetcherThread7, activeThreads=1

-finishing thread FetcherThread8, activeThreads=0

Fetcher: throughput threshold: -1

Fetcher: throughput threshold sequence: 5

-finishing thread FetcherThread9, activeThreads=0

0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues

-activeThreads=0

FetcherJob: done

Here I have provided the -all input parameter, which means this job will fetch all the URLs generated by the GeneratorJob. You can use different input parameters according to your needs.
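Besides the batch id, the FetcherJob accepts a few options such as the number of threads. Based on the Nutch 2.x FetcherJob usage (from memory; verify by running bin/nutch fetch with no arguments):

./bin/nutch fetch -all -threads 10          # fetch every pending batch with 10 fetcher threads

./bin/nutch fetch 1404714175-1017128204     # or fetch only the batch printed by generate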

5. ParserJob

After the FetcherJob, the ParserJob parses the URLs that were fetched by the FetcherJob. The following command is used for the ParserJob:

[[email protected] local]# bin/nutch parse -all

ParserJob: starting

ParserJob: resuming: false

ParserJob: forced reparse: false

ParserJob: batchId: -all

ParserJob: success

[[email protected] local]#

I have used the -all input parameter, which will parse all the URLs fetched by the FetcherJob. You can use different input parameters according to your needs.
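If a particular page fails to parse, the parsechecker command from the command list at the top can be used to debug a single URL without touching the database:

./bin/nutch parsechecker http://nutch.apache.org/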

6. DbUpdaterJob

[[email protected] local]# ./bin/nutch updatedb
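The original article stops here, before showing the DbUpdaterJob output. For orientation, a sketch of how the crawl cycle would typically continue (the command names come from the command list at the top; the Solr URL is an assumption):

./bin/nutch generate -topN 10     # next round: select newly discovered URLs

./bin/nutch fetch -all

./bin/nutch parse -all

./bin/nutch updatedb

./bin/nutch solrindex http://localhost:8983/solr/ -all    # index parsed pages into an assumed local Solr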
