Crawling a movie website with Nutch 2.3

The previous article covered installing Nutch.

This article walks through a simple crawl of the site http://www.6vhao.com

1. Change into the directory nutch-2.3/runtime/local

2. mkdir urls

Run nano urls/url and add the seed link:

http://www.6vhao.com

Then save and exit.
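The same seed file can be created without an editor. This is a minimal sketch that assumes the current directory is nutch-2.3/runtime/local; the variable name SEED_URL is only for illustration.

```shell
#!/bin/sh
# Assumption: run from nutch-2.3/runtime/local.
SEED_URL="http://www.6vhao.com"   # the seed link used in this article

# Create the seed directory (no error if it already exists).
mkdir -p urls

# Write one URL per line into the seed file urls/url.
printf '%s\n' "$SEED_URL" > urls/url

# Show what was written.
cat urls/url
```

Further seed URLs can be added to the same file, one per line.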

3. In the local directory, run

./bin/nutch with no arguments to list all available commands:

 inject         inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 index          run the plugin-based indexer on parsed batches
 elasticindex   run the elasticsearch indexer - DEPRECATED use the index command instead
 solrindex      run the solr indexer on parsed batches - DEPRECATED use the index command instead
 solrdedup      remove duplicates from solr
 solrclean      remove HTTP 301 and 404 documents from solr - DEPRECATED use the clean command instead
 clean          remove HTTP 301 and 404 documents and duplicates from indexing backends configured via plugins
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 webapp         run a local Nutch web application
 junit          runs the given JUnit test
 or
 CLASSNAME      run the class named CLASSNAME

We start with the ./bin/crawl script, which runs the whole crawl cycle for us in one go.
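As a hedged sketch, an invocation could look like the following. The argument order follows the usage line the Nutch 2.3 script prints, crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>; running ./bin/crawl without arguments shows the usage for the installed version. The crawl ID "data" is inferred from the table name data_webpage seen later in this article, and the round count of 2 is an arbitrary assumption.

```shell
#!/bin/sh
# Assumption: run from nutch-2.3/runtime/local with storage already configured.
SEED_DIR="urls"    # directory holding the seed file created in step 2
CRAWL_ID="data"    # assumption: inferred from the table name data_webpage
ROUNDS=2           # assumption: number of generate/fetch/parse/updatedb rounds

CRAWL_CMD="./bin/crawl $SEED_DIR $CRAWL_ID $ROUNDS"

if [ -x ./bin/crawl ]; then
    # No Solr URL is passed, so this run does no indexing.
    $CRAWL_CMD
else
    # Outside a Nutch installation, only print the command.
    echo "bin/crawl not found; would run: $CRAWL_CMD"
fi
```

The script injects the seeds once and then repeats generate, fetch, parse and updatedb for the given number of rounds.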

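4. When the crawl finishes, change into the HBase directory.

Run ./bin/hbase shell to enter the HBase shell. The list command shows the current table, data_webpage; Nutch appended the _webpage suffix to the name.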
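The list here and the scan in the next step can also be run without an interactive session by piping commands into the shell. This is a sketch that assumes the current directory is the HBase installation directory; LIMIT => 1 is added only to keep the output short.

```shell
#!/bin/sh
# Assumption: run from the HBase installation directory.
TABLE="data_webpage"   # table name as reported by list in this article

# HBase shell commands: show all tables, then print one row of the table.
CMDS=$(printf "list\nscan '%s', {LIMIT => 1}\n" "$TABLE")

if [ -x ./bin/hbase ]; then
    # Feed the commands to the shell on standard input.
    printf '%s\n' "$CMDS" | ./bin/hbase shell
else
    # Outside an HBase installation, only print the commands.
    printf '%s\n' "$CMDS"
fi
```

Note that the table name must be wrapped in plain single quotes; typographic quotes are rejected by the shell.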

5. In the HBase shell, run scan 'data_webpage' to view the table contents. Some sample rows copied from the output:

 tv.66ys.www:http/zy/               column=f:ts, timestamp=1446050113914, value=\x00\x00\x01P\xAFM\xA9s                                    
 tv.66ys.www:http/zy/               column=il:http://www.66ys.tv/, timestamp=1446050113914, value=\xE7\xBB\xBC\xE8\x89\xBA                 
 tv.66ys.www:http/zy/               column=mk:dist, timestamp=1446050113914, value=2                                                       
 tv.66ys.www:http/zy/               column=mtdt:_csh_, timestamp=1446050113914, value=\x00\x00\x00\x00                                     
 tv.66ys.www:http/zy/               column=s:s, timestamp=1446050113914, value=\x00\x00\x00\x00
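The row keys in the sample are URLs with the host name reversed: http://www.66ys.tv/zy/ is stored as tv.66ys.www:http/zy/. The sketch below reproduces that layout for simple URLs. The function reverse_url is a hypothetical helper written for this article, not part of Nutch, and it ignores explicit port numbers.

```shell
#!/bin/sh
# reverse_url: hypothetical helper that mimics the row key layout seen above.
# Simplification: URLs with an explicit port are not handled.
reverse_url() {
    url=$1
    scheme=${url%%://*}        # e.g. http
    rest=${url#*://}           # e.g. www.66ys.tv/zy/
    host=${rest%%/*}           # e.g. www.66ys.tv
    case $rest in
        */*) path=/${rest#*/} ;;   # e.g. /zy/
        *)   path= ;;              # URL without a path
    esac

    # Reverse the dot-separated host labels: www.66ys.tv -> tv.66ys.www
    reversed=
    old_ifs=$IFS
    IFS=.
    for label in $host; do
        if [ -z "$reversed" ]; then
            reversed=$label
        else
            reversed=$label.$reversed
        fi
    done
    IFS=$old_ifs

    printf '%s:%s%s\n' "$reversed" "$scheme" "$path"
}

reverse_url "http://www.66ys.tv/zy/"   # prints tv.66ys.www:http/zy/
```

Because the key starts with the reversed domain, pages from the same site sort next to each other in the table.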

More on this next time.

Time: 2024-08-08 17:49:30
