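I had been following this framework for a long time, but only recently found the time to look at it closely. I am using Scrapy 0.24 here.
Let's start with a finished product to get a feel for how convenient the framework is. Once I have had time to organize my thoughts, I will post what I have learned about it to this blog, one piece at a time.
First, the purpose of this toy crawler:
It crawls the groups listed on the seed URL pages and, for each group, extracts the links to related groups, the number of members, the group name, and similar information.
The resulting data looks roughly like this: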
{'RelativeGroups': [u'http://www.douban.com/group/10127/',
                    u'http://www.douban.com/group/seventy/',
                    u'http://www.douban.com/group/lovemuseum/',
                    u'http://www.douban.com/group/486087/',
                    u'http://www.douban.com/group/lovesh/',
                    u'http://www.douban.com/group/NoAstrology/',
                    u'http://www.douban.com/group/shanghaijianzhi/',
                    u'http://www.douban.com/group/12658/',
                    u'http://www.douban.com/group/shanghaizufang/',
                    u'http://www.douban.com/group/gogo/',
                    u'http://www.douban.com/group/117546/',
                    u'http://www.douban.com/group/159755/'],
 'groupName': u'\u4e0a\u6d77\u8c46\u74e3',
 'groupURL': 'http://www.douban.com/group/Shanghai/',
 'totalNumber': u'209957'}
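What is it good for? With this data you can already analyze things such as how closely groups are related to one another, and with a little more effort you could scrape more information. I will not go into that here; this post is only meant to give a quick feel for the framework.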
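As a rough illustration of the kind of analysis mentioned above, here is a minimal sketch that works on records shaped like the sample output. The function names, the sample records, and the `http://www.douban.com/group/example/` URL are made up for this example; they are not part of the crawler.

```python
from collections import Counter


def count_inbound_links(items):
    """Count how many scraped groups list each URL as a related group."""
    inbound = Counter()
    for item in items:
        # set() guards against a group listing the same neighbour twice
        for url in set(item["RelativeGroups"]):
            inbound[url] += 1
    return inbound


def shared_neighbours(item_a, item_b):
    """Return the related-group URLs that two groups have in common."""
    return set(item_a["RelativeGroups"]) & set(item_b["RelativeGroups"])


# Hypothetical records, shaped like the crawler output shown earlier
sample_items = [
    {
        "groupURL": "http://www.douban.com/group/Shanghai/",
        "RelativeGroups": [
            "http://www.douban.com/group/lovesh/",
            "http://www.douban.com/group/gogo/",
        ],
    },
    {
        "groupURL": "http://www.douban.com/group/example/",
        "RelativeGroups": ["http://www.douban.com/group/gogo/"],
    },
]

inbound = count_inbound_links(sample_items)
common = shared_neighbours(sample_items[0], sample_items[1])
```

A group that many other groups point to, or two groups with many shared neighbours, is one simple signal of relatedness. Real analysis would probably read the records back from the database rather than from a literal list.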
The first step is to start a new project named douban:
# scrapy startproject douban
# cd douban
This is the directory layout of the finished project:
~/student/py/douban$ tree
.
├── douban
│   ├── __init__.py
│   ├── __init__.pyc
│   ├── items.py
│   ├── items.pyc
│   ├── pipelines.py
│   ├── pipelines.pyc
│   ├── settings.py
│   ├── settings.pyc
│   └── spiders
│       ├── BasicGroupSpider.py
│       ├── BasicGroupSpider.pyc
│       ├── __init__.py
│       └── __init__.pyc
├── nohup.out
├── scrapy.cfg
├── start.sh
├── stop.sh
└── test.log
Write the entity class in items.py. Its main purpose is to make the scraped data easy to persist.
~/student/py/douban$ cat douban/items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field


class DoubanItem(Item):
    # define the fields for your item here like:
    # name = Field()
    groupName = Field()
    groupURL = Field()
    totalNumber = Field()
    RelativeGroups = Field()
    ActiveUesrs = Field()
Write the spider and define some custom rules for processing the data.
~/student/py/douban$ cat douban/spiders/BasicGroupSpider.py
# -*- coding: utf-8 -*-
import re

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item

from douban.items import DoubanItem


class GroupSpider(CrawlSpider):
    # Spider name
    name = "Group"
    allowed_domains = ["douban.com"]
    # Seed URLs
    start_urls = [
        "http://www.douban.com/group/explore?tag=%E8%B4%AD%E7%89%A9",
        "http://www.douban.com/group/explore?tag=%E7%94%9F%E6%B4%BB",
        "http://www.douban.com/group/explore?tag=%E7%A4%BE%E4%BC%9A",
        "http://www.douban.com/group/explore?tag=%E8%89%BA%E6%9C%AF",
        "http://www.douban.com/group/explore?tag=%E5%AD%A6%E6%9C%AF",
        "http://www.douban.com/group/explore?tag=%E6%83%85%E6%84%9F",
        "http://www.douban.com/group/explore?tag=%E9%97%B2%E8%81%8A",
        "http://www.douban.com/group/explore?tag=%E5%85%B4%E8%B6%A3"
    ]

    # Rules: a link that matches is handled by the function named in callback
    rules = [
        # Group home pages: parse them into items
        Rule(SgmlLinkExtractor(allow=(r'/group/[^/]+/$', )),
             callback='parse_group_home_page', process_request='add_cookie'),
        # Tag listing pages: just follow them to find more groups
        Rule(SgmlLinkExtractor(allow=(r'/group/explore\?tag', )),
             follow=True, process_request='add_cookie'),
    ]

    def __get_id_from_group_url(self, url):
        m = re.search(r"^http://www.douban.com/group/([^/]+)/$", url)
        if m:
            return m.group(1)
        else:
            return 0

    def add_cookie(self, request):
        # replace() returns a new request instead of modifying this one,
        # so its result has to be returned
        return request.replace(cookies=[])

    def parse_group_topic_list(self, response):
        self.log("Fetch group topic list page: %s" % response.url)
        pass

    def parse_group_home_page(self, response):
        self.log("Fetch group home page: %s" % response.url)

        # This uses a selector based on XPath
        hxs = HtmlXPathSelector(response)
        item = DoubanItem()

        # get group name
        item['groupName'] = hxs.select('//h1/text()').re(r"^\s+(.*)\s+$")[0]

        # get group id
        item['groupURL'] = response.url
        groupid = self.__get_id_from_group_url(response.url)

        # get group members number
        members_url = "http://www.douban.com/group/%s/members" % groupid
        members_text = hxs.select('//a[contains(@href, "%s")]/text()' % members_url).re(r"\((\d+)\)")
        item['totalNumber'] = members_text[0]

        # get relative groups
        item['RelativeGroups'] = []
        groups = hxs.select('//div[contains(@class, "group-list-item")]')
        for group in groups:
            url = group.select('div[contains(@class, "title")]/a/@href').extract()[0]
            item['RelativeGroups'].append(url)

        return item
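The two regular expressions the spider relies on can be tried outside Scrapy. The sketch below is a standalone illustration, not the spider's own code: the helper names are invented, the link text "Members (209957)" is an assumed example rather than the real page text, and unlike the spider it returns None (not 0) when nothing matches.

```python
import re

# Same patterns as in the spider, with the dots in the host name escaped
GROUP_URL_RE = re.compile(r"^http://www\.douban\.com/group/([^/]+)/$")
MEMBER_COUNT_RE = re.compile(r"\((\d+)\)")


def group_id_from_url(url):
    """Return the group id part of a group home page URL, or None."""
    m = GROUP_URL_RE.search(url)
    return m.group(1) if m else None


def member_count_from_text(text):
    """Return the number found inside parentheses as an int, or None."""
    m = MEMBER_COUNT_RE.search(text)
    return int(m.group(1)) if m else None


group_id = group_id_from_url("http://www.douban.com/group/Shanghai/")
not_a_group = group_id_from_url("http://www.douban.com/group/explore?tag=x")
# "Members (209957)" is a made-up link text used only for illustration
count = member_count_from_text("Members (209957)")
```

Returning None makes a failed match easy to tell apart from a group whose id happens to be 0; the spider could adopt the same convention, but that is a design choice rather than something the original code does.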
Write the item pipeline. At this stage I store the data collected by the spider in MongoDB.
~/student/py/douban$ cat douban/pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo

from scrapy import log
from scrapy.conf import settings
from scrapy.exceptions import DropItem


class DoubanPipeline(object):
    def __init__(self):
        self.server = settings['MONGODB_SERVER']
        self.port = settings['MONGODB_PORT']
        self.db = settings['MONGODB_DB']
        self.col = settings['MONGODB_COLLECTION']
        connection = pymongo.Connection(self.server, self.port)
        db = connection[self.db]
        self.collection = db[self.col]

    def process_item(self, item, spider):
        self.collection.insert(dict(item))
        log.msg('Item written to MongoDB database %s/%s' % (self.db, self.col),
                level=log.DEBUG, spider=spider)
        return item
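In the settings module, configure the item pipeline to use, the MongoDB connection parameters, and a user agent so the crawler is less likely to be blocked.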
~/student/py/douban$ cat douban/settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for douban project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#

BOT_NAME = 'douban'

SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'

# Wait between requests to ease the load on the server and to be less conspicuous
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
COOKIES_ENABLED = True

# Item pipelines to use
ITEM_PIPELINES = ['douban.pipelines.DoubanPipeline']

MONGODB_SERVER = 'localhost'
MONGODB_PORT = 27017
MONGODB_DB = 'douban'
MONGODB_COLLECTION = 'doubanGroup'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douban (+http://www.yourdomain.com)'
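According to the Scrapy documentation, when RANDOMIZE_DOWNLOAD_DELAY is enabled the wait between requests to the same site is a random value between 0.5 and 1.5 times DOWNLOAD_DELAY. The sketch below only illustrates that range; `randomized_delay` is a made-up helper, not Scrapy's internal implementation.

```python
import random


def randomized_delay(download_delay, rng=random):
    """Pick a wait time between 0.5x and 1.5x the configured delay."""
    return rng.uniform(0.5 * download_delay, 1.5 * download_delay)


# With DOWNLOAD_DELAY = 2, every wait falls between 1 and 3 seconds
delays = [randomized_delay(2) for _ in range(5)]
```

The varying interval makes the request pattern look less mechanical than a fixed 2-second gap, while the average rate stays about the same.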
OK, that is all it takes to put together a simple toy crawler.
Start it with this command:
nohup scrapy crawl Group --logfile=test.log &
Time: 2024-11-04 13:41:08