1. Installation: when it comes to crawlers, we have to mention the big, all-in-one crawling framework: Scrapy. Scrapy is an application framework written to crawl websites and extract structured data. It can be used in data mining, information processing, storing historical data, and a whole range of other programs. Let's get straight to the point and look at the two ways to install the framework:
Option 1: installing on Windows takes the following steps
1. Download Twisted: http://www.lfd.uci.edu/~gohlke/pythonlibs/
2. pip3 install wheel
3. pip3 install Twisted-18.4.0-cp36-cp36m-win_amd64.whl  # pick the wheel matching your Python version
4. pip3 install pywin32
5. pip3 install scrapy
Option 2: installing on Linux (installation on macOS is the same)
pip3 install scrapy
2. Basic usage of Scrapy: comparing the Django and Scrapy workflows
Django:
# create a django project
django-admin startproject mysite
cd mysite
# create apps
python manage.py startapp app01
python manage.py startapp app02
# run the project
python manage.py runserver
Scrapy:
# create a scrapy project
scrapy startproject cjk
cd cjk
# create spiders
scrapy genspider chouti chouti.com
scrapy genspider cnblogs cnblogs.com
# run a spider
scrapy crawl chouti
After installing Scrapy, run `scrapy` at the command line to check the installation; if you see output like the following, it installed correctly:
Last login: Sat Jan  5 18:14:13 on ttys000
chenjunkandeMBP:~ chenjunkan$ scrapy
Scrapy 1.5.1 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
The layout of a Scrapy project directory:
Create a project:

scrapy startproject <project_name>

<project_name>
    <project_name>/
        spiders/          # spider files
            chouti.py
            cnblogs.py
        items.py          # persistence (item definitions)
        pipelines.py      # persistence (pipelines)
        middlewares.py    # middleware
        settings.py       # configuration (crawling)
    scrapy.cfg            # configuration (deployment)
How do we start a spider? Consider this simple example:
# -*- coding: utf-8 -*-
import scrapy


class ChoutiSpider(scrapy.Spider):
    # name of the spider
    name = 'chouti'
    # focused crawling: only URLs under the domain dig.chouti.com will be crawled
    allowed_domains = ['dig.chouti.com']
    # the starting URLs
    start_urls = ['http://dig.chouti.com/']

    # callback invoked automatically once a start URL has been downloaded
    def parse(self, response):
        # <200 https://dig.chouti.com/> <class 'scrapy.http.response.html.HtmlResponse'>
        print(response, type(response))
Reading the source (from scrapy.http.response.html import HtmlResponse) shows that HtmlResponse inherits from TextResponse, which in turn inherits from Response, so HtmlResponse has all the parent classes' methods (text, xpath, and so on).
Note: the example above simply prints the returned response and its type, but there are a few points to be aware of:
- chouti.com publishes an anti-crawling (robots) policy, so in settings we need to change ROBOTSTXT_OBEY to False (it defaults to True); with False, the spider no longer honors the robots protocol
- the response handed to the callback is actually an HtmlResponse instance: an object wrapping all of the response data for the request
- what's the difference between scrapy crawl chouti and scrapy crawl chouti --nolog? The former prints the log output, the latter suppresses it
3. An overview of how Scrapy works (covered in detail later)
Scrapy is an asynchronous, non-blocking framework driven by an event loop: it is built on top of Twisted, and internally the event-loop mechanism is what gives the crawler its concurrency.
This is how I used to fetch multiple URLs: sending the requests one at a time
import requests

url_list = ['http://www.baidu.com', 'http://www.baidu.com', 'http://www.baidu.com']
for item in url_list:
    response = requests.get(item)
    print(response.text)
Now I can do it this way instead:
from twisted.web.client import getPage, defer
from twisted.internet import reactor


# part 1: the agent starts accepting tasks
def callback(contents):
    print(contents)


deferred_list = []

# the list of URLs I need to request
url_list = ['http://www.bing.com', 'https://segmentfault.com/', 'https://stackoverflow.com/']
for url in url_list:
    # does not send the request right away; returns a Deferred object instead
    deferred = getPage(bytes(url, encoding='utf8'))
    # callback invoked when the request completes
    deferred.addCallback(callback)
    # collect all the deferred objects in one list
    deferred_list.append(deferred)

# part 2: once the agent has finished every task, stop
dlist = defer.DeferredList(deferred_list)


def all_done(arg):
    reactor.stop()


dlist.addBoth(all_done)

# part 3: set the agent to work
reactor.run()
4. Persistence
Traditionally, persistence means writing the crawled data straight to a file, which has several drawbacks, for example:
- there is no way to open the connection once when the spider starts and close it once when the spider stops
- responsibilities are not cleanly separated
Traditional persistence by writing to a file:
# -*- coding: utf-8 -*-
import scrapy


class ChoutiSpider(scrapy.Spider):
    # name of the spider
    name = 'chouti'
    # focused crawling: only URLs under the domain dig.chouti.com will be crawled
    allowed_domains = ['dig.chouti.com']
    # the starting URLs
    start_urls = ['http://dig.chouti.com/']

    # callback invoked automatically once a start URL has been downloaded
    def parse(self, response):
        f = open('news.log', mode='a+')
        item_list = response.xpath('//div[@id="content-list"]/div[@class="item"]')
        for item in item_list:
            text = item.xpath('.//a/text()').extract_first()
            href = item.xpath('.//a/@href').extract_first()
            print(href, text.strip())
            f.write(href + '\n')
        f.close()
This is where two important Scrapy modules come in: Item and Pipelines.
Steps to implement a pipeline in Scrapy:
a. Configure ITEM_PIPELINES in settings: you can register several pipelines here; the number after each one is its priority, and lower numbers run first
ITEM_PIPELINES = {
    'cjk.pipelines.CjkPipeline': 300,
}
b. Once ITEM_PIPELINES is configured, the process_item(self, item, spider) method in pipelines.py is triggered automatically. But with only this change, running the spider does not yet print the message chenjunkan we expect;
class CjkscrapyPipeline(object):
    def process_item(self, item, spider):
        print("chenjunkan")
        return item
c. So first we add two fields in items.py, constraining the item to carry exactly these two fields
import scrapy


class CjkscrapyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    href = scrapy.Field()
    title = scrapy.Field()
Then in the spider we add: yield CjkscrapyItem(href=href, title=text). This instantiates CjkscrapyItem with two arguments, and the two fields declared on the class are what receive them
# -*- coding: utf-8 -*-
import scrapy
from cjkscrapy.items import CjkscrapyItem


class ChoutiSpider(scrapy.Spider):
    # name of the spider
    name = 'chouti'
    # focused crawling: only URLs under the domain dig.chouti.com will be crawled
    allowed_domains = ['dig.chouti.com']
    # the starting URLs
    start_urls = ['http://dig.chouti.com/']

    # callback invoked automatically once a start URL has been downloaded
    def parse(self, response):
        item_list = response.xpath('//div[@id="content-list"]/div[@class="item"]')
        for item in item_list:
            text = item.xpath('.//a/text()').extract_first()
            href = item.xpath('.//a/@href').extract_first()
            yield CjkscrapyItem(href=href, title=text)
d. Next, process_item in pipelines is triggered automatically: each time the spider yields, process_item is called once, and the item parameter is the object we created from CjkscrapyItem. Why use the Item class at all? Because it lets us constrain, for the pipeline, exactly which data gets persisted: whatever fields the item declares are the fields we fetch. And what is the spider parameter of process_item? It is the instantiated spider itself: for ChoutiSpider to run it must first be instantiated, so spider is simply the current spider object, carrying attributes such as name and allowed_domains
e. So when a spider yields an Item, the item is handed to our pipelines for processing; when it yields a Request, the request goes back to be downloaded again
Summary:
a. write the pipeline class

class XXXPipeline(object):
    def process_item(self, item, spider):
        return item

b. write the Item class

class XdbItem(scrapy.Item):
    href = scrapy.Field()
    title = scrapy.Field()

c. configuration

ITEM_PIPELINES = {
    'xdb.pipelines.XdbPipeline': 300,
}

d. in the spider, every yield of an Item object triggers one call to process_item.

yield Item(...)
Now we have a first grasp of persistence, but one problem remains: process_item is triggered once per yield, and if we keep opening and closing the connection inside process_item every time, performance takes a serious hit. Example:
def process_item(self, item, spider):
    f = open("xx.log", "a+")
    f.write(item["href"] + "\n")
    f.close()
    return item
CjkscrapyPipeline actually supports two more methods, open_spider and close_spider, so we can open the connection in open_spider and close it in close_spider, avoiding the repeated open/close:
class CjkscrapyPipeline(object):
    def open_spider(self, spider):
        print("spider started")
        self.f = open("new.log", "a+")

    def process_item(self, item, spider):
        self.f.write(item["href"] + "\n")
        return item

    def close_spider(self, spider):
        self.f.close()
        print("spider finished")
The code above does work, but on closer inspection the style is a bit sloppy (attributes should be declared in __init__), so:
class CjkscrapyPipeline(object):
    def __init__(self):
        self.f = None

    def open_spider(self, spider):
        print("spider started")
        self.f = open("new.log", "a+")

    def process_item(self, item, spider):
        self.f.write(item["href"] + "\n")
        return item

    def close_spider(self, spider):
        self.f.close()
        print("spider finished")
Looking at the code again, the output file path is hard-coded in the program. Can we move it into the configuration file instead? For that, CjkscrapyPipeline supports one more method, the classmethod from_crawler:
@classmethod
def from_crawler(cls, crawler):
    print('File.from_crawler')
    path = crawler.settings.get('HREF_FILE_PATH')
    return cls(path)
Note: crawler.settings.get('HREF_FILE_PATH') looks up HREF_FILE_PATH across all the configuration files; cls in this method refers to the current class CjkscrapyPipeline, and the returned cls(path) is an instance with path passed in at initialization:
class CjkscrapyPipeline(object):
    def __init__(self, path):
        self.f = None
        self.path = path

    @classmethod
    def from_crawler(cls, crawler):
        print('File.from_crawler')
        path = crawler.settings.get('HREF_FILE_PATH')
        return cls(path)

    def open_spider(self, spider):
        print("spider started")
        self.f = open(self.path, "a+")

    def process_item(self, item, spider):
        self.f.write(item["href"] + "\n")
        return item

    def close_spider(self, spider):
        self.f.close()
        print("spider finished")
So CjkscrapyPipeline now has five methods. In what order do they run?
"""
Source-level behavior:
1. check whether the CjkscrapyPipeline class defines from_crawler
   if yes: obj = CjkscrapyPipeline.from_crawler(....)
   if no:  obj = CjkscrapyPipeline()
2. obj.open_spider()
3. obj.process_item() / obj.process_item() / obj.process_item() / ... (once per item)
4. obj.close_spider()
"""
Note: Scrapy first checks whether from_crawler exists. If it does not, the CjkscrapyPipeline class is instantiated directly; if it does, from_crawler runs first, reads the settings, and returns an instance of the class (and instantiation calls __init__).
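That lookup-then-instantiate step can be sketched in plain Python. This is a simplified stand-in, not Scrapy's actual code: a plain dict plays the role of crawler.settings, and build_pipeline is a hypothetical helper mimicking the engine's check.

```python
class CjkscrapyPipeline(object):
    def __init__(self, path):
        self.path = path

    @classmethod
    def from_crawler(cls, crawler_settings):
        # the real method calls crawler.settings.get(...); a dict stands in here
        return cls(crawler_settings.get('HREF_FILE_PATH'))


def build_pipeline(cls, settings):
    # mirror scrapy's behavior: prefer from_crawler when the class defines it
    if hasattr(cls, 'from_crawler'):
        return cls.from_crawler(settings)
    return cls()


obj = build_pipeline(CjkscrapyPipeline, {'HREF_FILE_PATH': 'news.log'})
```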
What is the return item at the end of process_item in a pipeline for?
from scrapy.exceptions import DropItem
# return item       # hand the item on to the next pipeline's process_item
# raise DropItem()  # the process_item methods of later pipelines will not run
Note: pipelines are shared by all spiders; to customize behavior for a particular spider, use the spider argument and handle it yourself.
# if spider.name == 'chouti':
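To see how return item and raise DropItem interact along the chain, here is a minimal sketch. DropItem is redefined locally only so the sketch runs standalone (in real code it comes from scrapy.exceptions), and run_pipelines is a hypothetical stand-in for what the engine does.

```python
class DropItem(Exception):
    """stand-in for scrapy.exceptions.DropItem"""


class FirstPipeline(object):
    def process_item(self, item, spider):
        if spider == 'chouti':          # per-spider customization via the spider argument
            item['tag'] = 'from chouti'
        return item                      # handed to the next pipeline


class SecondPipeline(object):
    def process_item(self, item, spider):
        if not item.get('href'):
            raise DropItem()             # later pipelines never see this item
        return item


def run_pipelines(pipelines, item, spider):
    # the engine feeds each pipeline's return value into the next one
    for p in pipelines:
        try:
            item = p.process_item(item, spider)
        except DropItem:
            return None
    return item


result = run_pipelines([FirstPipeline(), SecondPipeline()],
                       {'href': 'https://dig.chouti.com/'}, 'chouti')
```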
5. Deduplication rules
Using the earlier code as an example:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from cjk.items import CjkItem


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['http://dig.chouti.com/']

    def parse(self, response):
        print(response.request.url)
        # item_list = response.xpath('//div[@id="content-list"]/div[@class="item"]')
        # for item in item_list:
        #     text = item.xpath('.//a/text()').extract_first()
        #     href = item.xpath('.//a/@href').extract_first()
        page_list = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        for page in page_list:
            page = "https://dig.chouti.com" + page
            yield Request(url=page, callback=self.parse)  # https://dig.chouti.com/all/hot/recent/2
Note: print(response.request.url) prints the URL currently being processed. Running the code above, pages 1 through 120 come back with no repeats: Scrapy deduplicates internally, effectively keeping a set in memory, and a set cannot contain duplicates.
A look at the source:
Import: from scrapy.dupefilter import RFPDupeFilter
RFPDupeFilter is a class:
class RFPDupeFilter(BaseDupeFilter):
    """Request Fingerprint duplicates filter"""

    def __init__(self, path=None, debug=False):
        self.file = None
        self.fingerprints = set()
        self.logdupes = True
        self.debug = debug
        self.logger = logging.getLogger(__name__)
        if path:
            self.file = open(os.path.join(path, 'requests.seen'), 'a+')
            self.file.seek(0)
            self.fingerprints.update(x.rstrip() for x in self.file)

    @classmethod
    def from_settings(cls, settings):
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(job_dir(settings), debug)

    def request_seen(self, request):
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)

    def request_fingerprint(self, request):
        return request_fingerprint(request)

    def close(self, reason):
        if self.file:
            self.file.close()

    def log(self, request, spider):
        if self.debug:
            msg = "Filtered duplicate request: %(request)s"
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
        elif self.logdupes:
            msg = ("Filtered duplicate request: %(request)s"
                   " - no more duplicates will be shown"
                   " (see DUPEFILTER_DEBUG to show all duplicates)")
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
            self.logdupes = False

        spider.crawler.stats.inc_value('dupefilter/filtered', spider=spider)
This class has one crucial method, request_seen, which directly decides whether a request has already been visited. When we yield a Request, request_seen is called internally first to check whether the URL has been seen. The method on its own:
def request_seen(self, request):
    fp = self.request_fingerprint(request)
    if fp in self.fingerprints:
        return True
    self.fingerprints.add(fp)
    if self.file:
        self.file.write(fp + os.linesep)
Note: a. First, fp = self.request_fingerprint(request) runs; request is the request we passed in, which carries the URL. Since URLs vary in length, fp can be thought of as an MD5-style hash of the URL produced by request_fingerprint; every fingerprint then has the same length, which makes them easy to work with;
b. Then if fp in self.fingerprints: runs. self.fingerprints is initialized in __init__ as a set (self.fingerprints = set()). If fp is already in the set, the method returns True, meaning the URL has been visited before and will not be fetched again
c. If it has not been visited, self.fingerprints.add(fp) adds fp to the set
d. The visited fingerprints live in memory, but they can also be written to a file: if self.file: self.file.write(fp + os.linesep)
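The fingerprint-and-set logic can be imitated with the standard library. A simplifying assumption: Scrapy's real request_fingerprint hashes the method, the canonicalized URL, and the body, while this sketch just hashes the raw method and URL; the shape of request_seen, however, mirrors the source above.

```python
import hashlib
import os


class SimpleDupeFilter(object):
    def __init__(self, path=None):
        self.fingerprints = set()
        self.file = open(os.path.join(path, 'requests.seen'), 'a+') if path else None

    def request_fingerprint(self, method, url):
        # a fixed-length digest regardless of URL length, like the md5-style value above
        return hashlib.sha1((method + url).encode('utf8')).hexdigest()

    def request_seen(self, method, url):
        fp = self.request_fingerprint(method, url)
        if fp in self.fingerprints:
            return True          # already visited: drop the request
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)
        return False


df = SimpleDupeFilter()
first = df.request_seen('GET', 'https://dig.chouti.com/all/hot/recent/2')
second = df.request_seen('GET', 'https://dig.chouti.com/all/hot/recent/2')
```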
How do we write them to a file?
1. from_settings runs first and reads the configuration; set DUPEFILTER_DEBUG to True in settings if you want debug output:

    @classmethod
    def from_settings(cls, settings):
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(job_dir(settings), debug)

2. It returns cls(job_dir(settings), debug), where:

    import os

    def job_dir(settings):
        path = settings['JOBDIR']
        if path and not os.path.exists(path):
            os.makedirs(path)
        return path

   job_dir reads JOBDIR from settings, creates the directory if it does not exist, and returns the path, i.e. the value of JOBDIR in the configuration file
Writing fingerprints to a file is rarely that useful on its own, though; later I will show how to keep them in redis.
Custom dedup rules: the built-in RFPDupeFilter inherits from BaseDupeFilter, so our custom filter can inherit the same base class:
from scrapy.dupefilter import BaseDupeFilter
from scrapy.utils.request import request_fingerprint


class CjkDupeFilter(BaseDupeFilter):

    def __init__(self):
        self.visited_fd = set()

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        fd = request_fingerprint(request=request)
        if fd in self.visited_fd:
            return True
        self.visited_fd.add(fd)

    def open(self):  # can return deferred
        print('opened')

    def close(self, reason):  # can return a deferred
        print('closed')

    # def log(self, request, spider):  # log that a request has been filtered
    #     print('log')
The default dedup filter configuration: DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'
Register our custom filter in settings: DUPEFILTER_CLASS = 'cjkscrapy.dupefilters.CjkDupeFilter'
e.g. later we will build the dedup rules on redis
Note: dedup can also be controlled per request inside the spider file. Request has a dont_filter parameter; with the default False the dedup rules apply, and with dont_filter=True the request bypasses them
class Request(object_ref):

    def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                 cookies=None, meta=None, encoding='utf-8', priority=0,
                 dont_filter=False, errback=None, flags=None):
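How dont_filter combines with the dupe filter's answer can be condensed into one decision. This is a sketch of the scheduler's logic, not its real code, and enqueue_request is a hypothetical name:

```python
def enqueue_request(dupefilter_seen, dont_filter):
    # a request is dropped only when filtering is enabled AND it was seen before
    if not dont_filter and dupefilter_seen:
        return False    # duplicate: not scheduled
    return True         # handed to the scheduler queue
```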
6. Depth control (covered in detail later)
# -*- coding: utf-8 -*-
import scrapy
from scrapy.dupefilter import RFPDupeFilter
from scrapy.http import Request


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['http://dig.chouti.com/']

    def parse(self, response):
        print(response.request.url, response.meta.get("depth", 0))
        # item_list = response.xpath('//div[@id="content-list"]/div[@class="item"]')
        # for item in item_list:
        #     text = item.xpath('.//a/text()').extract_first()
        #     href = item.xpath('.//a/@href').extract_first()
        page_list = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        for page in page_list:
            page = "https://dig.chouti.com" + page
            yield Request(url=page, callback=self.parse, dont_filter=False)  # https://dig.chouti.com/all/hot/recent/2
Depth means how deep the crawl goes. To cap it, set DEPTH_LIMIT = 3 in the settings file.
7. Handling cookies manually
The parameters of the Request object:
class Request(object_ref):

    def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                 cookies=None, meta=None, encoding='utf-8', priority=0,
                 dont_filter=False, errback=None, flags=None):
For now we only care about these Request parameters: url, callback=None, method='GET', headers=None, body=None, cookies=None. Sending a request is essentially request headers + request body + cookies.
Example: logging in to chouti automatically:
The first request to chouti gives us a cookie. How do we extract the cookie from the response?
First import the cookie-handling class: from scrapy.http.cookies import CookieJar
# -*- coding: utf-8 -*-
import scrapy
from scrapy.dupefilter import RFPDupeFilter
from scrapy.http.cookies import CookieJar


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['http://dig.chouti.com/']

    def parse(self, response):
        cookie_dict = {}
        # pull the cookie out of the response headers; it ends up stored on the cookie_jar object
        cookie_jar = CookieJar()
        # extract the cookie from response and response.request
        cookie_jar.extract_cookies(response, response.request)

        # unpack the cookie from the object into a dict
        for k, v in cookie_jar._cookies.items():
            for i, j in v.items():
                for m, n in j.items():
                    cookie_dict[m] = n.value
        print(cookie_dict)
Running the code above, we get the cookie:
chenjunkandeMBP:cjkscrapy chenjunkan$ scrapy crawl chouti --nolog
/Users/chenjunkan/Desktop/scrapytest/cjkscrapy/cjkscrapy/spiders/chouti.py:3: ScrapyDeprecationWarning: Module `scrapy.dupefilter` is deprecated, use `scrapy.dupefilters` instead
  from scrapy.dupefilter import RFPDupeFilter
{'gpsd': 'd052f4974404d8c431f3c7c1615694c4', 'JSESSIONID': 'aaaUxbbxMYOWh4T7S7rGw'}
A complete example that logs in to chouti and upvotes automatically:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.dupefilter import RFPDupeFilter
from scrapy.http.cookies import CookieJar
from scrapy.http import Request


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['http://dig.chouti.com/']
    cookie_dict = {}

    def parse(self, response):

        # pull the cookie out of the response headers; it is stored on the cookie_jar object
        cookie_jar = CookieJar()
        # extract the cookie from response and response.request
        cookie_jar.extract_cookies(response, response.request)

        # unpack the cookie from the object into a dict
        for k, v in cookie_jar._cookies.items():
            for i, j in v.items():
                for m, n in j.items():
                    self.cookie_dict[m] = n.value
        print(self.cookie_dict)

        yield Request(
            url='https://dig.chouti.com/login',
            method='POST',
            body="phone=8618357186730&password=cjk139511&oneMonth=1",
            cookies=self.cookie_dict,
            headers={
                'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'
            },
            callback=self.check_login
        )

    def check_login(self, response):
        print(response.text)
        yield Request(
            url='https://dig.chouti.com/all/hot/recent/1',
            cookies=self.cookie_dict,
            callback=self.index
        )

    def index(self, response):
        news_list = response.xpath('//div[@id="content-list"]/div[@class="item"]')
        for new in news_list:
            link_id = new.xpath('.//div[@class="part2"]/@share-linkid').extract_first()
            yield Request(
                url='http://dig.chouti.com/link/vote?linksId=%s' % (link_id,),
                method='POST',
                cookies=self.cookie_dict,
                callback=self.check_result
            )
        # upvote on every page
        page_list = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        for page in page_list:
            page = "https://dig.chouti.com" + page
            yield Request(url=page, callback=self.index)  # https://dig.chouti.com/all/hot/recent/2

    def check_result(self, response):
        print(response.text)
Output:
chenjunkandeMBP:cjkscrapy chenjunkan$ scrapy crawl chouti --nolog
/Users/chenjunkan/Desktop/scrapytest/cjkscrapy/cjkscrapy/spiders/chouti.py:3: ScrapyDeprecationWarning: Module `scrapy.dupefilter` is deprecated, use `scrapy.dupefilters` instead
  from scrapy.dupefilter import RFPDupeFilter
{'gpsd': '78613f08c985435d5d0eedc08b0ed812', 'JSESSIONID': 'aaaTmxnFAJJGn9Muf8rGw'}
{"result":{"code":"9999", "message":"", "data":{"complateReg":"0","destJid":"cdu_53587312848"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692112533000","lvCount":"6","nick":"chenjunkan","uvCount":"26","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692112720000","lvCount":"28","nick":"chenjunkan","uvCount":"27","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692112862000","lvCount":"24","nick":"chenjunkan","uvCount":"33","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692112849000","lvCount":"29","nick":"chenjunkan","uvCount":"33","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692112872000","lvCount":"48","nick":"chenjunkan","uvCount":"33","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692112877000","lvCount":"23","nick":"chenjunkan","uvCount":"33","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692112877000","lvCount":"69","nick":"chenjunkan","uvCount":"33","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692112877000","lvCount":"189","nick":"chenjunkan","uvCount":"33","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692112926000","lvCount":"98","nick":"chenjunkan","uvCount":"35","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692112951000","lvCount":"61","nick":"chenjunkan","uvCount":"35","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113086000","lvCount":"13","nick":"chenjunkan","uvCount":"37","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113097000","lvCount":"17","nick":"chenjunkan","uvCount":"38","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113118000","lvCount":"21","nick":"chenjunkan","uvCount":"41","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113155000","lvCount":"86","nick":"chenjunkan","uvCount":"41","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113140000","lvCount":"22","nick":"chenjunkan","uvCount":"41","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113148000","lvCount":"25","nick":"chenjunkan","uvCount":"41","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113416000","lvCount":"22","nick":"chenjunkan","uvCount":"47","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113386000","lvCount":"13","nick":"chenjunkan","uvCount":"46","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113381000","lvCount":"70","nick":"chenjunkan","uvCount":"46","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113408000","lvCount":"27","nick":"chenjunkan","uvCount":"47","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113402000","lvCount":"17","nick":"chenjunkan","uvCount":"47","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113428000","lvCount":"41","nick":"chenjunkan","uvCount":"47","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113528000","lvCount":"55","nick":"chenjunkan","uvCount":"49","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113544000","lvCount":"16","nick":"chenjunkan","uvCount":"49","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113643000","lvCount":"22","nick":"chenjunkan","uvCount":"50","voteTime":"小于1分钟前"}}}
8. Customizing the start requests
Before talking about the start URLs, let's review iterables and iterators. What is an iterable? A list is one:
# an iterable
# li = [11, 22, 33]
# convert the iterable into an iterator
# iter(li)
A generator can be converted into an iterator too; a generator is a special kind of iterator
def func():
    yield 11
    yield 22
    yield 33


# a generator
li = func()
# convert the generator into an iterator
iter(li)
v = li.__next__()
print(v)
We know the crawler's scheduler only accepts Request objects, which it then hands to the downloader. So when the spider starts, the framework does not begin by calling parse; it first wraps each URL in start_urls in a Request object and passes those requests to the scheduler. The flow:
"""
The scrapy engine pulls the start URLs from the spider:
1. call start_requests and take its return value
2. v = iter(return value)   # a generator or an iterator both work, since it gets converted to an iterator
3. req1 = v.__next__()
   req2 = v.__next__()
   req3 = v.__next__()
   ...
4. all the requests go into the scheduler
"""
Note: when the scheduler comes for the Request objects, it does not read start_urls directly; it calls the internal start_requests method, falling back to the parent class if the spider does not define one
Two ways to write start_requests:
Option 1: start_requests as a generator function (internally the generator is converted into an iterator)
def start_requests(self):
    # option 1:
    for url in self.start_urls:
        yield Request(url=url)
Option 2: start_requests returns an iterable object
def start_requests(self):
    # option 2:
    req_list = []
    for url in self.start_urls:
        req_list.append(Request(url=url))
    return req_list
When we do not define start_requests ourselves, the base class provides one: see scrapy.Spider
The source:
def start_requests(self):
    cls = self.__class__
    if method_is_overridden(cls, Spider, 'make_requests_from_url'):
        warnings.warn(
            "Spider.make_requests_from_url method is deprecated; it "
            "won't be called in future Scrapy releases. Please "
            "override Spider.start_requests method instead (see %s.%s)." % (
                cls.__module__, cls.__name__
            ),
        )
        for url in self.start_urls:
            yield self.make_requests_from_url(url)
    else:
        for url in self.start_urls:
            yield Request(url, dont_filter=True)
If we define start_requests ourselves, ours takes priority over the default. The default sends GET requests, so by overriding it we can switch to POST, for example
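A sketch of that override, using a tiny stand-in Request class so it runs without Scrapy (in a real spider you would yield scrapy.http.Request with method='POST'):

```python
class FakeRequest(object):
    # stand-in for scrapy.http.Request, holding only what the sketch needs
    def __init__(self, url, method='GET'):
        self.url = url
        self.method = method


class MySpider(object):
    start_urls = ['http://dig.chouti.com/']

    def start_requests(self):
        # our override wins over the base class and switches GET to POST
        for url in self.start_urls:
            yield FakeRequest(url, method='POST')


# the engine converts the return value to an iterator and drains it
requests = list(iter(MySpider().start_requests()))
```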
e.g. the start URLs could also be fetched from redis
9. Depth and priority
# -*- coding: utf-8 -*-
import scrapy
from scrapy.dupefilter import RFPDupeFilter
from scrapy.http.cookies import CookieJar
from scrapy.http import Request


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['http://dig.chouti.com/']

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        print(response.meta.get("depth", 0))
        page_list = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        for page in page_list:
            page = "https://dig.chouti.com" + page
            yield Request(url=page, callback=self.parse)  # https://dig.chouti.com/all/hot/recent/2
When we yield a Request, the request goes into the scheduler and is then handed to the downloader, passing through quite a few middlewares on the way, which gives those middlewares a chance to do extra work. Import: from scrapy.spidermiddlewares.depth import DepthMiddleware
DepthMiddleware:
class DepthMiddleware(object):

    def __init__(self, maxdepth, stats=None, verbose_stats=False, prio=1):
        self.maxdepth = maxdepth
        self.stats = stats
        self.verbose_stats = verbose_stats
        self.prio = prio

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        maxdepth = settings.getint('DEPTH_LIMIT')
        verbose = settings.getbool('DEPTH_STATS_VERBOSE')
        prio = settings.getint('DEPTH_PRIORITY')
        return cls(maxdepth, crawler.stats, verbose, prio)

    def process_spider_output(self, response, result, spider):
        def _filter(request):
            if isinstance(request, Request):
                depth = response.meta['depth'] + 1
                request.meta['depth'] = depth
                if self.prio:
                    request.priority -= depth * self.prio
                if self.maxdepth and depth > self.maxdepth:
                    logger.debug(
                        "Ignoring link (depth > %(maxdepth)d): %(requrl)s ",
                        {'maxdepth': self.maxdepth, 'requrl': request.url},
                        extra={'spider': spider}
                    )
                    return False
                elif self.stats:
                    if self.verbose_stats:
                        self.stats.inc_value('request_depth_count/%s' % depth,
                                             spider=spider)
                    self.stats.max_value('request_depth_max', depth,
                                         spider=spider)
            return True

        # base case (depth=0)
        if self.stats and 'depth' not in response.meta:
            response.meta['depth'] = 0
            if self.verbose_stats:
                self.stats.inc_value('request_depth_count/0', spider=spider)

        return (r for r in result or () if _filter(r))
The method invoked as results pass through this middleware is process_spider_output:
Notes:
1. if self.stats and 'depth' not in response.meta:
   On the very first request, Request is created with meta=None, so the returned response's meta is empty as well; response.request is the request we originally built. Importing scrapy.http.response shows that response.meta is a property:

       @property
       def meta(self):
           try:
               return self.request.meta
           except AttributeError:
               raise AttributeError(
                   "Response.meta not available, this response "
                   "is not tied to any request"
               )

   so response.meta = response.request.meta.
2. When we yield a Request, process_spider_output is triggered, and when the condition above holds the base depth is recorded: response.meta['depth'] = 0
3. return (r for r in result or () if _filter(r)): result is the sequence of yielded Request objects; each one is run through the _filter method
4. If _filter returns True, the request is allowed and handed to the scheduler; if it returns False, the request is dropped:

       if isinstance(request, Request):
           depth = response.meta['depth'] + 1
           request.meta['depth'] = depth

   for a Request object, the depth is incremented by 1
5. if self.maxdepth and depth > self.maxdepth: with maxdepth = settings.getint('DEPTH_LIMIT'); if the depth exceeds the limit we configured, the request is dropped
Priority policy: the deeper the request, the lower its priority; the default priority is 0
if self.prio:
    request.priority -= depth * self.prio
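For example, with DEPTH_PRIORITY = 1, a request that is three levels deep starts from the default priority 0 and ends up at -3:

```python
def adjusted_priority(priority, depth, depth_priority=1):
    # mirrors DepthMiddleware: request.priority -= depth * self.prio
    return priority - depth * depth_priority


# default priority 0 at depths 0..3
values = [adjusted_priority(0, d) for d in range(4)]
```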
10. Proxies
Scrapy's built-in proxy support: when the scheduler hands a URL to the downloader, the request passes through middlewares on the way, and the built-in proxy support lives in one of them
Path: /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/downloadermiddlewares/httpproxy.py
The HttpProxyMiddleware class:
import base64
from six.moves.urllib.request import getproxies, proxy_bypass
from six.moves.urllib.parse import unquote
try:
    from urllib2 import _parse_proxy
except ImportError:
    from urllib.request import _parse_proxy
from six.moves.urllib.parse import urlunparse

from scrapy.utils.httpobj import urlparse_cached
from scrapy.exceptions import NotConfigured
from scrapy.utils.python import to_bytes


class HttpProxyMiddleware(object):

    def __init__(self, auth_encoding='latin-1'):
        self.auth_encoding = auth_encoding
        self.proxies = {}
        for type, url in getproxies().items():
            self.proxies[type] = self._get_proxy(url, type)

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('HTTPPROXY_ENABLED'):
            raise NotConfigured
        auth_encoding = crawler.settings.get('HTTPPROXY_AUTH_ENCODING')
        return cls(auth_encoding)

    def _basic_auth_header(self, username, password):
        user_pass = to_bytes(
            '%s:%s' % (unquote(username), unquote(password)),
            encoding=self.auth_encoding)
        return base64.b64encode(user_pass).strip()

    def _get_proxy(self, url, orig_type):
        proxy_type, user, password, hostport = _parse_proxy(url)
        proxy_url = urlunparse((proxy_type or orig_type, hostport, '', '', '', ''))

        if user:
            creds = self._basic_auth_header(user, password)
        else:
            creds = None

        return creds, proxy_url

    def process_request(self, request, spider):
        # ignore if proxy is already set
        if 'proxy' in request.meta:
            if request.meta['proxy'] is None:
                return
            # extract credentials if present
            creds, proxy_url = self._get_proxy(request.meta['proxy'], '')
            request.meta['proxy'] = proxy_url
            if creds and not request.headers.get('Proxy-Authorization'):
                request.headers['Proxy-Authorization'] = b'Basic ' + creds
            return
        elif not self.proxies:
            return

        parsed = urlparse_cached(request)
        scheme = parsed.scheme

        # 'no_proxy' is only supported by http schemes
        if scheme in ('http', 'https') and proxy_bypass(parsed.hostname):
            return

        if scheme in self.proxies:
            self._set_proxy(request, scheme)

    def _set_proxy(self, request, scheme):
        creds, proxy = self.proxies[scheme]
        request.meta['proxy'] = proxy
        if creds:
            request.headers['Proxy-Authorization'] = b'Basic ' + creds
def process_request(self, request, spider):
    # ignore if proxy is already set
    if 'proxy' in request.meta:
        if request.meta['proxy'] is None:
            return
        # extract credentials if present
        creds, proxy_url = self._get_proxy(request.meta['proxy'], '')
        request.meta['proxy'] = proxy_url
        if creds and not request.headers.get('Proxy-Authorization'):
            request.headers['Proxy-Authorization'] = b'Basic ' + creds
        return
    elif not self.proxies:
        return

    parsed = urlparse_cached(request)
    scheme = parsed.scheme

    # 'no_proxy' is only supported by http schemes
    if scheme in ('http', 'https') and proxy_bypass(parsed.hostname):
        return

    if scheme in self.proxies:
        self._set_proxy(request, scheme)
Notes:
if scheme in self.proxies:
    self._set_proxy(request, scheme)
What is self.proxies? It starts out as an empty dict (self.proxies = {}); adding a proxy essentially just means adding a bit of data to the request headers:
for type, url in getproxies().items():
    self.proxies[type] = self._get_proxy(url, type)
Next: getproxies = getproxies_environment
def getproxies_environment():
    """Return a dictionary of scheme -> proxy server URL mappings.

    Scan the environment for variables named <scheme>_proxy;
    this seems to be the standard convention.  If you need a
    different way, you can pass a proxies dictionary to the
    [Fancy]URLopener constructor.
    """
    proxies = {}
    # in order to prefer lowercase variables, process environment in
    # two passes: first matches any, second pass matches lowercase only
    for name, value in os.environ.items():
        name = name.lower()
        if value and name[-6:] == '_proxy':
            proxies[name[:-6]] = value
    # CVE-2016-1000110 - If we are running as CGI script, forget HTTP_PROXY
    # (non-all-lowercase) as it may be set from the web server by a "Proxy:"
    # header from the client
    # If "proxy" is lowercase, it will still be used thanks to the next block
    if 'REQUEST_METHOD' in os.environ:
        proxies.pop('http', None)
    for name, value in os.environ.items():
        if name[-6:] == '_proxy':
            name = name.lower()
            if value:
                proxies[name[:-6]] = value
            else:
                proxies.pop(name[:-6], None)
    return proxies
Explanation: a. Proxies come from environment variables via os.environ: any variable whose name ends in _proxy is picked up, so a proxy can be configured simply by exporting such a variable. Individual values can be read back with os.environ.get:
import os

v = os.environ.get('XXX')
print(v)
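The suffix scan can be sketched without touching the real environment. This is a simplified stand-in for getproxies_environment (hypothetical helper name; the real function also handles the CGI REQUEST_METHOD case and a two-pass lowercase preference):

```python
# Simplified sketch of the <scheme>_proxy environment scan.
def scan_proxies(environ):
    proxies = {}
    for name, value in environ.items():
        name = name.lower()
        if value and name[-6:] == '_proxy':
            proxies[name[:-6]] = value  # strip the '_proxy' suffix
    return proxies

env = {'HTTP_PROXY': 'http://192.168.3.3:8080',
       'https_proxy': 'http://192.168.3.4:8080',
       'PATH': '/usr/bin'}
print(scan_proxies(env))
# {'http': 'http://192.168.3.3:8080', 'https': 'http://192.168.3.4:8080'}
```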
b. The function returns a mapping like:

proxies = {
    'http': '192.168.3.3',
    'https': '192.168.3.4',
}
c. Next, each entry is processed by self.proxies[type] = self._get_proxy(url, type):
def _get_proxy(self, url, orig_type):
    proxy_type, user, password, hostport = _parse_proxy(url)
    proxy_url = urlunparse((proxy_type or orig_type, hostport, '', '', '', ''))

    if user:
        creds = self._basic_auth_header(user, password)
    else:
        creds = None

    return creds, proxy_url
Using the built-in proxy support:
Simply set the proxy variables in os.environ before the spider starts issuing requests:

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    cookie_dict = {}

    def start_requests(self):
        import os
        os.environ['HTTPS_PROXY'] = "http://root:[email protected]:9999/"
        os.environ['HTTP_PROXY'] = '19.11.2.32'
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse)
There is a second way to use the built-in proxy support: set meta on the request inside the spider.
Source:
if 'proxy' in request.meta:
    if request.meta['proxy'] is None:
        return
    # extract credentials if present
    creds, proxy_url = self._get_proxy(request.meta['proxy'], '')
    request.meta['proxy'] = proxy_url
    if creds and not request.headers.get('Proxy-Authorization'):
        request.headers['Proxy-Authorization'] = b'Basic ' + creds
    return
elif not self.proxies:
    return
The meta approach:
class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    cookie_dict = {}

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse,
                          meta={'proxy': 'http://root:[email protected]:9999/'})
Custom proxy middleware:
Example:
import base64
import random
from six.moves.urllib.parse import unquote, urlunparse

try:
    from urllib2 import _parse_proxy
except ImportError:
    from urllib.request import _parse_proxy

from scrapy.utils.python import to_bytes


class CjkProxyMiddleware(object):
    def _basic_auth_header(self, username, password):
        user_pass = to_bytes(
            '%s:%s' % (unquote(username), unquote(password)),
            encoding='latin-1')
        return base64.b64encode(user_pass).strip()

    def process_request(self, request, spider):
        PROXIES = [
            "http://root:[email protected]:9999/",
            "http://root:[email protected]:9999/",
            "http://root:[email protected]:9999/",
            "http://root:[email protected]:9999/",
            "http://root:[email protected]:9999/",
            "http://root:[email protected]:9999/",
        ]
        url = random.choice(PROXIES)
        orig_type = ""
        proxy_type, user, password, hostport = _parse_proxy(url)
        proxy_url = urlunparse((proxy_type or orig_type, hostport, '', '', '', ''))

        if user:
            creds = self._basic_auth_header(user, password)
        else:
            creds = None

        request.meta['proxy'] = proxy_url
        if creds:
            request.headers['Proxy-Authorization'] = b'Basic ' + creds
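The Proxy-Authorization value is just base64 of "user:password". A standalone version of the _basic_auth_header logic above (the credentials are made up):

```python
import base64
from urllib.parse import unquote

def basic_auth_header(username, password):
    # mirrors the middleware's _basic_auth_header
    user_pass = ('%s:%s' % (unquote(username), unquote(password))).encode('latin-1')
    return base64.b64encode(user_pass).strip()

print(b'Basic ' + basic_auth_header('root', 'woshiniba'))
# b'Basic cm9vdDp3b3NoaW5pYmE='
```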
11. Selectors (covered in more detail later)
Example:
html = """<!DOCTYPE html> <html> <head lang="en"> <meta charset="UTF-8"> <title></title> </head> <body> <ul> <li class="item-"><a id=‘i1‘ href="link.html">first item</a></li> <li class="item-0"><a id=‘i2‘ href="llink.html">first item</a></li> <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li> </ul> <div><a href="llink2.html">second item</a></div> </body> </html> """ from scrapy.http import HtmlResponse from scrapy.selector import Selector response = HtmlResponse(url=‘http://example.com‘, body=html,encoding=‘utf-8‘) # hxs = Selector(response) # hxs.xpath() response.xpath(‘‘)
12. Downloader middleware
Source (the default template generated in middlewares.py):
class CjkscrapyDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
Custom downloader middleware: define the classes in a file (here md.py), then register them in settings:
DOWNLOADER_MIDDLEWARES = {
    # 'cjkscrapy.middlewares.CjkscrapyDownloaderMiddleware': 543,
    'cjkscrapy.md.Md1': 666,
    'cjkscrapy.md.Md2': 667,
}
md.py
class Md1(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    def process_request(self, request, spider):
        print('md1.process_request', request)

    def process_response(self, request, response, spider):
        print('m1.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        pass


class Md2(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    def process_request(self, request, spider):
        print('md2.process_request', request)

    def process_response(self, request, response, spider):
        print('m2.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        pass
Output:
chenjunkandeMBP:cjkscrapy chenjunkan$ scrapy crawl chouti --nolog
md1.process_request <GET http://dig.chouti.com/>
md2.process_request <GET http://dig.chouti.com/>
m2.process_response <GET http://dig.chouti.com/> <301 http://dig.chouti.com/>
m1.process_response <GET http://dig.chouti.com/> <301 http://dig.chouti.com/>
md1.process_request <GET https://dig.chouti.com/>
md2.process_request <GET https://dig.chouti.com/>
m2.process_response <GET https://dig.chouti.com/> <200 https://dig.chouti.com/>
m1.process_response <GET https://dig.chouti.com/> <200 https://dig.chouti.com/>
response <200 https://dig.chouti.com/>
a. Returning a Response object
from scrapy.http import HtmlResponse
from scrapy.http import Request


class Md1(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    def process_request(self, request, spider):
        print('md1.process_request', request)
        # 1. Return a Response: do the download ourselves and short-circuit
        import requests
        result = requests.get(request.url)
        return HtmlResponse(url=request.url, status=200, headers=None,
                            body=result.content)

    def process_response(self, request, response, spider):
        print('m1.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        pass


class Md2(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    def process_request(self, request, spider):
        print('md2.process_request', request)

    def process_response(self, request, response, spider):
        print('m2.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        pass
Output:
chenjunkandeMBP:cjkscrapy chenjunkan$ scrapy crawl chouti --nolog
md1.process_request <GET http://dig.chouti.com/>
m2.process_response <GET http://dig.chouti.com/> <200 http://dig.chouti.com/>
m1.process_response <GET http://dig.chouti.com/> <200 http://dig.chouti.com/>
response <200 http://dig.chouti.com/>
As the output shows, md2.process_request never runs: returning a Response from process_request fakes the download result and hands it straight to the process_response chain; the real downloader is skipped.
b. Returning a Request (the new request goes back to the scheduler; the original one is never downloaded)
from scrapy.http import HtmlResponse
from scrapy.http import Request


class Md1(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    def process_request(self, request, spider):
        print('md1.process_request', request)
        # 2. Return a Request
        return Request('https://dig.chouti.com/r/tec/hot/1')

    def process_response(self, request, response, spider):
        print('m1.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        pass


class Md2(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    def process_request(self, request, spider):
        print('md2.process_request', request)

    def process_response(self, request, response, spider):
        print('m2.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        pass
Output:
chenjunkandeMBP:cjkscrapy chenjunkan$ scrapy crawl chouti --nolog
md1.process_request <GET http://dig.chouti.com/>
md1.process_request <GET https://dig.chouti.com/r/tec/hot/1>
c. Raising an exception (a process_exception method must exist to handle it)
from scrapy.http import HtmlResponse
from scrapy.http import Request


class Md1(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    def process_request(self, request, spider):
        print('md1.process_request', request)
        # 3. Raise an exception
        from scrapy.exceptions import IgnoreRequest
        raise IgnoreRequest

    def process_response(self, request, response, spider):
        print('m1.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        pass


class Md2(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    def process_request(self, request, spider):
        print('md2.process_request', request)

    def process_response(self, request, response, spider):
        print('m2.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        pass
Output:
chenjunkandeMBP:cjkscrapy chenjunkan$ scrapy crawl chouti --nolog
md1.process_request <GET http://dig.chouti.com/>
d. The cases above are rarely used in practice; the common job of a downloader middleware is to decorate the request, e.g. adding cookies or headers:
from scrapy.http import HtmlResponse
from scrapy.http import Request


class Md1(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    def process_request(self, request, spider):
        print('md1.process_request', request)
        request.headers['user-agent'] = (
            "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36")

    def process_response(self, request, response, spider):
        print('m1.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        pass


class Md2(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    def process_request(self, request, spider):
        print('md2.process_request', request)

    def process_response(self, request, response, spider):
        print('m2.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        pass
Output:
chenjunkandeMBP:cjkscrapy chenjunkan$ scrapy crawl chouti --nolog
md1.process_request <GET http://dig.chouti.com/>
md2.process_request <GET http://dig.chouti.com/>
m2.process_response <GET http://dig.chouti.com/> <301 http://dig.chouti.com/>
m1.process_response <GET http://dig.chouti.com/> <301 http://dig.chouti.com/>
md1.process_request <GET https://dig.chouti.com/>
md2.process_request <GET https://dig.chouti.com/>
m2.process_response <GET https://dig.chouti.com/> <200 https://dig.chouti.com/>
m1.process_response <GET https://dig.chouti.com/> <200 https://dig.chouti.com/>
response <200 https://dig.chouti.com/>
We don't need to write this particular one ourselves, though; Scrapy already ships it:
Path: /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/downloadermiddlewares/useragent.py
Source:
"""Set User-Agent header per spider or use a default value from settings""" from scrapy import signals class UserAgentMiddleware(object): """This middleware allows spiders to override the user_agent""" def __init__(self, user_agent=‘Scrapy‘): self.user_agent = user_agent @classmethod def from_crawler(cls, crawler): o = cls(crawler.settings[‘USER_AGENT‘]) crawler.signals.connect(o.spider_opened, signal=signals.spider_opened) return o def spider_opened(self, spider): self.user_agent = getattr(spider, ‘user_agent‘, self.user_agent) def process_request(self, request, spider): if self.user_agent: request.headers.setdefault(b‘User-Agent‘, self.user_agent)
13. Spider middleware
Source (the default template):
class CjkscrapySpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
Custom spider middleware:
Register them in settings:
SPIDER_MIDDLEWARES = {
    # 'cjkscrapy.middlewares.CjkscrapySpiderMiddleware': 543,
    'cjkscrapy.sd.Sd1': 667,
    'cjkscrapy.sd.Sd2': 668,
}
sd.py:
class Sd1(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    # Runs only once, when the spider starts.
    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r


class Sd2(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    # Runs only once, when the spider starts.
    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r
Normally we leave these alone and use the built-in behavior. Note the call order in the spider middleware chain: every process_spider_input method runs first, and only then do the process_spider_output methods run.
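That ordering can be sketched with toy classes (hypothetical stand-ins for Sd1/Sd2, not Scrapy's actual engine loop): input hooks run in ascending priority order, output hooks in reverse.

```python
calls = []

class Sd1:
    def process_spider_input(self, response):
        calls.append('sd1.in')

    def process_spider_output(self, response, result):
        calls.append('sd1.out')
        return result

class Sd2:
    def process_spider_input(self, response):
        calls.append('sd2.in')

    def process_spider_output(self, response, result):
        calls.append('sd2.out')
        return result

middlewares = [Sd1(), Sd2()]        # sorted by priority (667 before 668)
for mw in middlewares:              # input hooks: top to bottom
    mw.process_spider_input(response=None)

result = ['item']
for mw in reversed(middlewares):    # output hooks: bottom to top
    result = mw.process_spider_output(response=None, result=result)

print(calls)  # ['sd1.in', 'sd2.in', 'sd2.out', 'sd1.out']
```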
14. Custom commands
So far we have always launched spiders from the command line, which gets tedious. Instead, we can write a script at the project root that runs those commands for us:
start.py:
import sys
from scrapy.cmdline import execute

if __name__ == '__main__':
    execute(["scrapy", "crawl", "chouti", "--nolog"])  # run a single spider
Defining a brand-new command:
- Create a directory (e.g. commands) at the same level as spiders
- Inside it, create crawlall.py (the file name becomes the command name)
from scrapy.commands import ScrapyCommand
from scrapy.utils.project import get_project_settings


class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        spider_list = self.crawler_process.spiders.list()
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)
        self.crawler_process.start()
- In settings.py add: COMMANDS_MODULE = '<project name>.<directory name>', e.g. COMMANDS_MODULE = "cjkscrapy.commands"
- From the project directory, run: scrapy crawlall
Running scrapy --help now shows the command we added:
chenjunkandeMBP:cjkscrapy chenjunkan$ scrapy --help
Scrapy 1.5.1 - project: cjkscrapy

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  crawlall      Runs all of the spiders
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy
15. Scrapy signals
Create a new file, ext.py:
from scrapy import signals


class MyExtend(object):
    def __init__(self):
        pass

    @classmethod
    def from_crawler(cls, crawler):
        self = cls()
        crawler.signals.connect(self.x1, signal=signals.spider_opened)
        crawler.signals.connect(self.x2, signal=signals.spider_closed)
        return self

    def x1(self, spider):
        print('open')

    def x2(self, spider):
        print('close')
Register it in settings:
EXTENSIONS = {
    'scrapy.extensions.telnet.TelnetConsole': None,
    'cjkscrapy.ext.MyExtend': 666,
}
Here crawler.signals.connect(self.x1, signal=signals.spider_opened) registers x1 as a handler for the spider_opened signal. The available signals (from scrapy/signals.py):
# engine signals
engine_started = object()
engine_stopped = object()

# spider signals
spider_opened = object()
spider_idle = object()        # fired when the spider goes idle
spider_closed = object()
spider_error = object()

# request signals
request_scheduled = object()  # a request was handed to the scheduler
request_dropped = object()    # a request was dropped

# response signals
response_received = object()
response_downloaded = object()

# item signals
item_scraped = object()
item_dropped = object()
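Conceptually, connect just files a callback under a signal object, and the engine later fires every callback registered for that signal. A minimal stand-in dispatcher (hypothetical; Scrapy's real implementation is built on pydispatch):

```python
spider_opened = object()
spider_closed = object()

class Signals:
    def __init__(self):
        self._handlers = {}

    def connect(self, func, signal):
        self._handlers.setdefault(signal, []).append(func)

    def send(self, signal, **kwargs):
        # call every handler registered for this signal
        return [func(**kwargs) for func in self._handlers.get(signal, [])]

signals = Signals()
signals.connect(lambda spider: 'open:%s' % spider, signal=spider_opened)
signals.connect(lambda spider: 'close:%s' % spider, signal=spider_closed)
print(signals.send(spider_opened, spider='chouti'))  # ['open:chouti']
```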
Output:
chenjunkandeMBP:cjkscrapy chenjunkan$ scrapy crawl chouti --nolog
open
response <200 https://dig.chouti.com/>
close
16. scrapy_redis
What is scrapy_redis for? It is the component that lets us build distributed crawlers on top of Scrapy.
Basic redis operations:
import redis

conn = redis.Redis(host='140.143.227.206', port=8888, password='beta')
keys = conn.keys()
print(keys)
print(conn.smembers('dupefilter:xiaodongbei'))
# print(conn.smembers('visited_urls'))
# v1 = conn.sadd('urls', 'http://www.baidu.com')
# v2 = conn.sadd('urls', 'http://www.cnblogs.com')
# print(v1)
# print(v2)
# v3 = conn.sadd('urls', 'http://www.bing.com')
# print(v3)

# result = conn.sadd('urls', 'http://www.bing.com')
# if result == 1:
#     print('not visited before')
# else:
#     print('visited before')
# print(conn.smembers('urls'))
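The dedup trick in the commented lines relies on sadd returning 1 for a new member and 0 for an existing one. Those semantics can be mimicked with a plain set, no redis server required (hypothetical helper):

```python
def sadd(store, key, member):
    # mimic redis SADD: 1 if the member was added, 0 if it already existed
    bucket = store.setdefault(key, set())
    if member in bucket:
        return 0
    bucket.add(member)
    return 1

store = {}
print(sadd(store, 'urls', 'http://www.bing.com'))  # 1 -> not visited before
print(sadd(store, 'urls', 'http://www.bing.com'))  # 0 -> visited already
```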
(1) Deduplication with scrapy_redis
Example:
Approach 1: fully custom
Custom dedup rule: put it in any file, e.g. xxx.py:
from scrapy.dupefilter import BaseDupeFilter
from scrapy.utils.request import request_fingerprint
import redis
import scrapy_redis


class DupFilter(BaseDupeFilter):
    def __init__(self):
        self.conn = redis.Redis(host='140.143.227.206', port=8888, password='beta')

    def request_seen(self, request):
        """
        Check whether this request has already been visited.
        :param request:
        :return: True if seen before; False otherwise
        """
        fid = request_fingerprint(request)
        result = self.conn.sadd('visited_urls', fid)
        if result == 1:
            return False
        return True
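request_fingerprint boils a request down to a stable hash. A rough sketch of the idea (the real function also canonicalizes the URL and can include headers and body; this simplification only covers method + URL):

```python
import hashlib

def fingerprint(method, url):
    # stable hash: identical requests always map to the same value
    h = hashlib.sha1()
    h.update(method.encode())
    h.update(url.encode())
    return h.hexdigest()

fp1 = fingerprint('GET', 'https://dig.chouti.com/')
fp2 = fingerprint('GET', 'https://dig.chouti.com/')
print(fp1 == fp2)  # True
```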
In settings:
DUPEFILTER_CLASS = 'dbd.xxx.DupFilter'
Approach 2: customize on top of scrapy_redis
Let's look at the source, starting from the import (import scrapy_redis):
Path: /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy_redis/dupefilter.py
The RFPDupeFilter class also inherits from BaseDupeFilter and works much like our Approach 1:
import logging
import time

from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint

from . import defaults
from .connection import get_redis_from_settings


logger = logging.getLogger(__name__)


# TODO: Rename class to RedisDupeFilter.
class RFPDupeFilter(BaseDupeFilter):
    """Redis-based request duplicates filter.

    This class can also be used with default Scrapy's scheduler.

    """

    logger = logger

    def __init__(self, server, key, debug=False):
        """Initialize the duplicates filter.

        Parameters
        ----------
        server : redis.StrictRedis
            The redis server instance.
        key : str
            Redis key Where to store fingerprints.
        debug : bool, optional
            Whether to log filtered requests.

        """
        self.server = server
        self.key = key
        self.debug = debug
        self.logdupes = True

    @classmethod
    def from_settings(cls, settings):
        """Returns an instance from given settings.

        This uses by default the key ``dupefilter:<timestamp>``. When using the
        ``scrapy_redis.scheduler.Scheduler`` class, this method is not used as
        it needs to pass the spider name in the key.

        Parameters
        ----------
        settings : scrapy.settings.Settings

        Returns
        -------
        RFPDupeFilter
            A RFPDupeFilter instance.

        """
        server = get_redis_from_settings(settings)
        # XXX: This creates one-time key. needed to support to use this
        # class as standalone dupefilter with scrapy's default scheduler
        # if scrapy passes spider on open() method this wouldn't be needed
        # TODO: Use SCRAPY_JOB env as default and fallback to timestamp.
        key = defaults.DUPEFILTER_KEY % {'timestamp': int(time.time())}
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(server, key=key, debug=debug)

    @classmethod
    def from_crawler(cls, crawler):
        """Returns instance from crawler.

        Parameters
        ----------
        crawler : scrapy.crawler.Crawler

        Returns
        -------
        RFPDupeFilter
            Instance of RFPDupeFilter.

        """
        return cls.from_settings(crawler.settings)

    def request_seen(self, request):
        """Returns True if request was already seen.

        Parameters
        ----------
        request : scrapy.http.Request

        Returns
        -------
        bool

        """
        fp = self.request_fingerprint(request)
        # This returns the number of values added, zero if already exists.
        added = self.server.sadd(self.key, fp)
        return added == 0

    def request_fingerprint(self, request):
        """Returns a fingerprint for a given request.

        Parameters
        ----------
        request : scrapy.http.Request

        Returns
        -------
        str

        """
        return request_fingerprint(request)

    def close(self, reason=''):
        """Delete data on close. Called by Scrapy's scheduler.

        Parameters
        ----------
        reason : str, optional

        """
        self.clear()

    def clear(self):
        """Clears fingerprints data."""
        self.server.delete(self.key)

    def log(self, request, spider):
        """Logs given request.

        Parameters
        ----------
        request : scrapy.http.Request
        spider : scrapy.spiders.Spider

        """
        if self.debug:
            msg = "Filtered duplicate request: %(request)s"
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
        elif self.logdupes:
            msg = ("Filtered duplicate request %(request)s"
                   " - no more duplicates will be shown"
                   " (see DUPEFILTER_DEBUG to show all duplicates)")
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
            self.logdupes = False
This class likewise deduplicates via request_seen:
def request_seen(self, request):
    """Returns True if request was already seen.

    Parameters
    ----------
    request : scrapy.http.Request

    Returns
    -------
    bool

    """
    fp = self.request_fingerprint(request)
    # This returns the number of values added, zero if already exists.
    added = self.server.sadd(self.key, fp)
    return added == 0
Explanation:
1. fp = self.request_fingerprint(request): builds the unique fingerprint for the request.

2. added = self.server.sadd(self.key, fp): self.server is the redis connection. It is built through the following chain:

(1) from_settings:

@classmethod
def from_settings(cls, settings):
    server = get_redis_from_settings(settings)
    # XXX: This creates one-time key. needed to support to use this
    # class as standalone dupefilter with scrapy's default scheduler
    # if scrapy passes spider on open() method this wouldn't be needed
    # TODO: Use SCRAPY_JOB env as default and fallback to timestamp.
    key = defaults.DUPEFILTER_KEY % {'timestamp': int(time.time())}
    debug = settings.getbool('DUPEFILTER_DEBUG')
    return cls(server, key=key, debug=debug)

(2) get_redis_from_settings:

def get_redis_from_settings(settings):
    """Returns a redis client instance from given Scrapy settings object.

    This function uses ``get_client`` to instantiate the client and uses
    ``defaults.REDIS_PARAMS`` global as defaults values for the parameters. You
    can override them using the ``REDIS_PARAMS`` setting.

    Supported settings: REDIS_URL, REDIS_HOST, REDIS_PORT, REDIS_ENCODING,
    REDIS_PARAMS.
    """
    params = defaults.REDIS_PARAMS.copy()
    params.update(settings.getdict('REDIS_PARAMS'))
    # XXX: Deprecate REDIS_* settings.
    for source, dest in SETTINGS_PARAMS_MAP.items():
        val = settings.get(source)
        if val:
            params[dest] = val

    # Allow ``redis_cls`` to be a path to a class.
    if isinstance(params.get('redis_cls'), six.string_types):
        params['redis_cls'] = load_object(params['redis_cls'])

    return get_redis(**params)

(3) get_redis:

def get_redis(**kwargs):
    """Returns a redis client instance.

    Parameters
    ----------
    redis_cls : class, optional
        Defaults to ``redis.StrictRedis``.
    url : str, optional
        If given, ``redis_cls.from_url`` is used to instantiate the class.
    **kwargs
        Extra parameters to be passed to the ``redis_cls`` class.
    """
    redis_cls = kwargs.pop('redis_cls', defaults.REDIS_CLS)
    url = kwargs.pop('url', None)
    if url:
        return redis_cls.from_url(url, **kwargs)
    else:
        return redis_cls(**kwargs)

(4) redis_cls = kwargs.pop('redis_cls', defaults.REDIS_CLS), where REDIS_CLS = redis.StrictRedis.

So the settings file must supply the connection parameters:

# ############### scrapy redis connection ####################

REDIS_HOST = '140.143.227.206'       # host
REDIS_PORT = 8888                    # port
REDIS_PARAMS = {'password': 'beta'}  # connection params; default: {'socket_timeout': 30, 'socket_connect_timeout': 30, 'retry_on_timeout': True, 'encoding': REDIS_ENCODING}
REDIS_ENCODING = "utf-8"             # encoding, default 'utf-8'

# REDIS_URL = 'redis://user:[email protected]:9001'  # connection URL (takes precedence over the above)
DUPEFILTER_KEY = 'dupefilter:%(timestamp)s'

self.key comes from key = defaults.DUPEFILTER_KEY % {'timestamp': int(time.time())}.

3. return added == 0: if sadd returned 0 the fingerprint already existed, so the request was seen before and request_seen returns True; if it returned 1 the request is new and False is returned.
Example:
from scrapy_redis.dupefilter import RFPDupeFilter
from scrapy_redis.connection import get_redis_from_settings
from scrapy_redis import defaults


class RedisDupeFilter(RFPDupeFilter):
    @classmethod
    def from_settings(cls, settings):
        """Returns an instance from given settings.

        This uses by default the key ``dupefilter:<timestamp>``. When using the
        ``scrapy_redis.scheduler.Scheduler`` class, this method is not used as
        it needs to pass the spider name in the key.

        Parameters
        ----------
        settings : scrapy.settings.Settings

        Returns
        -------
        RFPDupeFilter
            A RFPDupeFilter instance.

        """
        server = get_redis_from_settings(settings)
        key = defaults.DUPEFILTER_KEY % {'timestamp': 'chenjunkan'}
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(server, key=key, debug=debug)
The key defaults to a timestamp; here we hard-code it instead: key = defaults.DUPEFILTER_KEY % {'timestamp': 'chenjunkan'}.
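The key template comes from defaults.DUPEFILTER_KEY; pinning the timestamp placeholder just changes the formatted redis key:

```python
DUPEFILTER_KEY = 'dupefilter:%(timestamp)s'  # same template as the scrapy_redis default

print(DUPEFILTER_KEY % {'timestamp': 'chenjunkan'})  # dupefilter:chenjunkan

import time
print(DUPEFILTER_KEY % {'timestamp': int(time.time())})  # e.g. dupefilter:1546685653
```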
Finally, in the settings file:
# ############### scrapy redis connection ####################

REDIS_HOST = '140.143.227.206'       # host
REDIS_PORT = 8888                    # port
REDIS_PARAMS = {'password': 'beta'}  # connection params; default: {'socket_timeout': 30, 'socket_connect_timeout': 30, 'retry_on_timeout': True, 'encoding': REDIS_ENCODING}
REDIS_ENCODING = "utf-8"             # encoding, default 'utf-8'

# REDIS_URL = 'redis://user:[email protected]:9001'  # connection URL (takes precedence over the above)
DUPEFILTER_KEY = 'dupefilter:%(timestamp)s'

# DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
DUPEFILTER_CLASS = 'cjkscrapy.xxx.RedisDupeFilter'
(2) scrapy_redis queues
Implementing queues and stacks on top of redis:
Source: import scrapy_redis
Path: /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy_redis/queue.py
queue.py:
from scrapy.utils.reqser import request_to_dict, request_from_dict

from . import picklecompat


class Base(object):
    """Per-spider base queue class"""

    def __init__(self, server, spider, key, serializer=None):
        """Initialize per-spider redis queue.

        Parameters
        ----------
        server : StrictRedis
            Redis client instance.
        spider : Spider
            Scrapy spider instance.
        key: str
            Redis key where to put and get messages.
        serializer : object
            Serializer object with ``loads`` and ``dumps`` methods.

        """
        if serializer is None:
            # Backward compatibility.
            # TODO: deprecate pickle.
            serializer = picklecompat
        if not hasattr(serializer, 'loads'):
            raise TypeError("serializer does not implement 'loads' function: %r"
                            % serializer)
        if not hasattr(serializer, 'dumps'):
            raise TypeError("serializer '%s' does not implement 'dumps' function: %r"
                            % serializer)

        self.server = server
        self.spider = spider
        self.key = key % {'spider': spider.name}
        self.serializer = serializer

    def _encode_request(self, request):
        """Encode a request object"""
        obj = request_to_dict(request, self.spider)
        return self.serializer.dumps(obj)

    def _decode_request(self, encoded_request):
        """Decode an request previously encoded"""
        obj = self.serializer.loads(encoded_request)
        return request_from_dict(obj, self.spider)

    def __len__(self):
        """Return the length of the queue"""
        raise NotImplementedError

    def push(self, request):
        """Push a request"""
        raise NotImplementedError

    def pop(self, timeout=0):
        """Pop a request"""
        raise NotImplementedError

    def clear(self):
        """Clear queue/stack"""
        self.server.delete(self.key)


class FifoQueue(Base):
    """Per-spider FIFO queue"""

    def __len__(self):
        """Return the length of the queue"""
        return self.server.llen(self.key)

    def push(self, request):
        """Push a request"""
        self.server.lpush(self.key, self._encode_request(request))

    def pop(self, timeout=0):
        """Pop a request"""
        if timeout > 0:
            data = self.server.brpop(self.key, timeout)
            if isinstance(data, tuple):
                data = data[1]
        else:
            data = self.server.rpop(self.key)
        if data:
            return self._decode_request(data)


class PriorityQueue(Base):
    """Per-spider priority queue abstraction using redis' sorted set"""

    def __len__(self):
        """Return the length of the queue"""
        return self.server.zcard(self.key)

    def push(self, request):
        """Push a request"""
        data = self._encode_request(request)
        score = -request.priority
        # We don't use zadd method as the order of arguments change depending on
        # whether the class is Redis or StrictRedis, and the option of using
        # kwargs only accepts strings, not bytes.
        self.server.execute_command('ZADD', self.key, score, data)

    def pop(self, timeout=0):
        """
        Pop a request
        timeout not support in this queue class
        """
        # use atomic range/remove using multi/exec
        pipe = self.server.pipeline()
        pipe.multi()
        pipe.zrange(self.key, 0, 0).zremrangebyrank(self.key, 0, 0)
        results, count = pipe.execute()
        if results:
            return self._decode_request(results[0])


class LifoQueue(Base):
    """Per-spider LIFO queue."""

    def __len__(self):
        """Return the length of the stack"""
        return self.server.llen(self.key)

    def push(self, request):
        """Push a request"""
        self.server.lpush(self.key, self._encode_request(request))

    def pop(self, timeout=0):
        """Pop a request"""
        if timeout > 0:
            data = self.server.blpop(self.key, timeout)
            if isinstance(data, tuple):
                data = data[1]
        else:
            data = self.server.lpop(self.key)

        if data:
            return self._decode_request(data)


# TODO: Deprecate the use of these names.
SpiderQueue = FifoQueue
SpiderStack = LifoQueue
SpiderPriorityQueue = PriorityQueue
Queue: first in, first out
Code example:
import redis


class FifoQueue(object):
    def __init__(self):
        self.server = redis.Redis(host='140.143.227.206', port=8888, password='beta')

    def push(self, request):
        """Push a request"""
        self.server.lpush('USERS', request)

    def pop(self, timeout=0):
        """Pop a request"""
        data = self.server.rpop('USERS')
        return data


# The Redis list after the three pushes: [33, 22, 11]
q = FifoQueue()
q.push(11)
q.push(22)
q.push(33)
print(q.pop())
print(q.pop())
print(q.pop())
Note: lpush inserts on the left and rpop removes from the right, i.e. first in, first out, which gives breadth-first crawling.
Stack: last in, first out:
import redis


class LifoQueue(object):
    """Per-spider LIFO queue."""

    def __init__(self):
        self.server = redis.Redis(host='140.143.227.206', port=8888, password='beta')

    def push(self, request):
        """Push a request"""
        self.server.lpush('USERS', request)

    def pop(self, timeout=0):
        """Pop a request"""
        data = self.server.lpop('USERS')
        return data


# The Redis list after three pushes: [33, 22, 11]
Note: lpush inserts on the left and lpop removes from the left, i.e. last in, first out, which gives depth-first crawling.
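The two pop directions can be demonstrated without a Redis server. This sketch models the Redis list with a `collections.deque` (index 0 is the "left" end, matching lpush):

```python
from collections import deque

# In-memory stand-in for the Redis list used above.
items = deque()

def lpush(x):
    items.appendleft(x)      # insert on the left, like LPUSH

def rpop():
    return items.pop()       # remove from the right: FIFO, breadth-first

def lpop():
    return items.popleft()   # remove from the left: LIFO, depth-first

for x in (11, 22, 33):
    lpush(x)                 # list is now [33, 22, 11]
fifo_order = [rpop(), rpop(), rpop()]   # [11, 22, 33]

for x in (11, 22, 33):
    lpush(x)
lifo_order = [lpop(), lpop(), lpop()]   # [33, 22, 11]
```

The only difference between the FifoQueue and LifoQueue above is exactly this choice of pop end.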
The zadd and zrange functions:
import redis

conn = redis.Redis(host='140.143.227.206', port=8888, password='beta')
conn.zadd('score', cjk=79, pyy=33, cc=73)  # kwargs form; note this signature is from older redis-py releases
print(conn.keys())

v = conn.zrange('score', 0, 8, desc=True)
print(v)

pipe = conn.pipeline()  # batch several commands into one round trip
pipe.multi()
# zrange: sorted ascending by score by default, take only the first element;
# zremrangebyrank: remove the first-ranked element
pipe.zrange('score', 0, 0).zremrangebyrank('score', 0, 0)
results, count = pipe.execute()
print(results, count)
zadd: stores the three members cjk=79, pyy=33, cc=73 in Redis under the key score.
zrange: returns the members of score within a rank range, sorted by score ascending; in v = conn.zrange('score', 0, 8, desc=True), setting desc to True returns them sorted by score descending instead.
Priority queue:
import redis


class PriorityQueue(object):
    """Per-spider priority queue abstraction using redis' sorted set"""

    def __init__(self):
        self.server = redis.Redis(host='140.143.227.206', port=8888, password='beta')

    def push(self, request, score):
        """Push a request"""
        # data = self._encode_request(request)
        # score = -request.priority
        # We don't use zadd method as the order of arguments change depending on
        # whether the class is Redis or StrictRedis, and the option of using
        # kwargs only accepts strings, not bytes.
        self.server.execute_command('ZADD', 'xxxxxx', score, request)

    def pop(self, timeout=0):
        """
        Pop a request
        timeout not support in this queue class
        """
        # use atomic range/remove using multi/exec
        pipe = self.server.pipeline()
        pipe.multi()
        pipe.zrange('xxxxxx', 0, 0).zremrangebyrank('xxxxxx', 0, 0)
        results, count = pipe.execute()
        if results:
            return results[0]


q = PriorityQueue()
q.push('alex', 99)
q.push('oldboy', 56)
q.push('eric', 77)
v1 = q.pop()
print(v1)
v2 = q.pop()
print(v2)
v3 = q.pop()
print(v3)
Note: if two entries have the same score, the member itself decides the order. Popping scores from small to large gives breadth-first crawling; popping from large to small gives depth-first.
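The tie-breaking rule can be illustrated in plain Python: a Redis sorted set ranks entries by score first, then byte-wise by member. A hypothetical sketch:

```python
# (member, score) pairs; 'alex' and 'cjk' share a score, so the member
# value breaks the tie, mirroring how a sorted set orders its entries.
entries = [(b'cjk', 50), (b'alex', 50), (b'pyy', 33)]

# Sort by (score, member): the same ordering ZRANGE would return.
ranked = sorted(entries, key=lambda e: (e[1], e[0]))
```

Here `ranked` puts `b'pyy'` first (lowest score), then `b'alex'` before `b'cjk'` because their scores are equal and `b'alex'` sorts lower byte-wise.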
(3) The scrapy_redis scheduler
The scheduler section of the settings file:
# ###################### Scheduler ######################
from scrapy_redis.scheduler import Scheduler
# The scrapy_redis scheduler does the dispatching:
# enqueue_request: add a task to the scheduler
# next_request: fetch one task from the scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Controls the order tasks are stored in
# Priority
DEPTH_PRIORITY = 1    # breadth-first
# DEPTH_PRIORITY = -1 # depth-first
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'  # priority queue by default; choices: PriorityQueue (sorted set), FifoQueue (list), LifoQueue (list)

# Breadth-first
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
# Depth-first
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'

"""
redis = {
    chouti:requests: [
        pickle.dumps(Request(url='Http://wwwww', callback=self.parse)),
        pickle.dumps(Request(url='Http://wwwww', callback=self.parse)),
        pickle.dumps(Request(url='Http://wwwww', callback=self.parse)),
    ],
    cnblogs:requests: [
    ]
}
"""
SCHEDULER_QUEUE_KEY = '%(spider)s:requests'  # Redis key under which the scheduler stores requests

SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"  # serializer for data saved to Redis; pickle by default

SCHEDULER_PERSIST = False  # keep the scheduler queue and dedup records on close? True=keep, False=flush

SCHEDULER_FLUSH_ON_START = True  # flush the scheduler queue and dedup records on start? True=flush, False=keep

# SCHEDULER_IDLE_BEFORE_CLOSE = 10  # when fetching from an empty scheduler, the maximum time to wait (give up if no data arrives)

SCHEDULER_DUPEFILTER_KEY = '%(spider)s:dupefilter'  # Redis key under which the dedup records are stored

SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'  # class implementing the dedup rules
Notes:
a. from scrapy_redis.scheduler import Scheduler: scheduling is handled by the scrapy_redis Scheduler:
import importlib
import six

from scrapy.utils.misc import load_object

from . import connection, defaults


# TODO: add SCRAPY_JOB support.
class Scheduler(object):
    """Redis-based scheduler

    Settings
    --------
    SCHEDULER_PERSIST : bool (default: False)
        Whether to persist or clear redis queue.
    SCHEDULER_FLUSH_ON_START : bool (default: False)
        Whether to flush redis queue on start.
    SCHEDULER_IDLE_BEFORE_CLOSE : int (default: 0)
        How many seconds to wait before closing if no message is received.
    SCHEDULER_QUEUE_KEY : str
        Scheduler redis key.
    SCHEDULER_QUEUE_CLASS : str
        Scheduler queue class.
    SCHEDULER_DUPEFILTER_KEY : str
        Scheduler dupefilter redis key.
    SCHEDULER_DUPEFILTER_CLASS : str
        Scheduler dupefilter class.
    SCHEDULER_SERIALIZER : str
        Scheduler serializer.

    """

    def __init__(self, server,
                 persist=False,
                 flush_on_start=False,
                 queue_key=defaults.SCHEDULER_QUEUE_KEY,
                 queue_cls=defaults.SCHEDULER_QUEUE_CLASS,
                 dupefilter_key=defaults.SCHEDULER_DUPEFILTER_KEY,
                 dupefilter_cls=defaults.SCHEDULER_DUPEFILTER_CLASS,
                 idle_before_close=0,
                 serializer=None):
        """Initialize scheduler.

        Parameters
        ----------
        server : Redis
            The redis server instance.
        persist : bool
            Whether to flush requests when closing. Default is False.
        flush_on_start : bool
            Whether to flush requests on start. Default is False.
        queue_key : str
            Requests queue key.
        queue_cls : str
            Importable path to the queue class.
        dupefilter_key : str
            Duplicates filter key.
        dupefilter_cls : str
            Importable path to the dupefilter class.
        idle_before_close : int
            Timeout before giving up.

        """
        if idle_before_close < 0:
            raise TypeError("idle_before_close cannot be negative")

        self.server = server
        self.persist = persist
        self.flush_on_start = flush_on_start
        self.queue_key = queue_key
        self.queue_cls = queue_cls
        self.dupefilter_cls = dupefilter_cls
        self.dupefilter_key = dupefilter_key
        self.idle_before_close = idle_before_close
        self.serializer = serializer
        self.stats = None

    def __len__(self):
        return len(self.queue)

    @classmethod
    def from_settings(cls, settings):
        kwargs = {
            'persist': settings.getbool('SCHEDULER_PERSIST'),
            'flush_on_start': settings.getbool('SCHEDULER_FLUSH_ON_START'),
            'idle_before_close': settings.getint('SCHEDULER_IDLE_BEFORE_CLOSE'),
        }

        # If these values are missing, it means we want to use the defaults.
        optional = {
            # TODO: Use custom prefixes for this settings to note that are
            # specific to scrapy-redis.
            'queue_key': 'SCHEDULER_QUEUE_KEY',
            'queue_cls': 'SCHEDULER_QUEUE_CLASS',
            'dupefilter_key': 'SCHEDULER_DUPEFILTER_KEY',
            # We use the default setting name to keep compatibility.
            'dupefilter_cls': 'DUPEFILTER_CLASS',
            'serializer': 'SCHEDULER_SERIALIZER',
        }
        for name, setting_name in optional.items():
            val = settings.get(setting_name)
            if val:
                kwargs[name] = val

        # Support serializer as a path to a module.
        if isinstance(kwargs.get('serializer'), six.string_types):
            kwargs['serializer'] = importlib.import_module(kwargs['serializer'])

        server = connection.from_settings(settings)
        # Ensure the connection is working.
        server.ping()

        return cls(server=server, **kwargs)

    @classmethod
    def from_crawler(cls, crawler):
        instance = cls.from_settings(crawler.settings)
        # FIXME: for now, stats are only supported from this constructor
        instance.stats = crawler.stats
        return instance

    def open(self, spider):
        self.spider = spider

        try:
            self.queue = load_object(self.queue_cls)(
                server=self.server,
                spider=spider,
                key=self.queue_key % {'spider': spider.name},
                serializer=self.serializer,
            )
        except TypeError as e:
            raise ValueError("Failed to instantiate queue class '%s': %s",
                             self.queue_cls, e)

        try:
            self.df = load_object(self.dupefilter_cls)(
                server=self.server,
                key=self.dupefilter_key % {'spider': spider.name},
                debug=spider.settings.getbool('DUPEFILTER_DEBUG'),
            )
        except TypeError as e:
            raise ValueError("Failed to instantiate dupefilter class '%s': %s",
                             self.dupefilter_cls, e)

        if self.flush_on_start:
            self.flush()
        # notice if there are requests already in the queue to resume the crawl
        if len(self.queue):
            spider.log("Resuming crawl (%d requests scheduled)" % len(self.queue))

    def close(self, reason):
        if not self.persist:
            self.flush()

    def flush(self):
        self.df.clear()
        self.queue.clear()

    def enqueue_request(self, request):
        if not request.dont_filter and self.df.request_seen(request):
            self.df.log(request, self.spider)
            return False
        if self.stats:
            self.stats.inc_value('scheduler/enqueued/redis', spider=self.spider)
        self.queue.push(request)
        return True

    def next_request(self):
        block_pop_timeout = self.idle_before_close
        request = self.queue.pop(block_pop_timeout)
        if request and self.stats:
            self.stats.inc_value('scheduler/dequeued/redis', spider=self.spider)
        return request

    def has_pending_requests(self):
        return len(self) > 0
The two most important methods of this class are enqueue_request and next_request.
While the crawler is running there is a single scheduler. Adding a task means calling enqueue_request to put a request into the queue; when the downloader comes to fetch a task, next_request is called. Both paths therefore go through the queue.
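The enqueue/next cycle plus the dupefilter can be sketched in memory. This is a toy stand-in, not the real scrapy_redis Scheduler: a deque plays the Redis list, a set plays the dupefilter, and requests are plain dicts:

```python
import pickle
from collections import deque

class InMemoryScheduler:
    """Toy sketch of scrapy_redis.scheduler.Scheduler: a FIFO queue plus a
    dedup record. Method names mirror the real ones; storage is local."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()   # plays the role of the dupefilter

    def enqueue_request(self, request):
        if request['url'] in self.seen:               # already visited: drop it
            return False
        self.seen.add(request['url'])
        self.queue.appendleft(pickle.dumps(request))  # serialize, like lpush
        return True

    def next_request(self):
        if not self.queue:
            return None
        return pickle.loads(self.queue.pop())         # like rpop (FIFO)

s = InMemoryScheduler()
s.enqueue_request({'url': 'http://example.com/1'})
s.enqueue_request({'url': 'http://example.com/1'})    # duplicate, filtered out
s.enqueue_request({'url': 'http://example.com/2'})
```

Calling `next_request` now returns the two unique requests in FIFO order, then None once the queue is empty.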
b. SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue': the priority queue is used by default.
The queue is simply that list in Redis. The list needs a key, given by SCHEDULER_QUEUE_KEY = '%(spider)s:requests', where %(spider)s is the current spider's name. And since a Redis list can only hold strings, requests must be serialized first; that is what SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat" configures (pickle by default):
1 """A pickle wrapper module with protocol=-1 by default.""" 2 3 try: 4 import cPickle as pickle # PY2 5 except ImportError: 6 import pickle 7 8 9 def loads(s): 10 return pickle.loads(s) 11 12 13 def dumps(obj): 14 return pickle.dumps(obj, protocol=-1)
c. SCHEDULER_IDLE_BEFORE_CLOSE = 10: when fetching from the scheduler and it is empty, the maximum time to wait before giving up with no data, i.e. a blocking pop:
import redis

conn = redis.Redis(host='140.143.227.206', port=8888, password='beta')
# conn.flushall()
print(conn.keys())  # chouti:dupefilter / chouti:request

# conn.lpush('xxx:request', 'http://wwww.xxx.com')
# conn.lpush('xxx:request', 'http://wwww.xxx1.com')
# print(conn.lpop('xxx:request'))
# print(conn.blpop('xxx:request', timeout=10))
The overall execution flow:
1. Run scrapy crawl chouti --nolog

2. Find the SCHEDULER = "scrapy_redis.scheduler.Scheduler" setting and instantiate the scheduler object
   - Scheduler.from_crawler runs:
        @classmethod
        def from_crawler(cls, crawler):
            instance = cls.from_settings(crawler.settings)
            # FIXME: for now, stats are only supported from this constructor
            instance.stats = crawler.stats
            return instance
   - Scheduler.from_settings runs:
        @classmethod
        def from_settings(cls, settings):
            kwargs = {
                'persist': settings.getbool('SCHEDULER_PERSIST'),
                'flush_on_start': settings.getbool('SCHEDULER_FLUSH_ON_START'),
                'idle_before_close': settings.getint('SCHEDULER_IDLE_BEFORE_CLOSE'),
            }

            # If these values are missing, it means we want to use the defaults.
            optional = {
                # TODO: Use custom prefixes for this settings to note that are
                # specific to scrapy-redis.
                'queue_key': 'SCHEDULER_QUEUE_KEY',
                'queue_cls': 'SCHEDULER_QUEUE_CLASS',
                'dupefilter_key': 'SCHEDULER_DUPEFILTER_KEY',
                # We use the default setting name to keep compatibility.
                'dupefilter_cls': 'DUPEFILTER_CLASS',
                'serializer': 'SCHEDULER_SERIALIZER',
            }
            for name, setting_name in optional.items():
                val = settings.get(setting_name)
                if val:
                    kwargs[name] = val

            # Support serializer as a path to a module.
            if isinstance(kwargs.get('serializer'), six.string_types):
                kwargs['serializer'] = importlib.import_module(kwargs['serializer'])

            server = connection.from_settings(settings)
            # Ensure the connection is working.
            server.ping()

            return cls(server=server, **kwargs)
   - Read from the settings file:
        SCHEDULER_PERSIST           # keep the scheduler queue and dedup records on close? True=keep, False=flush
        SCHEDULER_FLUSH_ON_START    # flush the scheduler queue and dedup records on start? True=flush, False=keep
        SCHEDULER_IDLE_BEFORE_CLOSE # maximum time to wait when fetching from an empty scheduler
   - Read from the settings file:
        SCHEDULER_QUEUE_KEY         # %(spider)s:requests
        SCHEDULER_QUEUE_CLASS       # scrapy_redis.queue.FifoQueue
        SCHEDULER_DUPEFILTER_KEY    # '%(spider)s:dupefilter'
        DUPEFILTER_CLASS            # 'scrapy_redis.dupefilter.RFPDupeFilter'
        SCHEDULER_SERIALIZER        # "scrapy_redis.picklecompat"
   - Read from the settings file:
        REDIS_HOST = '140.143.227.206'       # host
        REDIS_PORT = 8888                    # port
        REDIS_PARAMS = {'password': 'beta'}  # Redis connection parameters; default: {'socket_timeout': 30, 'socket_connect_timeout': 30, 'retry_on_timeout': True, 'encoding': REDIS_ENCODING}
        REDIS_ENCODING = "utf-8"
   - Instantiate the Scheduler object

3. The spider starts executing its start URLs
   - scheduler.enqueue_request() is called:
        def enqueue_request(self, request):
            # Should the request be filtered?
            # Is it already in the dedup records, i.e. already visited? If not, add it.
            if not request.dont_filter and self.df.request_seen(request):
                self.df.log(request, self.spider)
                # Already visited, do not visit again
                return False
            if self.stats:
                self.stats.inc_value('scheduler/enqueued/redis', spider=self.spider)
            # print('not visited yet, adding to the scheduler', request)
            self.queue.push(request)
            return True

4. The downloader fetches a task from the scheduler and downloads it
   - scheduler.next_request() is called:
        def next_request(self):
            block_pop_timeout = self.idle_before_close
            request = self.queue.pop(block_pop_timeout)
            if request and self.stats:
                self.stats.inc_value('scheduler/dequeued/redis', spider=self.spider)
            return request
A second way to choose between depth-first and breadth-first:
# Controls the order tasks are stored in
# Priority
DEPTH_PRIORITY = 1    # breadth-first
# DEPTH_PRIORITY = -1 # depth-first
Note: as the crawl goes deeper, Scrapy lowers each request's priority with request.priority -= depth * self.prio. With DEPTH_PRIORITY = 1 the priority shrinks with depth, and since scrapy_redis stores score = -request.priority, the score grows with depth; popping scores from small to large therefore serves shallow requests first, i.e. breadth-first. With DEPTH_PRIORITY = -1 the relationship flips and you get depth-first.
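The sign flip can be traced with a few lines of arithmetic. This is a simplified sketch (it assumes a starting priority of 0 and collapses Scrapy's adjustment into one function):

```python
def score_for_depth(depth, depth_priority):
    # Scrapy lowers priority by depth * DEPTH_PRIORITY...
    priority = 0 - depth * depth_priority
    # ...and scrapy_redis stores score = -request.priority in the sorted set.
    return -priority

# DEPTH_PRIORITY = 1: deeper pages get larger scores; popping the smallest
# score first serves shallow pages first -> breadth-first.
bfs_scores = [score_for_depth(d, 1) for d in (0, 1, 2)]    # [0, 1, 2]

# DEPTH_PRIORITY = -1: deeper pages get smaller scores; the same pop order
# now serves the deepest pages first -> depth-first.
dfs_scores = [score_for_depth(d, -1) for d in (0, 1, 2)]   # [0, -1, -2]
```

Either way the PriorityQueue pops the smallest score first; only the sign of DEPTH_PRIORITY decides whether that smallest score belongs to a shallow or a deep request.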
How do the scheduler, the queue and the dupefilter relate in scrapy?
- Scheduler: dispatches, i.e. adds and hands out requests.
- Queue: stores the requests.
- Dupefilter: the visit records.
Note:
# DUPEFILTER_CLASS takes precedence; SCHEDULER_DUPEFILTER_CLASS is used only if it is absent
SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'  # class implementing the dedup rules
(4).RedisSpider
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
import scrapy_redis
from scrapy_redis.spiders import RedisSpider


class ChoutiSpider(RedisSpider):
    name = 'chouti'
    allowed_domains = ['chouti.com']

    def parse(self, response):
        print(response)
Note: with from scrapy_redis.spiders import RedisSpider, our spider inherits from RedisSpider:
class RedisSpider(RedisMixin, Spider):
    """Spider that reads urls from redis queue when idle.

    Attributes
    ----------
    redis_key : str (default: REDIS_START_URLS_KEY)
        Redis key where to fetch start URLs from..
    redis_batch_size : int (default: CONCURRENT_REQUESTS)
        Number of messages to fetch from redis on each attempt.
    redis_encoding : str (default: REDIS_ENCODING)
        Encoding to use when decoding messages from redis queue.

    Settings
    --------
    REDIS_START_URLS_KEY : str (default: "<spider.name>:start_urls")
        Default Redis key where to fetch start URLs from..
    REDIS_START_URLS_BATCH_SIZE : int (deprecated by CONCURRENT_REQUESTS)
        Default number of messages to fetch from redis on each attempt.
    REDIS_START_URLS_AS_SET : bool (default: False)
        Use SET operations to retrieve messages from the redis queue. If False,
        the messages are retrieve using the LPOP command.
    REDIS_ENCODING : str (default: "utf-8")
        Default encoding to use when decoding messages from redis queue.

    """

    @classmethod
    def from_crawler(self, crawler, *args, **kwargs):
        obj = super(RedisSpider, self).from_crawler(crawler, *args, **kwargs)
        obj.setup_redis(crawler)
        return obj
Whether the start URLs are fetched from a Redis list or from a set is controlled by:

fetch_one = self.server.spop if use_set else self.server.lpop
In the settings file:

REDIS_START_URLS_AS_SET = False
import redis

conn = redis.Redis(host='140.143.227.206', port=8888, password='beta')
# Feed the spider a start URL: each link pushed here gets executed as it arrives,
# so requests can be fed in continuously
conn.lpush('chouti:start_urls', 'https://dig.chouti.com/r/pic/hot/1')
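The fetch_one switch can be sketched without a Redis server; the flag simply selects between set-style SPOP (no defined order) and list-style LPOP (take from the left). A minimal in-memory sketch:

```python
# Stand-ins for the Redis structures holding the start URLs.
start_urls_list = ['https://dig.chouti.com/r/pic/hot/1',
                   'https://dig.chouti.com/r/pic/hot/2']
start_urls_set = set(start_urls_list)

REDIS_START_URLS_AS_SET = False  # the setting shown above

def fetch_one():
    if REDIS_START_URLS_AS_SET:
        return start_urls_set.pop()    # like SPOP: arbitrary member
    return start_urls_list.pop(0)      # like LPOP: leftmost element

url = fetch_one()  # with the flag False, this is the first pushed URL
```

With the flag False the fetch is ordered (the first URL pushed is the first crawled); with it True, duplicates are collapsed but order is not guaranteed.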
(5).The settings file
1. BOT_NAME = 'cjk'  # the crawler name
2. SPIDER_MODULES = ['cjk.spiders']  # directory the spiders live in
3. NEWSPIDER_MODULE = 'cjk.spiders'  # directory new spiders are created in
4. USER_AGENT = 'dbd (+http://www.yourdomain.com)'  # request header
5. ROBOTSTXT_OBEY = False  # True means honor the site's robots.txt crawling rules; False means ignore them
6. CONCURRENT_REQUESTS = 32  # total concurrent requests across all spiders; with two spiders one might get 15 and the other 17
7. DOWNLOAD_DELAY = 3  # wait 3 seconds between page downloads
8. CONCURRENT_REQUESTS_PER_DOMAIN = 16  # 16 concurrent requests per domain, or put differently, per spider
9. CONCURRENT_REQUESTS_PER_IP = 16  # a domain may resolve to several IPs; with 2 IPs the total concurrency becomes 16*2=32
10. COOKIES_ENABLED = False  # whether Scrapy handles cookies internally
11. TELNETCONSOLE_ENABLED = True  # lets you pause a running crawler and then resume it
    # TELNETCONSOLE_HOST = '127.0.0.1'
    # TELNETCONSOLE_PORT = [6023,]
12. DEFAULT_REQUEST_HEADERS  # default request headers
13. Auto-throttling:
    # AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    # AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    # AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    # AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    # AUTOTHROTTLE_DEBUG = False
    from scrapy.contrib.throttle import AutoThrottle
    """
    The auto-throttling algorithm (scrapy.contrib.throttle.AutoThrottle):
    1. Get the minimum delay: DOWNLOAD_DELAY
    2. Get the maximum delay: AUTOTHROTTLE_MAX_DELAY
    3. Set the initial download delay: AUTOTHROTTLE_START_DELAY
    4. When a request finishes downloading, take its "connection" time latency,
       i.e. the time between opening the connection and receiving the response headers
    5. Compute the new delay using AUTOTHROTTLE_TARGET_CONCURRENCY:
       target_delay = latency / self.target_concurrency
       new_delay = (slot.delay + target_delay) / 2.0  # slot.delay is the previous delay
       new_delay = max(target_delay, new_delay)
       new_delay = min(max(self.mindelay, new_delay), self.maxdelay)
       slot.delay = new_delay
    """
14. HTTP caching:
    # HTTPCACHE_ENABLED = True  # enable the cache
    # HTTPCACHE_EXPIRATION_SECS = 0
    # HTTPCACHE_DIR = 'httpcache'
    # HTTPCACHE_IGNORE_HTTP_CODES = []
    # HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
    # e.g. before boarding a subway with no signal, download the pages to disk first
    # and read them locally; a directory named httpcache will appear locally
Original post: https://www.cnblogs.com/junkanchen/p/10225600.html