Web Crawling: A Brief Introduction to Scrapy

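1. Installation: Any discussion of web crawling has to mention Scrapy, a full-featured crawling framework. Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a range of programs such as data mining, information processing, and archiving historical data. Let's get straight to the point and look at the two ways to install it: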

Option 1: installing on Windows takes the following steps

1. Download the Twisted wheel: http://www.lfd.uci.edu/~gohlke/pythonlibs/
2. pip3 install wheel
3. pip3 install Twisted-18.4.0-cp36-cp36m-win_amd64.whl  # pick the file that matches your Python version and architecture
4. pip3 install pywin32
5. pip3 install scrapy

Option 2: installing on Linux (the same command works on macOS)

pip3 install scrapy

2. Basic usage of Scrapy: a comparison of Django and Scrapy

Django:

# create a Django project
django-admin startproject mysite

cd mysite

# create apps
python manage.py startapp app01
python manage.py startapp app02

# start the project
python manage.py runserver

Scrapy:

# create a Scrapy project
scrapy startproject cjk

cd cjk

# create spiders
scrapy genspider chouti chouti.com
scrapy genspider cnblogs cnblogs.com

# start a spider
scrapy crawl chouti

After installing Scrapy, open a terminal and run scrapy to check the installation. Output like the following means it was installed successfully:

Last login: Sat Jan  5 18:14:13 on ttys000
chenjunkandeMBP:~ chenjunkan$ scrapy
Scrapy 1.5.1 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

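Layout of a Scrapy project:

Create a project
  scrapy startproject <project name>

  <project name>/
    <project name>/
      spiders/            # spider files
        chouti.py
        cnblogs.py
      items.py            # item definitions (persistence)
      pipelines.py        # item pipelines (persistence)
      middlewares.py      # middlewares
      settings.py         # configuration (crawler)
    scrapy.cfg            # configuration (deployment)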

How do we start a spider? Consider this simple example:

# -*- coding: utf-8 -*-
import scrapy


class ChoutiSpider(scrapy.Spider):
    # name of the spider
    name = 'chouti'
    # restrict the crawl: only URLs under dig.chouti.com are followed
    allowed_domains = ['dig.chouti.com']
    # start URLs
    start_urls = ['http://dig.chouti.com/']

    # callback, invoked automatically once a start URL has been downloaded
    def parse(self, response):
        # <200 https://dig.chouti.com/> <class 'scrapy.http.response.html.HtmlResponse'>
        print(response, type(response))

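Reading the source (from scrapy.http.response.html import HtmlResponse) shows that HtmlResponse inherits from TextResponse, which in turn inherits from Response, so HtmlResponse has the members defined on its parents (text, xpath, and so on).

Note: the example above simply prints the returned response and its type, but a few points deserve attention:

  • chouti publishes a robots.txt that disallows crawlers, so in settings.py we need to change ROBOTSTXT_OBEY to False. The generated project sets it to True; once it is False, Scrapy no longer obeys robots.txt.
  • The response passed to the callback is an instance of HtmlResponse; it wraps all the information of the HTTP response.
  • What is the difference between scrapy crawl chouti and scrapy crawl chouti --nolog? The former prints the log, the latter does not.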

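3. How Scrapy works (covered in more detail later)

Scrapy is an asynchronous, non-blocking framework built on Twisted; internally it relies on an event loop to crawl concurrently.

This is how I used to fetch several URLs, sending the requests one at a time:

import requests

url_list = ['http://www.baidu.com', 'http://www.baidu.com', 'http://www.baidu.com']

for item in url_list:
    # each call blocks until the response has arrived
    response = requests.get(item)
    print(response.text)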

Now I can do it this way instead:

from twisted.internet import defer, reactor
from twisted.web.client import getPage  # deprecated in recent Twisted releases

# Part 1: the agent starts accepting tasks
def callback(contents):
    print(contents)

deferred_list = []
# the URLs to request
url_list = ['http://www.bing.com', 'https://segmentfault.com/', 'https://stackoverflow.com/']
for url in url_list:
    # the request is not sent immediately; a Deferred object is created instead
    deferred = getPage(bytes(url, encoding='utf8'))
    # callback invoked when the request finishes
    deferred.addCallback(callback)
    # collect all the Deferred objects in a list
    deferred_list.append(deferred)

# Part 2: once the agent has finished every task, stop
dlist = defer.DeferredList(deferred_list)

def all_done(arg):
    reactor.stop()

dlist.addBoth(all_done)

# Part 3: let the agent start processing
reactor.run()
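
The same idea, many requests in flight on a single thread driven by an event loop, can be illustrated with the standard library alone. The sketch below is neither Scrapy nor Twisted code: it uses asyncio, and fake_download is a hypothetical stand-in that sleeps instead of performing a real HTTP request.

```python
import asyncio
import time


async def fake_download(url, delay):
    # Hypothetical stand-in for a network request: sleeping yields control
    # to the event loop, just as waiting on a socket would.
    await asyncio.sleep(delay)
    return "body of " + url


async def crawl(url_list, delay=0.2):
    # Schedule every download at once; the event loop interleaves them.
    tasks = [fake_download(url, delay) for url in url_list]
    return await asyncio.gather(*tasks)


def run_demo():
    urls = ["http://a.example", "http://b.example", "http://c.example"]
    started = time.perf_counter()
    bodies = asyncio.run(crawl(urls))
    elapsed = time.perf_counter() - started
    return bodies, elapsed


if __name__ == "__main__":
    bodies, elapsed = run_demo()
    print(bodies)
    print("elapsed: %.2fs" % elapsed)
```

Because the three simulated downloads wait at the same time, the total time stays close to one delay rather than three, which is the benefit the event loop gives a crawler.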

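4. Persistence

Traditionally, we persist crawled data by writing it to a file inside the spider itself. This has several drawbacks, for example:

  • There is no hook to open a connection when the spider starts and to close it when the spider finishes;
  • Responsibilities are not clearly separated

The traditional approach, writing to a file: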

# -*- coding: utf-8 -*-
import scrapy


class ChoutiSpider(scrapy.Spider):
    # name of the spider
    name = 'chouti'
    # restrict the crawl: only URLs under dig.chouti.com are followed
    allowed_domains = ['dig.chouti.com']
    # start URLs
    start_urls = ['http://dig.chouti.com/']

    # callback, invoked automatically once a start URL has been downloaded
    def parse(self, response):
        f = open('news.log', mode='a+')
        item_list = response.xpath('//div[@id="content-list"]/div[@class="item"]')
        for item in item_list:
            text = item.xpath('.//a/text()').extract_first()
            href = item.xpath('.//a/@href').extract_first()
            print(href, text.strip())
            f.write(href + '\n')
        f.close()

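This is where two important Scrapy modules come in: Item and Pipelines.

Steps to implement a pipeline in Scrapy:

a. Configure ITEM_PIPELINES in settings.py. Several pipelines can be listed; the number is the priority, and a smaller number runs earlier

ITEM_PIPELINES = {
    'cjkscrapy.pipelines.CjkscrapyPipeline': 300,
}

b. Once ITEM_PIPELINES is configured, the process_item(self, item, spider) method in pipelines.py is triggered automatically. With this change alone, however, running the spider does not print the expected message chenjunkan, because the spider does not yield any items yet;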

class CjkscrapyPipeline(object):
    def process_item(self, item, spider):
        print("chenjunkan")
        return item

c. So we first add two fields in items.py, which constrain an item to carry exactly these two fields

import scrapy

class CjkscrapyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    href = scrapy.Field()
    title = scrapy.Field()

Then add yield CjkscrapyItem(href=href, title=text) to the spider. This instantiates CjkscrapyItem with two keyword arguments, which are stored in the two fields declared on the class

# -*- coding: utf-8 -*-
import scrapy
from cjkscrapy.items import CjkscrapyItem


class ChoutiSpider(scrapy.Spider):
    # name of the spider
    name = 'chouti'
    # restrict the crawl: only URLs under dig.chouti.com are followed
    allowed_domains = ['dig.chouti.com']
    # start URLs
    start_urls = ['http://dig.chouti.com/']

    # callback, invoked automatically once a start URL has been downloaded
    def parse(self, response):
        item_list = response.xpath('//div[@id="content-list"]/div[@class="item"]')
        for item in item_list:
            text = item.xpath('.//a/text()').extract_first()
            href = item.xpath('.//a/@href').extract_first()
            yield CjkscrapyItem(href=href, title=text)

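d. Now process_item in pipelines.py is triggered automatically: each yield triggers it once. Its item argument is the object we created from CjkscrapyItem. Why use an Item class at all? Because it tells CjkscrapyPipeline which data to persist: the pipeline reads exactly the fields the item declares. And what is the spider argument of process_item(self, item, spider)? It is the instantiated spider: class ChoutiSpider(scrapy.Spider) has to be instantiated before it can run, so spider is simply the current spider object, with attributes such as name and allowed_domains

e. So when a spider yields a CjkscrapyItem, the item is handed to the pipelines for processing; when it yields a Request, the request is sent off to be downloaded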
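
To make the "an Item constrains the fields" point concrete, here is a minimal sketch using only the standard library. FieldConstrainedItem and LinkItem are hypothetical stand-ins, not Scrapy classes; they only imitate the behaviour that a key which was not declared as a field is rejected.

```python
class FieldConstrainedItem(dict):
    """Hypothetical stand-in for scrapy.Item: only declared fields are accepted."""

    fields = ()

    def __init__(self, **kwargs):
        super().__init__()
        for key, value in kwargs.items():
            self[key] = value  # goes through the check in __setitem__

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(
                "%s does not support field: %s" % (type(self).__name__, key)
            )
        super().__setitem__(key, value)


class LinkItem(FieldConstrainedItem):
    # the two fields a pipeline is allowed to rely on
    fields = ("href", "title")


item = LinkItem(href="/link/1", title="hello")
print(item["href"], item["title"])

try:
    LinkItem(author="nobody")
except KeyError as exc:
    print("rejected:", exc)
```

A pipeline that receives such an object knows in advance which keys can exist, which is the contract between spider and pipeline described above.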

Summary:

a. Write the pipeline class first


class XXXPipeline(object):
    def process_item(self, item, spider):
        return item


b. Write the Item class


class XdbItem(scrapy.Item):
    href = scrapy.Field()
    title = scrapy.Field()


c. Configure
ITEM_PIPELINES = {
    'xdb.pipelines.XdbPipeline': 300,
}

d. In the spider, every yield of an item calls process_item once.

yield an Item object

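We now have a first understanding of persistence, but one problem remains: every yield triggers process_item once. If process_item opens and closes a connection each time, performance suffers noticeably. For example:

def process_item(self, item, spider):
    f = open("xx.log", "a+")
    f.write(item["href"] + "\n")
    f.close()
    return item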

The CjkscrapyPipeline class can define two more methods, open_spider and close_spider. Opening the connection in open_spider and closing it in close_spider avoids opening and closing it repeatedly:

class CjkscrapyPipeline(object):

    def open_spider(self, spider):
        print("spider opened")
        self.f = open("new.log", "a+")

    def process_item(self, item, spider):
        self.f.write(item["href"] + "\n")

        return item

    def close_spider(self, spider):
        self.f.close()
        print("spider closed")

This works, but the code above is not very tidy: the attribute self.f is first created outside __init__. A cleaner version declares it in the constructor:

class CjkscrapyPipeline(object):
    def __init__(self):
        self.f = None

    def open_spider(self, spider):
        print("spider opened")
        self.f = open("new.log", "a+")

    def process_item(self, item, spider):
        self.f.write(item["href"] + "\n")

        return item

    def close_spider(self, spider):
        self.f.close()
        print("spider closed")

Looking at the code again, the output file path is hard-coded in the program. Can we put it in the settings file instead? For that, CjkscrapyPipeline can define from_crawler, which is a class method

@classmethod
def from_crawler(cls, crawler):
    print('File.from_crawler')
    path = crawler.settings.get('HREF_FILE_PATH')
    return cls(path)

Note: crawler.settings.get('HREF_FILE_PATH') looks up HREF_FILE_PATH across all the settings. Inside this method cls is the current class, CjkscrapyPipeline, and cls(path) instantiates it, passing path to __init__:

class CjkscrapyPipeline(object):
    def __init__(self, path):
        self.f = None
        self.path = path

    @classmethod
    def from_crawler(cls, crawler):
        print('File.from_crawler')
        path = crawler.settings.get('HREF_FILE_PATH')
        return cls(path)

    def open_spider(self, spider):
        print("spider opened")
        self.f = open(self.path, "a+")

    def process_item(self, item, spider):
        self.f.write(item["href"] + "\n")

        return item

    def close_spider(self, spider):
        self.f.close()
        print("spider closed")

So CjkscrapyPipeline now has five methods. In what order are they executed?

"""
What the source does:
    1. Check whether CjkscrapyPipeline defines from_crawler
        yes:
            obj = CjkscrapyPipeline.from_crawler(....)
        no:
            obj = CjkscrapyPipeline()
    2. obj.open_spider()

    3. obj.process_item()/obj.process_item()/obj.process_item()/obj.process_item()/obj.process_item()

    4. obj.close_spider()
"""

Note: Scrapy first checks whether the class defines from_crawler. If it does not, CjkscrapyPipeline is instantiated directly. If it does, from_crawler runs first and reads the settings; because its return value is an instance of the class, __init__ is called as part of that instantiation
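
The call order above can be checked with a small simulation. This is a sketch under stated assumptions rather than Scrapy's real machinery: run_pipeline is a hypothetical driver, and from_crawler receives a plain dict of settings instead of a crawler object so that the example runs without Scrapy installed.

```python
class FilePipeline:
    """Simplified pipeline that records the order in which its hooks run."""

    def __init__(self, path):
        self.path = path
        self.calls = ["__init__"]

    @classmethod
    def from_crawler(cls, settings):
        # In Scrapy the argument is a crawler object; a plain dict of
        # settings keeps this sketch self-contained.
        obj = cls(settings["HREF_FILE_PATH"])
        # from_crawler is entered before __init__, so record it first.
        obj.calls.insert(0, "from_crawler")
        return obj

    def open_spider(self, spider):
        self.calls.append("open_spider")

    def process_item(self, item, spider):
        self.calls.append("process_item")
        return item

    def close_spider(self, spider):
        self.calls.append("close_spider")


def run_pipeline(pipeline_cls, settings, items, spider=None):
    """Hypothetical driver that mimics the call order described above."""
    if hasattr(pipeline_cls, "from_crawler"):
        obj = pipeline_cls.from_crawler(settings)
    else:
        obj = pipeline_cls()
    obj.open_spider(spider)
    for item in items:
        obj.process_item(item, spider)
    obj.close_spider(spider)
    return obj


pipeline = run_pipeline(
    FilePipeline,
    {"HREF_FILE_PATH": "news.log"},
    [{"href": "/a"}, {"href": "/b"}],
)
print(pipeline.calls)
```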

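What is the return item in a pipeline for?

from scrapy.exceptions import DropItem

# return item  # hand the item to the process_item method of the next pipeline

# raise DropItem()  # the process_item methods of later pipelines are not executed

Note: pipelines are shared by all spiders. To customise the behaviour for one particular spider, check the spider argument yourself.

# if spider.name == 'chouti':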
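
The difference between returning the item and raising DropItem can be sketched without Scrapy. In the example below DropItem is a local stand-in for scrapy.exceptions.DropItem, and run_chain is a hypothetical driver that passes each item through the pipelines in priority order.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""


class RequireHrefPipeline:
    def process_item(self, item, spider):
        if not item.get("href"):
            # Stop processing this item: later pipelines never see it.
            raise DropItem("missing href")
        return item  # hand the item to the next pipeline


class CollectPipeline:
    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        self.seen.append(item["href"])
        return item


def run_chain(pipelines, items, spider=None):
    """Hypothetical driver: pass each item through the pipelines in order."""
    for item in items:
        try:
            for pipeline in pipelines:
                item = pipeline.process_item(item, spider)
        except DropItem:
            continue  # the dropped item goes no further


collector = CollectPipeline()
run_chain(
    [RequireHrefPipeline(), collector],
    [{"href": "/a"}, {"href": None}, {"href": "/b"}],
)
print(collector.seen)
```

Only the items that the first pipeline returned reach the second one; the item without an href is dropped on the way.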

5. Deduplication rules

Using the earlier code as an example:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['http://dig.chouti.com/']

    def parse(self, response):
        print(response.request.url)

        # item_list = response.xpath('//div[@id="content-list"]/div[@class="item"]')
        # for item in item_list:
        #     text = item.xpath('.//a/text()').extract_first()
        #     href = item.xpath('.//a/@href').extract_first()

        page_list = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        for page in page_list:
            page = "https://dig.chouti.com" + page
            yield Request(url=page, callback=self.parse)  # https://dig.chouti.com/all/hot/recent/2

Note: print(response.request.url) prints the URL currently being processed. Running the code above shows that pages 1 to 120 are each visited once, with no repeats. Scrapy deduplicates requests internally: in effect it keeps a set in memory, and a set cannot contain duplicates

Looking at the source:

Import: from scrapy.dupefilters import RFPDupeFilter (the older module path scrapy.dupefilter is deprecated)

RFPDupeFilter is a class:

class RFPDupeFilter(BaseDupeFilter):
    """Request Fingerprint duplicates filter"""

    def __init__(self, path=None, debug=False):
        self.file = None
        self.fingerprints = set()
        self.logdupes = True
        self.debug = debug
        self.logger = logging.getLogger(__name__)
        if path:
            self.file = open(os.path.join(path, 'requests.seen'), 'a+')
            self.file.seek(0)
            self.fingerprints.update(x.rstrip() for x in self.file)

    @classmethod
    def from_settings(cls, settings):
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(job_dir(settings), debug)

    def request_seen(self, request):
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)

    def request_fingerprint(self, request):
        return request_fingerprint(request)

    def close(self, reason):
        if self.file:
            self.file.close()

    def log(self, request, spider):
        if self.debug:
            msg = "Filtered duplicate request: %(request)s"
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
        elif self.logdupes:
            msg = ("Filtered duplicate request: %(request)s"
                   " - no more duplicates will be shown"
                   " (see DUPEFILTER_DEBUG to show all duplicates)")
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
            self.logdupes = False

        spider.crawler.stats.inc_value('dupefilter/filtered', spider=spider)

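The most important method of this class is request_seen, which decides whether a request has already been visited. Whenever we yield a Request, Scrapy first calls request_seen internally to check. Looking at the method on its own:

def request_seen(self, request):
    fp = self.request_fingerprint(request)
    if fp in self.fingerprints:
        return True
    self.fingerprints.add(fp)
    if self.file:
        self.file.write(fp + os.linesep)

Notes: a. First fp = self.request_fingerprint(request) runs. The request argument is the request passed in, which carries the URL. URLs vary in length, so fp can be thought of as a fixed-length digest of the request produced by request_fingerprint (similar to an MD5 value; in Scrapy 1.5 it is a SHA1 hex digest), which makes every fingerprint the same length and easy to handle;

b. Then if fp in self.fingerprints: runs. self.fingerprints is a set created in __init__: self.fingerprints = set(). If fp is already in the set, the method returns True, meaning the URL has been visited before and will not be requested again

c. If it has not been visited, self.fingerprints.add(fp) adds fp to the set

d. The visited fingerprints are kept in memory, but they can also be written to a file: if self.file: self.file.write(fp + os.linesep)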
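
Steps a to c can be reproduced with the standard library. The sketch below is a simplification under stated assumptions, not Scrapy's implementation: simple_fingerprint and SetDupeFilter are hypothetical names, the request is reduced to method, URL and body, and URL canonicalisation is left out.

```python
import hashlib


def simple_fingerprint(method, url, body=b""):
    """Simplified fingerprint: a SHA1 hex digest over method, URL and body."""
    digest = hashlib.sha1()
    digest.update(method.encode("utf-8"))
    digest.update(url.encode("utf-8"))
    digest.update(body)
    return digest.hexdigest()


class SetDupeFilter:
    """In-memory filter that mirrors the request_seen logic shown above."""

    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, method, url, body=b""):
        fp = simple_fingerprint(method, url, body)
        if fp in self.fingerprints:
            return True  # already visited, the request should be skipped
        self.fingerprints.add(fp)
        return False


dupefilter = SetDupeFilter()
first = dupefilter.request_seen("GET", "https://dig.chouti.com/all/hot/recent/2")
second = dupefilter.request_seen("GET", "https://dig.chouti.com/all/hot/recent/2")
print(first, second)
```

Whatever the length of the URL, the fingerprint is always 40 hexadecimal characters, which is why a set of fingerprints is a compact record of what has been visited.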

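How are the fingerprints written to a file?

1. First from_settings is called to read the configuration:

@classmethod
def from_settings(cls, settings):
    debug = settings.getbool('DUPEFILTER_DEBUG')
    return cls(job_dir(settings), debug)

2. It returns cls(job_dir(settings), debug), where job_dir is:

import os

def job_dir(settings):
    path = settings['JOBDIR']
    if path and not os.path.exists(path):
        os.makedirs(path)
    return path

job_dir reads JOBDIR from the settings; if a path is configured and the directory does not exist yet, it is created. The returned path is the value of JOBDIR in the settings file, so the fingerprints are written to a file only when JOBDIR is set. DUPEFILTER_DEBUG only controls whether every filtered duplicate is logged.

Writing the fingerprints to a file is usually of limited value, though; later I will show how to store them in Redis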

Custom deduplication rules: the built-in RFPDupeFilter inherits from BaseDupeFilter, so our own filter can inherit from the same base class

from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint


class CjkDupeFilter(BaseDupeFilter):

    def __init__(self):
        self.visited_fd = set()

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        fd = request_fingerprint(request=request)
        if fd in self.visited_fd:
            return True
        self.visited_fd.add(fd)

    def open(self):  # can return deferred
        print('start')

    def close(self, reason):  # can return a deferred
        print('end')

    # def log(self, request, spider):  # log that a request has been filtered
    #     print('log')

The default deduplication setting: DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'

To use our own rule, set it in settings.py: DUPEFILTER_CLASS = 'cjkscrapy.dupefilters.CjkDupeFilter'

Note: later we will implement the deduplication rule in Redis

Note: deduplication can also be controlled per request in the spider. Request takes a dont_filter parameter; the default False means the request goes through the dedupe filter, while dont_filter=True means the request bypasses it

class Request(object_ref):

    def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                 cookies=None, meta=None, encoding='utf-8', priority=0,
                 dont_filter=False, errback=None, flags=None):

6. Depth control (explained in detail later)

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['http://dig.chouti.com/']

    def parse(self, response):
        print(response.request.url, response.meta.get("depth", 0))

        # item_list = response.xpath('//div[@id="content-list"]/div[@class="item"]')
        # for item in item_list:
        #     text = item.xpath('.//a/text()').extract_first()
        #     href = item.xpath('.//a/@href').extract_first()

        page_list = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        for page in page_list:
            page = "https://dig.chouti.com" + page
            yield Request(url=page, callback=self.parse, dont_filter=False)  # https://dig.chouti.com/all/hot/recent/2

Depth is how many link hops a page is away from the start URLs. To limit how deep the spider crawls, set DEPTH_LIMIT = 3 in settings.py

7. Handling cookies manually

The parameters of the Request object:

class Request(object_ref):

    def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                 cookies=None, meta=None, encoding='utf-8', priority=0,
                 dont_filter=False, errback=None, flags=None):

For now we focus on these Request parameters: url, callback=None, method='GET', headers=None, body=None, cookies=None. Sending a request boils down to request headers + request body + cookies

Example: logging in to chouti automatically

The first request to chouti returns a cookie. How do we get that cookie out of the response?

First import a class for handling cookies: from scrapy.http.cookies import CookieJar

# -*- coding: utf-8 -*-
import scrapy
from scrapy.dupefilter import RFPDupeFilter  # unused; deprecated path, newer code uses scrapy.dupefilters
from scrapy.http.cookies import CookieJar


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['http://dig.chouti.com/']

    def parse(self, response):
        cookie_dict = {}

        # read the cookies from the response headers into a CookieJar object
        cookie_jar = CookieJar()
        # extract the cookies from response and response.request
        cookie_jar.extract_cookies(response, response.request)

        # flatten the jar (domain -> path -> name -> cookie) into a dict
        for k, v in cookie_jar._cookies.items():
            for i, j in v.items():
                for m, n in j.items():
                    cookie_dict[m] = n.value
        print(cookie_dict)

Running the code above gives us the cookies:

chenjunkandeMBP:cjkscrapy chenjunkan$ scrapy crawl chouti --nolog
/Users/chenjunkan/Desktop/scrapytest/cjkscrapy/cjkscrapy/spiders/chouti.py:3: ScrapyDeprecationWarning: Module `scrapy.dupefilter` is deprecated, use `scrapy.dupefilters` instead
  from scrapy.dupefilter import RFPDupeFilter
{'gpsd': 'd052f4974404d8c431f3c7c1615694c4', 'JSESSIONID': 'aaaUxbbxMYOWh4T7S7rGw'}
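
The three nested loops walk a mapping of domain to path to cookie name. The standard library's http.cookiejar.CookieJar stores cookies in the same nested shape, so the flattening step can be tried without a network. This is a sketch with made-up cookie values; make_cookie and cookies_to_dict are hypothetical helpers, and it relies on the fact that iterating over a stdlib jar yields the individual cookies.

```python
from http.cookiejar import Cookie, CookieJar


def make_cookie(name, value, domain="dig.chouti.com", path="/"):
    """Build a minimal cookie for the demonstration (values are made up)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path=path, path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def cookies_to_dict(jar):
    """Flatten a cookie jar into {name: value} by iterating over it."""
    return {cookie.name: cookie.value for cookie in jar}


jar = CookieJar()
jar.set_cookie(make_cookie("gpsd", "abc123"))
jar.set_cookie(make_cookie("JSESSIONID", "xyz789"))
print(cookies_to_dict(jar))
```

Iterating over the jar avoids reaching into the private _cookies attribute, at the cost of losing the domain and path information, which the dictionary built in the spider discards as well.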

A complete example that logs in to chouti automatically and upvotes the posts:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.dupefilter import RFPDupeFilter  # unused; deprecated path, newer code uses scrapy.dupefilters
from scrapy.http.cookies import CookieJar
from scrapy.http import Request


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['http://dig.chouti.com/']
    cookie_dict = {}

    def parse(self, response):

        # read the cookies from the response headers into a CookieJar object
        cookie_jar = CookieJar()
        # extract the cookies from response and response.request
        cookie_jar.extract_cookies(response, response.request)

        # flatten the jar (domain -> path -> name -> cookie) into a dict
        for k, v in cookie_jar._cookies.items():
            for i, j in v.items():
                for m, n in j.items():
                    self.cookie_dict[m] = n.value
        print(self.cookie_dict)

        yield Request(
            url='https://dig.chouti.com/login',
            method='POST',
            # replace the placeholders with your own credentials
            body="phone=<your phone number>&password=<your password>&oneMonth=1",
            cookies=self.cookie_dict,
            headers={
                'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'
            },
            callback=self.check_login
        )

    def check_login(self, response):
        print(response.text)
        yield Request(
            url='https://dig.chouti.com/all/hot/recent/1',
            cookies=self.cookie_dict,
            callback=self.index
        )

    def index(self, response):
        news_list = response.xpath('//div[@id="content-list"]/div[@class="item"]')
        for new in news_list:
            link_id = new.xpath('.//div[@class="part2"]/@share-linkid').extract_first()
            yield Request(
                url='http://dig.chouti.com/link/vote?linksId=%s' % (link_id,),
                method='POST',
                cookies=self.cookie_dict,
                callback=self.check_result
            )
        # upvote the posts on every page
        page_list = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        for page in page_list:
            page = "https://dig.chouti.com" + page
            yield Request(url=page, callback=self.index)  # https://dig.chouti.com/all/hot/recent/2

    def check_result(self, response):
        print(response.text)

Output:

chenjunkandeMBP:cjkscrapy chenjunkan$ scrapy crawl chouti --nolog
/Users/chenjunkan/Desktop/scrapytest/cjkscrapy/cjkscrapy/spiders/chouti.py:3: ScrapyDeprecationWarning: Module `scrapy.dupefilter` is deprecated, use `scrapy.dupefilters` instead
  from scrapy.dupefilter import RFPDupeFilter
{'gpsd': '78613f08c985435d5d0eedc08b0ed812', 'JSESSIONID': 'aaaTmxnFAJJGn9Muf8rGw'}
{"result":{"code":"9999", "message":"", "data":{"complateReg":"0","destJid":"cdu_53587312848"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692112533000","lvCount":"6","nick":"chenjunkan","uvCount":"26","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692112720000","lvCount":"28","nick":"chenjunkan","uvCount":"27","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692112862000","lvCount":"24","nick":"chenjunkan","uvCount":"33","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692112849000","lvCount":"29","nick":"chenjunkan","uvCount":"33","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692112872000","lvCount":"48","nick":"chenjunkan","uvCount":"33","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692112877000","lvCount":"23","nick":"chenjunkan","uvCount":"33","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692112877000","lvCount":"69","nick":"chenjunkan","uvCount":"33","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692112877000","lvCount":"189","nick":"chenjunkan","uvCount":"33","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692112926000","lvCount":"98","nick":"chenjunkan","uvCount":"35","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692112951000","lvCount":"61","nick":"chenjunkan","uvCount":"35","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113086000","lvCount":"13","nick":"chenjunkan","uvCount":"37","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113097000","lvCount":"17","nick":"chenjunkan","uvCount":"38","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113118000","lvCount":"21","nick":"chenjunkan","uvCount":"41","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113155000","lvCount":"86","nick":"chenjunkan","uvCount":"41","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113140000","lvCount":"22","nick":"chenjunkan","uvCount":"41","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113148000","lvCount":"25","nick":"chenjunkan","uvCount":"41","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113416000","lvCount":"22","nick":"chenjunkan","uvCount":"47","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113386000","lvCount":"13","nick":"chenjunkan","uvCount":"46","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113381000","lvCount":"70","nick":"chenjunkan","uvCount":"46","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113408000","lvCount":"27","nick":"chenjunkan","uvCount":"47","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113402000","lvCount":"17","nick":"chenjunkan","uvCount":"47","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113428000","lvCount":"41","nick":"chenjunkan","uvCount":"47","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113528000","lvCount":"55","nick":"chenjunkan","uvCount":"49","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113544000","lvCount":"16","nick":"chenjunkan","uvCount":"49","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113643000","lvCount":"22","nick":"chenjunkan","uvCount":"50","voteTime":"小于1分钟前"}}}
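8. Customising the start requests

Before discussing the start URLs, let's review iterables and iterators. What is an iterable? A list is one

# an iterable
# li = [11,22,33]

# convert an iterable into an iterator
# iter(li)

A generator is a special kind of iterator, so calling iter() on it simply returns the generator itself

def func():
    yield 11
    yield 22
    yield 33

# a generator
li = func()

# iter() on a generator returns the same object
iter(li)

v = next(li)
print(v)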
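
The relationship between the three notions can be checked with the abstract base classes in collections.abc. This is a small illustrative sketch; the variable names are arbitrary.

```python
from collections.abc import Iterable, Iterator

numbers = [11, 22, 33]


def func():
    yield 11
    yield 22
    yield 33


gen = func()

checks = {
    # a list can be looped over, but it is not itself an iterator
    "list is Iterable": isinstance(numbers, Iterable),
    "list is Iterator": isinstance(numbers, Iterator),
    # iter() turns the list into an iterator
    "iter(list) is Iterator": isinstance(iter(numbers), Iterator),
    # a generator already is an iterator, so iter() returns it unchanged
    "generator is Iterator": isinstance(gen, Iterator),
    "iter(generator) is generator": iter(gen) is gen,
}

for label, value in checks.items():
    print(label, "->", value)

first_value = next(gen)
print(first_value)
```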

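The scheduler only accepts Request objects, which it hands to the downloader. So when a spider starts, Scrapy does not run parse first: it wraps every URL in start_urls in a Request object and passes those objects to the scheduler. The flow is as follows:

"""
The Scrapy engine fetches the start URLs from the spider:
    1. Call start_requests and take its return value
    2. v = iter(return value); it does not matter whether the return value is a generator or another iterable, because it is converted to an iterator
    3.
        req1 = v.__next__()
        req2 = v.__next__()
        req3 = v.__next__()
        ...
    4. Put all the requests into the scheduler

"""

Note: the Request objects are not taken from start_urls directly. Scrapy calls the spider's start_requests method; if the spider class does not define one, the parent class's implementation is used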

Two ways to write start_requests:

Option 1: start_requests is a generator function; internally the generator is converted to an iterator

def start_requests(self):
    # option 1:
    for url in self.start_urls:
        yield Request(url=url)

Option 2: start_requests returns an iterable object

def start_requests(self):
    # option 2:
    req_list = []
    for url in self.start_urls:
        req_list.append(Request(url=url))
    return req_list
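
Both options work because the engine wraps the return value in iter() before pulling requests out of it. The sketch below shows that step in isolation; collect_start_requests is a hypothetical stand-in for the engine, and requests are represented as plain (method, url) tuples so that the example runs without Scrapy.

```python
def start_requests_as_generator(start_urls):
    # option 1: a generator function
    for url in start_urls:
        yield ("GET", url)


def start_requests_as_list(start_urls):
    # option 2: return an iterable (here a list)
    return [("GET", url) for url in start_urls]


def collect_start_requests(start_requests_result):
    """Hypothetical engine step: wrap in iter() and drain with next()."""
    iterator = iter(start_requests_result)
    scheduled = []
    while True:
        try:
            scheduled.append(next(iterator))
        except StopIteration:
            break
    return scheduled


urls = ["http://dig.chouti.com/", "http://dig.chouti.com/all/hot/recent/2"]
from_generator = collect_start_requests(start_requests_as_generator(urls))
from_list = collect_start_requests(start_requests_as_list(urls))
print(from_generator == from_list)
```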

Even when we do not define start_requests ourselves, the base class provides one; see scrapy.Spider

From the source:

    def start_requests(self):
        cls = self.__class__
        if method_is_overridden(cls, Spider, 'make_requests_from_url'):
            warnings.warn(
                "Spider.make_requests_from_url method is deprecated; it "
                "won't be called in future Scrapy releases. Please "
                "override Spider.start_requests method instead (see %s.%s)." % (
                    cls.__module__, cls.__name__
                ),
            )
            for url in self.start_urls:
                yield self.make_requests_from_url(url)
        else:
            for url in self.start_urls:
                yield Request(url, dont_filter=True)

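If we define our own start_requests, it takes precedence over the inherited one. By default the start requests are GET requests; overriding the method lets us send POST requests instead

Note: the start URLs could also be fetched from Redis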

9. Depth and priority

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['http://dig.chouti.com/']

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        print(response.meta.get("depth", 0))

        page_list = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        for page in page_list:
            page = "https://dig.chouti.com" + page
            yield Request(url=page, callback=self.parse)  # https://dig.chouti.com/all/hot/recent/2

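When we yield a Request, the request object is placed in the scheduler and then handed to the downloader. Along the way it passes through a number of middlewares, which gives us a place to do extra work. See from scrapy.spidermiddlewares.depth import DepthMiddleware

DepthMiddleware: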

class DepthMiddleware(object):

    def __init__(self, maxdepth, stats=None, verbose_stats=False, prio=1):
        self.maxdepth = maxdepth
        self.stats = stats
        self.verbose_stats = verbose_stats
        self.prio = prio

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        maxdepth = settings.getint('DEPTH_LIMIT')
        verbose = settings.getbool('DEPTH_STATS_VERBOSE')
        prio = settings.getint('DEPTH_PRIORITY')
        return cls(maxdepth, crawler.stats, verbose, prio)

    def process_spider_output(self, response, result, spider):
        def _filter(request):
            if isinstance(request, Request):
                depth = response.meta['depth'] + 1
                request.meta['depth'] = depth
                if self.prio:
                    request.priority -= depth * self.prio
                if self.maxdepth and depth > self.maxdepth:
                    logger.debug(
                        "Ignoring link (depth > %(maxdepth)d): %(requrl)s ",
                        {'maxdepth': self.maxdepth, 'requrl': request.url},
                        extra={'spider': spider}
                    )
                    return False
                elif self.stats:
                    if self.verbose_stats:
                        self.stats.inc_value('request_depth_count/%s' % depth,
                                             spider=spider)
                    self.stats.max_value('request_depth_max', depth,
                                         spider=spider)
            return True

        # base case (depth=0)
        if self.stats and 'depth' not in response.meta:
            response.meta['depth'] = 0
            if self.verbose_stats:
                self.stats.inc_value('request_depth_count/0', spider=spider)

        return (r for r in result or () if _filter(r))

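What a spider yields passes through this middleware's process_spider_output method:

Notes:

1. if self.stats and 'depth' not in response.meta:
For the first request, the meta argument of Request is None, so the meta of the returned response is empty as well. The request attribute of a response (response.request) is the request we built in the first place. Looking at the Response class (from scrapy.http import Response), response.meta is defined as:
@property
def meta(self):
    try:
        return self.request.meta
    except AttributeError:
        raise AttributeError(
            "Response.meta not available, this response "
            "is not tied to any request"
        )
From this we can conclude: response.meta is response.request.meta
2. When we yield a Request, process_spider_output is triggered; if the condition above holds, the depth of the current response is initialised: response.meta['depth'] = 0
3. return (r for r in result or () if _filter(r)): result holds the objects yielded by the callback; each of them is passed to _filter
4. If _filter returns True, the request is allowed and handed to the scheduler; if it returns False, the request is discarded
            if isinstance(request, Request):
                depth = response.meta['depth'] + 1
                request.meta['depth'] = depth
If the object is a Request, its depth is the depth of the response plus 1;
5. if self.maxdepth and depth > self.maxdepth: here maxdepth = settings.getint('DEPTH_LIMIT'); if the depth exceeds the limit we configured, the request is discarded

Priority strategy: when DEPTH_PRIORITY is positive, the deeper a request is, the lower its priority; a request's priority defaults to 0

if self.prio:
    request.priority -= depth * self.prio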
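
The interplay of depth, priority and the depth limit can be traced with a small stand-alone function. This is a simplified sketch of the _filter logic, not the middleware itself: apply_depth is a hypothetical helper, and each request is a plain dict with url and priority keys instead of a scrapy.Request.

```python
def apply_depth(parent_depth, requests, maxdepth=0, prio=1):
    """Simplified version of the _filter logic shown above.

    Returns the requests that survive the depth limit, after recording
    their depth and lowering their priority.
    """
    kept = []
    depth = parent_depth + 1
    for request in requests:
        request["depth"] = depth
        if prio:
            # deeper requests get a lower priority
            request["priority"] -= depth * prio
        if maxdepth and depth > maxdepth:
            continue  # beyond the limit: the request is discarded
        kept.append(request)
    return kept


first_hop = apply_depth(0, [{"url": "/page/2", "priority": 0}], maxdepth=2)
second_hop = apply_depth(1, [{"url": "/page/3", "priority": 0}], maxdepth=2)
third_hop = apply_depth(2, [{"url": "/page/4", "priority": 0}], maxdepth=2)
print(first_hop)
print(second_hop)
print(third_hop)
```

With a limit of 2, requests found on the start page and on its children are kept with priorities -1 and -2, while requests found one level deeper are dropped.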

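10. Proxies

Scrapy's built-in proxy support: when the scheduler hands a request to the downloader, the request passes through the downloader middlewares, and the built-in proxy support lives in one of them

Path: /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/downloadermiddlewares/httpproxy.py

The HttpProxyMiddleware class: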

import base64
from six.moves.urllib.request import getproxies, proxy_bypass
from six.moves.urllib.parse import unquote
try:
    from urllib2 import _parse_proxy
except ImportError:
    from urllib.request import _parse_proxy
from six.moves.urllib.parse import urlunparse

from scrapy.utils.httpobj import urlparse_cached
from scrapy.exceptions import NotConfigured
from scrapy.utils.python import to_bytes


class HttpProxyMiddleware(object):

    def __init__(self, auth_encoding='latin-1'):
        self.auth_encoding = auth_encoding
        self.proxies = {}
        for type, url in getproxies().items():
            self.proxies[type] = self._get_proxy(url, type)

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('HTTPPROXY_ENABLED'):
            raise NotConfigured
        auth_encoding = crawler.settings.get('HTTPPROXY_AUTH_ENCODING')
        return cls(auth_encoding)

    def _basic_auth_header(self, username, password):
        user_pass = to_bytes(
            '%s:%s' % (unquote(username), unquote(password)),
            encoding=self.auth_encoding)
        return base64.b64encode(user_pass).strip()

    def _get_proxy(self, url, orig_type):
        proxy_type, user, password, hostport = _parse_proxy(url)
        proxy_url = urlunparse((proxy_type or orig_type, hostport, '', '', '', ''))

        if user:
            creds = self._basic_auth_header(user, password)
        else:
            creds = None

        return creds, proxy_url

    def process_request(self, request, spider):
        # ignore if proxy is already set
        if 'proxy' in request.meta:
            if request.meta['proxy'] is None:
                return
            # extract credentials if present
            creds, proxy_url = self._get_proxy(request.meta['proxy'], '')
            request.meta['proxy'] = proxy_url
            if creds and not request.headers.get('Proxy-Authorization'):
                request.headers['Proxy-Authorization'] = b'Basic ' + creds
            return
        elif not self.proxies:
            return

        parsed = urlparse_cached(request)
        scheme = parsed.scheme

        # 'no_proxy' is only supported by http schemes
        if scheme in ('http', 'https') and proxy_bypass(parsed.hostname):
            return

        if scheme in self.proxies:
            self._set_proxy(request, scheme)

    def _set_proxy(self, request, scheme):
        creds, proxy = self.proxies[scheme]
        request.meta['proxy'] = proxy
        if creds:
            request.headers['Proxy-Authorization'] = b'Basic ' + creds

    def process_request(self, request, spider):
        # ignore if proxy is already set
        if 'proxy' in request.meta:
            if request.meta['proxy'] is None:
                return
            # extract credentials if present
            creds, proxy_url = self._get_proxy(request.meta['proxy'], '')
            request.meta['proxy'] = proxy_url
            if creds and not request.headers.get('Proxy-Authorization'):
                request.headers['Proxy-Authorization'] = b'Basic ' + creds
            return
        elif not self.proxies:
            return

        parsed = urlparse_cached(request)
        scheme = parsed.scheme

        # 'no_proxy' is only supported by http schemes
        if scheme in ('http', 'https') and proxy_bypass(parsed.hostname):
            return

        if scheme in self.proxies:
            self._set_proxy(request, scheme)

Notes:

        if scheme in self.proxies:
            self._set_proxy(request, scheme)

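What is self.proxies? It starts out as an empty dict (self.proxies = {}) and is filled in __init__ by the loop below. In essence, adding a proxy just means attaching some extra data to the request: the proxy URL in request.meta and, when credentials are present, a Proxy-Authorization header.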

        for type, url in getproxies().items():
            self.proxies[type] = self._get_proxy(url, type)

Next, getproxies. Where there is no system proxy configuration to consult (on Linux, for example), getproxies = getproxies_environment:

def getproxies_environment():
    """Return a dictionary of scheme -> proxy server URL mappings.

    Scan the environment for variables named <scheme>_proxy;
    this seems to be the standard convention.  If you need a
    different way, you can pass a proxies dictionary to the
    [Fancy]URLopener constructor.

    """
    proxies = {}
    # in order to prefer lowercase variables, process environment in
    # two passes: first matches any, second pass matches lowercase only
    for name, value in os.environ.items():
        name = name.lower()
        if value and name[-6:] == '_proxy':
            proxies[name[:-6]] = value
    # CVE-2016-1000110 - If we are running as CGI script, forget HTTP_PROXY
    # (non-all-lowercase) as it may be set from the web server by a "Proxy:"
    # header from the client
    # If "proxy" is lowercase, it will still be used thanks to the next block
    if 'REQUEST_METHOD' in os.environ:
        proxies.pop('http', None)
    for name, value in os.environ.items():
        if name[-6:] == '_proxy':
            name = name.lower()
            if value:
                proxies[name[:-6]] = value
            else:
                proxies.pop(name[:-6], None)
    return proxies

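Notes: a. Proxies are read from the environment through os.environ. Any variable whose name ends in _proxy (its last six characters) counts. A single environment variable can be read with os.environ.get: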

import os

v = os.environ.get('XXX')
print(v)
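
To see the environment scan at work, the standard library function can be called directly. This is a small sketch with made-up proxy addresses (10.0.0.1 and 10.0.0.2 are assumptions, set only for the current process):

```python
import os
from urllib.request import getproxies_environment

# Hypothetical proxy addresses, visible only to this process.
os.environ["http_proxy"] = "http://10.0.0.1:8080"
os.environ["https_proxy"] = "http://10.0.0.2:8080"

# The part of each variable name before "_proxy" becomes the dict key.
proxies = getproxies_environment()
print(proxies["http"], proxies["https"])
```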

b. This function gives us a dict such as:

proxies = {
    'http': '192.168.3.3',
    'https': '192.168.3.4',
}

c. Next, self.proxies[type] = self._get_proxy(url, type):

    def _get_proxy(self, url, orig_type):
        proxy_type, user, password, hostport = _parse_proxy(url)
        proxy_url = urlunparse((proxy_type or orig_type, hostport, '', '', '', ''))

        if user:
            creds = self._basic_auth_header(user, password)
        else:
            creds = None

        return creds, proxy_url
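
What _get_proxy produces can be reproduced with public standard-library calls. The sketch below is an approximation, not Scrapy's code: it swaps the private _parse_proxy for urllib.parse.urlsplit, and split_proxy plus the user:pass@10.0.0.1 URL are hypothetical names and values chosen for illustration.

```python
import base64
from urllib.parse import unquote, urlsplit


def split_proxy(url, encoding="latin-1"):
    """Return (creds, proxy_url) for a proxy URL that may embed user:password."""
    parts = urlsplit(url)
    if parts.port is None:
        hostport = parts.hostname
    else:
        hostport = "%s:%d" % (parts.hostname, parts.port)
    # The credentials are stripped from the URL that is kept as the proxy address.
    proxy_url = "%s://%s" % (parts.scheme, hostport)
    if parts.username:
        user_pass = "%s:%s" % (unquote(parts.username), unquote(parts.password or ""))
        creds = base64.b64encode(user_pass.encode(encoding))
    else:
        creds = None
    return creds, proxy_url


creds, proxy_url = split_proxy("http://user:pass@10.0.0.1:9999/")
header = b"Basic " + creds  # value for the Proxy-Authorization header
print(proxy_url, header)
```

The credentials end up only in the header, while the proxy URL itself carries just the scheme, host and port.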

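Using the built-in proxy support:

Set the proxy in os.environ early at startup. The middleware reads the environment only once, in its __init__, so the variables must be in place before it is instantiated.
import os

import scrapy
from scrapy import Request

# Set at import time so the values exist before the middleware is created.
os.environ['HTTPS_PROXY'] = "http://root:[email protected]:9999/"
os.environ['HTTP_PROXY'] = '19.11.2.32'


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    cookie_dict = {}

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse)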

There is another way to use the built-in support: set the proxy in the request's meta from inside the spider.

The relevant source:

        if 'proxy' in request.meta:
            if request.meta['proxy'] is None:
                return
            # extract credentials if present
            creds, proxy_url = self._get_proxy(request.meta['proxy'], '')
            request.meta['proxy'] = proxy_url
            if creds and not request.headers.get('Proxy-Authorization'):
                request.headers['Proxy-Authorization'] = b'Basic ' + creds
            return
        elif not self.proxies:
            return

Using meta:

import scrapy
from scrapy import Request


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    cookie_dict = {}

    def start_requests(self):
        for url in self.start_urls:
            # The proxy URL is a plain string; it must not carry extra quote characters.
            yield Request(url=url, callback=self.parse, meta={'proxy': "http://root:[email protected]:9999/"})

A custom proxy middleware:

Example code:

import base64
import random
from six.moves.urllib.parse import unquote

try:
    from urllib2 import _parse_proxy
except ImportError:
    from urllib.request import _parse_proxy
from six.moves.urllib.parse import urlunparse
from scrapy.utils.python import to_bytes

class CjkProxyMiddleware(object):

    def _basic_auth_header(self, username, password):
        user_pass = to_bytes(
            '%s:%s' % (unquote(username), unquote(password)),
            encoding='latin-1')
        return base64.b64encode(user_pass).strip()

    def process_request(self, request, spider):
        PROXIES = [
            "http://root:[email protected]:9999/",
            "http://root:[email protected]:9999/",
            "http://root:[email protected]:9999/",
            "http://root:[email protected]:9999/",
            "http://root:[email protected]:9999/",
            "http://root:[email protected]:9999/",
        ]
        url = random.choice(PROXIES)

        orig_type = ""
        proxy_type, user, password, hostport = _parse_proxy(url)
        proxy_url = urlunparse((proxy_type or orig_type, hostport, '', '', '', ''))

        if user:
            creds = self._basic_auth_header(user, password)
        else:
            creds = None
        request.meta['proxy'] = proxy_url
        if creds:
            request.headers['Proxy-Authorization'] = b'Basic ' + creds

11. Selectors (covered in detail later)

Example code:

html = """<!DOCTYPE html>
<html>
    <head lang="en">
        <meta charset="UTF-8">
        <title></title>
    </head>
    <body>
        <ul>
            <li class="item-"><a id='i1' href="link.html">first item</a></li>
            <li class="item-0"><a id='i2' href="llink.html">first item</a></li>
            <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
        </ul>
        <div><a href="llink2.html">second item</a></div>
    </body>
</html>
"""

from scrapy.http import HtmlResponse
from scrapy.selector import Selector

response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')

# hxs = Selector(response)
# hxs.xpath()
response.xpath('//a')  # an empty expression is not valid XPath, so select all links here
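
For a dependency-free feel of XPath-style queries, the standard library's ElementTree supports a limited subset. This is only a warm-up sketch under two assumptions: it does not use Scrapy's selectors, and the markup is a cut-down, well-formed version of the sample above, because ElementTree needs valid XML.

```python
import xml.etree.ElementTree as ET

# A well-formed, shortened version of the sample markup.
snippet = """
<body>
  <ul>
    <li class="item-0"><a id="i2" href="llink.html">first item</a></li>
    <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
  </ul>
  <div><a href="llink2.html">second item</a></div>
</body>
"""

root = ET.fromstring(snippet)
# All links that sit directly inside a list item.
hrefs = [a.get("href") for a in root.findall(".//li/a")]
# One link picked by its id attribute.
by_id = root.find(".//a[@id='i2']").text
print(hrefs, by_id)
```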

12. Downloader middleware

Source:

from scrapy import signals


class CjkscrapyDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

A custom downloader middleware: create a file md.py first, then register its classes in settings.py

DOWNLOADER_MIDDLEWARES = {
    # 'cjkscrapy.middlewares.CjkscrapyDownloaderMiddleware': 543,
    'cjkscrapy.md.Md1': 666,
    'cjkscrapy.md.Md2': 667,
}

md.py

class Md1(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    def process_request(self, request, spider):
        print('md1.process_request', request)

    def process_response(self, request, response, spider):
        print('m1.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        pass

class Md2(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    def process_request(self, request, spider):
        print('md2.process_request', request)

    def process_response(self, request, response, spider):
        print('m2.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        pass

Output:

chenjunkandeMBP:cjkscrapy chenjunkan$ scrapy crawl chouti --nolog
md1.process_request <GET http://dig.chouti.com/>
md2.process_request <GET http://dig.chouti.com/>
m2.process_response <GET http://dig.chouti.com/> <301 http://dig.chouti.com/>
m1.process_response <GET http://dig.chouti.com/> <301 http://dig.chouti.com/>
md1.process_request <GET https://dig.chouti.com/>
md2.process_request <GET https://dig.chouti.com/>
m2.process_response <GET https://dig.chouti.com/> <200 https://dig.chouti.com/>
m1.process_response <GET https://dig.chouti.com/> <200 https://dig.chouti.com/>
response <200 https://dig.chouti.com/>

a. Returning a Response object

from scrapy.http import HtmlResponse
from scrapy.http import Request

class Md1(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    def process_request(self, request, spider):
        print('md1.process_request', request)
        # 1. Return a Response
        import requests
        result = requests.get(request.url)
        return HtmlResponse(url=request.url, status=200, headers=None, body=result.content)

    def process_response(self, request, response, spider):
        print('m1.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        pass

class Md2(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    def process_request(self, request, spider):
        print('md2.process_request', request)

    def process_response(self, request, response, spider):
        print('m2.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        pass

Output:

chenjunkandeMBP:cjkscrapy chenjunkan$ scrapy crawl chouti --nolog
md1.process_request <GET http://dig.chouti.com/>
m2.process_response <GET http://dig.chouti.com/> <200 http://dig.chouti.com/>
m1.process_response <GET http://dig.chouti.com/> <200 http://dig.chouti.com/>
response <200 http://dig.chouti.com/>

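Notes: the output shows that this step can fake the download result. The returned Response is passed straight on as the result, so md2.process_request and the real download never run, while every process_response still does.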
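
The call order seen in the two runs above can be modelled in a few lines. This is a simplified sketch, not Scrapy's middleware manager: Recorder and run_chain are hypothetical names, requests and responses are plain strings, and exceptions and returned Requests are left out.

```python
class Recorder:
    """Stand-in middleware that records calls; `fake` short-circuits the download."""

    def __init__(self, name, log, fake=None):
        self.name, self.log, self.fake = name, log, fake

    def process_request(self, request):
        self.log.append(self.name + ".process_request")
        return self.fake

    def process_response(self, request, response):
        self.log.append(self.name + ".process_response")
        return response


def run_chain(middlewares, request, download):
    response = None
    for mw in middlewares:                # in ascending order of the settings value
        response = mw.process_request(request)
        if response is not None:          # a returned response skips the rest
            break
    if response is None:
        response = download(request)      # the real download happens only here
    for mw in reversed(middlewares):      # every process_response runs, in reverse
        response = mw.process_response(request, response)
    return response


log = []
run_chain([Recorder("md1", log), Recorder("md2", log)], "GET /", lambda r: "200")
normal = list(log)

log.clear()
run_chain([Recorder("md1", log, fake="200 (faked)"), Recorder("md2", log)],
          "GET /", lambda r: "200")
faked = list(log)
print(normal)
print(faked)
```

In the second run md2.process_request is missing from the log, matching the output above.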

b. Returning a Request

from scrapy.http import HtmlResponse
from scrapy.http import Request

class Md1(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    def process_request(self, request, spider):
        print('md1.process_request', request)
        # 2. Return a Request
        return Request('https://dig.chouti.com/r/tec/hot/1')

    def process_response(self, request, response, spider):
        print('m1.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        pass

class Md2(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    def process_request(self, request, spider):
        print('md2.process_request', request)

    def process_response(self, request, response, spider):
        print('m2.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        pass

Output:

chenjunkandeMBP:cjkscrapy chenjunkan$ scrapy crawl chouti --nolog
md1.process_request <GET http://dig.chouti.com/>
md1.process_request <GET https://dig.chouti.com/r/tec/hot/1>

c. Raising an exception: the exception is then passed to the process_exception methods, so define one

from scrapy.http import HtmlResponse
from scrapy.http import Request

class Md1(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    def process_request(self, request, spider):
        print('md1.process_request', request)
        # 3. Raise an exception
        from scrapy.exceptions import IgnoreRequest
        raise IgnoreRequest

    def process_response(self, request, response, spider):
        print('m1.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        pass

class Md2(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    def process_request(self, request, spider):
        print('md2.process_request', request)

    def process_response(self, request, response, spider):
        print('m2.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        pass

Output:

chenjunkandeMBP:cjkscrapy chenjunkan$ scrapy crawl chouti --nolog
md1.process_request <GET http://dig.chouti.com/>

d. In practice we rarely use any of the above. The usual job is to modify the request itself (headers, cookies and so on)

from scrapy.http import HtmlResponse
from scrapy.http import Request

class Md1(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    def process_request(self, request, spider):
        print('md1.process_request', request)
        request.headers[
            'user-agent'] = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"

    def process_response(self, request, response, spider):
        print('m1.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        pass

class Md2(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    def process_request(self, request, spider):
        print('md2.process_request', request)

    def process_response(self, request, response, spider):
        print('m2.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        pass

Output:

chenjunkandeMBP:cjkscrapy chenjunkan$ scrapy crawl chouti --nolog
md1.process_request <GET http://dig.chouti.com/>
md2.process_request <GET http://dig.chouti.com/>
m2.process_response <GET http://dig.chouti.com/> <301 http://dig.chouti.com/>
m1.process_response <GET http://dig.chouti.com/> <301 http://dig.chouti.com/>
md1.process_request <GET https://dig.chouti.com/>
md2.process_request <GET https://dig.chouti.com/>
m2.process_response <GET https://dig.chouti.com/> <200 https://dig.chouti.com/>
m1.process_response <GET https://dig.chouti.com/> <200 https://dig.chouti.com/>
response <200 https://dig.chouti.com/>

We do not have to write this ourselves, though. Scrapy already ships a middleware for it:

Path: /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/downloadermiddlewares/useragent.py

Source:

"""Set User-Agent header per spider or use a default value from settings"""

from scrapy import signals

class UserAgentMiddleware(object):
    """This middleware allows spiders to override the user_agent"""

    def __init__(self, user_agent='Scrapy'):
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        o = cls(crawler.settings['USER_AGENT'])
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        return o

    def spider_opened(self, spider):
        self.user_agent = getattr(spider, 'user_agent', self.user_agent)

    def process_request(self, request, spider):
        if self.user_agent:
            request.headers.setdefault(b'User-Agent', self.user_agent)

13. Spider middleware

Source:

class CjkscrapySpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

A custom spider middleware:

Register it in the settings file:

SPIDER_MIDDLEWARES = {
    # 'cjkscrapy.middlewares.CjkscrapySpiderMiddleware': 543,
    'cjkscrapy.sd.Sd1': 667,
    'cjkscrapy.sd.Sd2': 668,
}

sd.py:

class Sd1(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    # Runs only once, when the spider starts.
    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r


class Sd2(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    # Runs only once, when the spider starts.
    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

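We normally leave this alone and rely on the built-in spider middlewares. For a given response, all the process_spider_input methods run first, and then all the process_spider_output methods.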
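
The input-then-output order can be illustrated with chained generators. This is a rough sketch rather than Scrapy's implementation: run_spider_chain and make_output_hook are hypothetical helpers, and strings stand in for responses and items.

```python
def make_output_hook(name, log):
    """Build a process_spider_output-like generator that logs what passes through."""
    def hook(result):
        for item in result:
            log.append("%s.output saw %s" % (name, item))
            yield item
    return hook


def run_spider_chain(names, response, parse):
    log = []
    for name in names:                   # input hooks, in ascending order
        log.append("%s.input" % name)
    result = parse(response)             # the spider callback
    for name in reversed(names):         # output hooks wrap the result in reverse order
        result = make_output_hook(name, log)(result)
    return list(result), log


items, log = run_spider_chain(["sd1", "sd2"], "page", lambda r: iter(["item-a"]))
print(items)
print(log)
```

Because the output hooks are generators, they only do their work once the result is consumed, with the middleware closest to the spider seeing each item first.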

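14. Custom commands

So far every spider has been started by typing a command in the terminal, which gets tedious. We can put the command into a script in the project root instead.

start.py: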

import sys
from scrapy.cmdline import execute

if __name__ == '__main__':
    # run a single spider
    execute(["scrapy", "crawl", "chouti", "--nolog"])

Custom commands:

  • Create a directory of any name next to spiders, e.g. commands
  • Inside it, create crawlall.py (the file name becomes the name of the command):

    from scrapy.commands import ScrapyCommand


    class Command(ScrapyCommand):

        requires_project = True

        def syntax(self):
            return '[options]'

        def short_desc(self):
            return 'Runs all of the spiders'

        def run(self, args, opts):
            # spider_loader lists every spider name in the project
            spider_list = self.crawler_process.spider_loader.list()
            for name in spider_list:
                self.crawler_process.crawl(name, **opts.__dict__)
            self.crawler_process.start()

  • In settings.py add COMMANDS_MODULE = 'project_name.directory_name', e.g. COMMANDS_MODULE = "cjkscrapy.commands"
  • From the project directory run: scrapy crawlall

Running scrapy --help now lists the command we added:

chenjunkandeMBP:cjkscrapy chenjunkan$ scrapy --help
Scrapy 1.5.1 - project: cjkscrapy

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  crawlall      Runs all of the spiders
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

15. Scrapy signals

Create a new file, ext.py:

from scrapy import signals

class MyExtend(object):
    def __init__(self):
        pass

    @classmethod
    def from_crawler(cls, crawler):
        self = cls()

        crawler.signals.connect(self.x1, signal=signals.spider_opened)
        crawler.signals.connect(self.x2, signal=signals.spider_closed)

        return self

    def x1(self, spider):
        print('open')

    def x2(self, spider):
        print('close')

Register it in the settings file:

EXTENSIONS = {
    'scrapy.extensions.telnet.TelnetConsole': None,
    'cjkscrapy.ext.MyExtend': 666
}

Notes: crawler.signals.connect(self.x1, signal=signals.spider_opened) registers the method x1 as a receiver of the spider_opened signal. The available signals include:

# engine-related
engine_started = object()
engine_stopped = object()
# spider-related
spider_opened = object()
spider_idle = object()  # the spider has gone idle
spider_closed = object()
spider_error = object()
# request-related
request_scheduled = object()  # a request is placed in the scheduler
request_dropped = object()  # a request is dropped
# response-related
response_received = object()
response_downloaded = object()
# item-related
item_scraped = object()
item_dropped = object()
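
Since each signal is just a unique object(), the connect-and-send idea fits in a few lines. The sketch below is an illustration only: TinySignalManager is a hypothetical stand-in for crawler.signals, not Scrapy's signal manager.

```python
spider_opened = object()
spider_closed = object()


class TinySignalManager:
    """Hypothetical stand-in for crawler.signals, backed by a dict of lists."""

    def __init__(self):
        self._receivers = {}

    def connect(self, receiver, signal):
        # Several receivers may be registered for the same signal.
        self._receivers.setdefault(signal, []).append(receiver)

    def send(self, signal, **kwargs):
        # Call every receiver registered for this exact signal object.
        return [receiver(**kwargs) for receiver in self._receivers.get(signal, [])]


events = []
signals = TinySignalManager()
signals.connect(lambda spider: events.append("open " + spider), signal=spider_opened)
signals.connect(lambda spider: events.append("close " + spider), signal=spider_closed)

signals.send(spider_opened, spider="chouti")
signals.send(spider_closed, spider="chouti")
print(events)
```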

Output:

chenjunkandeMBP:cjkscrapy chenjunkan$ scrapy crawl chouti --nolog
open
response <200 https://dig.chouti.com/>
close

16. scrapy_redis

What is scrapy_redis for? It is a component that helps us build distributed crawlers.

Redis operations:

import redis

conn = redis.Redis(host='140.143.227.206', port=8888, password='beta')
keys = conn.keys()
print(keys)
print(conn.smembers('dupefilter:xiaodongbei'))
# print(conn.smembers('visited_urls'))
# v1 = conn.sadd('urls','http://www.baidu.com')
# v2 = conn.sadd('urls','http://www.cnblogs.com')
# print(v1)
# print(v2)
# v3 = conn.sadd('urls','http://www.bing.com')
# print(v3)

# result = conn.sadd('urls','http://www.bing.com')
# if result == 1:
#     print('not visited before')
# else:
#     print('visited before')
# print(conn.smembers('urls'))

(1). Deduplication with scrapy_redis

Example code:

Approach 1: fully custom

A custom dedup rule, written in any file, e.g. xxx.py

from scrapy.dupefilters import BaseDupeFilter
import redis
from scrapy.utils.request import request_fingerprint


class DupFilter(BaseDupeFilter):
    def __init__(self):
        self.conn = redis.Redis(host='140.143.227.206', port=8888, password='beta')

    def request_seen(self, request):
        """
        Check whether this request has already been visited.
        :param request:
        :return: True if it was already visited, False if not
        """
        fid = request_fingerprint(request)
        # sadd returns 1 when the member is new and 0 when it already exists
        result = self.conn.sadd('visited_urls', fid)
        if result == 1:
            return False
        return True
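
The dedup logic depends only on what sadd returns, so it can be tried without a Redis server. The sketch below makes loud assumptions: FakeRedis is an in-memory stand-in that implements sadd only, and simple_fingerprint hashes just the method and URL, which is far simpler than Scrapy's request_fingerprint.

```python
import hashlib


class FakeRedis:
    """In-memory stand-in: sadd returns 1 for a new member, 0 for an existing one."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, member):
        members = self._sets.setdefault(key, set())
        if member in members:
            return 0
        members.add(member)
        return 1


def simple_fingerprint(method, url):
    # Simplified assumption: only the method and URL take part in the hash.
    return hashlib.sha1(("%s %s" % (method.upper(), url)).encode("utf-8")).hexdigest()


class SketchDupFilter:
    def __init__(self, conn, key="visited_urls"):
        self.conn = conn
        self.key = key

    def request_seen(self, method, url):
        # Nothing was added (0) means the fingerprint was already in the set.
        return self.conn.sadd(self.key, simple_fingerprint(method, url)) == 0


dedup = SketchDupFilter(FakeRedis())
first = dedup.request_seen("GET", "https://dig.chouti.com/")
second = dedup.request_seen("get", "https://dig.chouti.com/")
other = dedup.request_seen("GET", "https://dig.chouti.com/r/tec/hot/1")
print(first, second, other)
```

Swapping FakeRedis for a real connection keeps the same return-value contract, which is what lets several crawler processes share one set of fingerprints.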

In settings.py:

DUPEFILTER_CLASS = 'dbd.xxx.DupFilter'

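Approach 2: customizing on top of scrapy_redis

We can work this out from the source (reached via import scrapy_redis):

Path: /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy_redis/dupefilter.py

The RFPDupeFilter class: it also inherits from BaseDupeFilter, much like our approach 1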

import logging
import time

from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint

from . import defaults
from .connection import get_redis_from_settings


logger = logging.getLogger(__name__)


# TODO: Rename class to RedisDupeFilter.
class RFPDupeFilter(BaseDupeFilter):
    """Redis-based request duplicates filter.

    This class can also be used with default Scrapy's scheduler.

    """

    logger = logger

    def __init__(self, server, key, debug=False):
        """Initialize the duplicates filter.

        Parameters
        ----------
        server : redis.StrictRedis
            The redis server instance.
        key : str
            Redis key Where to store fingerprints.
        debug : bool, optional
            Whether to log filtered requests.

        """
        self.server = server
        self.key = key
        self.debug = debug
        self.logdupes = True

    @classmethod
    def from_settings(cls, settings):
        """Returns an instance from given settings.

        This uses by default the key ``dupefilter:<timestamp>``. When using the
        ``scrapy_redis.scheduler.Scheduler`` class, this method is not used as
        it needs to pass the spider name in the key.

        Parameters
        ----------
        settings : scrapy.settings.Settings

        Returns
        -------
        RFPDupeFilter
            A RFPDupeFilter instance.


        """
        server = get_redis_from_settings(settings)
        # XXX: This creates one-time key. needed to support to use this
        # class as standalone dupefilter with scrapy's default scheduler
        # if scrapy passes spider on open() method this wouldn't be needed
        # TODO: Use SCRAPY_JOB env as default and fallback to timestamp.
        key = defaults.DUPEFILTER_KEY % {'timestamp': int(time.time())}
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(server, key=key, debug=debug)

    @classmethod
    def from_crawler(cls, crawler):
        """Returns instance from crawler.

        Parameters
        ----------
        crawler : scrapy.crawler.Crawler

        Returns
        -------
        RFPDupeFilter
            Instance of RFPDupeFilter.

        """
        return cls.from_settings(crawler.settings)

    def request_seen(self, request):
        """Returns True if request was already seen.

        Parameters
        ----------
        request : scrapy.http.Request

        Returns
        -------
        bool

        """
        fp = self.request_fingerprint(request)
        # This returns the number of values added, zero if already exists.
        added = self.server.sadd(self.key, fp)
        return added == 0

    def request_fingerprint(self, request):
        """Returns a fingerprint for a given request.

        Parameters
        ----------
        request : scrapy.http.Request

        Returns
        -------
        str

        """
        return request_fingerprint(request)

    def close(self, reason=''):
        """Delete data on close. Called by Scrapy's scheduler.

        Parameters
        ----------
        reason : str, optional

        """
        self.clear()

    def clear(self):
        """Clears fingerprints data."""
        self.server.delete(self.key)

    def log(self, request, spider):
        """Logs given request.

        Parameters
        ----------
        request : scrapy.http.Request
        spider : scrapy.spiders.Spider

        """
        if self.debug:
            msg = "Filtered duplicate request: %(request)s"
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
        elif self.logdupes:
            msg = ("Filtered duplicate request %(request)s"
                   " - no more duplicates will be shown"
                   " (see DUPEFILTER_DEBUG to show all duplicates)")
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
            self.logdupes = False

This class also deduplicates requests through request_seen:

    def request_seen(self, request):
        """Returns True if request was already seen.

        Parameters
        ----------
        request : scrapy.http.Request

        Returns
        -------
        bool

        """
        fp = self.request_fingerprint(request)
        # This returns the number of values added, zero if already exists.
        added = self.server.sadd(self.key, fp)
        return added == 0

Notes:

1. fp = self.request_fingerprint(request): builds a unique fingerprint for the request.

2. added = self.server.sadd(self.key, fp): adds the fingerprint to a Redis set.

self.server = server is the Redis connection. It is created through the following chain of calls:

(1) from_settings calls get_redis_from_settings:

    @classmethod
    def from_settings(cls, settings):
        """Returns an instance from given settings.

        This uses by default the key ``dupefilter:<timestamp>``. When using the
        ``scrapy_redis.scheduler.Scheduler`` class, this method is not used as
        it needs to pass the spider name in the key.

        Parameters
        ----------
        settings : scrapy.settings.Settings

        Returns
        -------
        RFPDupeFilter
            A RFPDupeFilter instance.


        """
        server = get_redis_from_settings(settings)
        # XXX: This creates one-time key. needed to support to use this
        # class as standalone dupefilter with scrapy's default scheduler
        # if scrapy passes spider on open() method this wouldn't be needed
        # TODO: Use SCRAPY_JOB env as default and fallback to timestamp.
        key = defaults.DUPEFILTER_KEY % {'timestamp': int(time.time())}
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(server, key=key, debug=debug)

(2) get_redis_from_settings collects the connection parameters and calls get_redis:

def get_redis_from_settings(settings):
    """Returns a redis client instance from given Scrapy settings object.

    This function uses ``get_client`` to instantiate the client and uses
    ``defaults.REDIS_PARAMS`` global as defaults values for the parameters. You
    can override them using the ``REDIS_PARAMS`` setting.

    Parameters
    ----------
    settings : Settings
        A scrapy settings object. See the supported settings below.

    Returns
    -------
    server
        Redis client instance.

    Other Parameters
    ----------------
    REDIS_URL : str, optional
        Server connection URL.
    REDIS_HOST : str, optional
        Server host.
    REDIS_PORT : str, optional
        Server port.
    REDIS_ENCODING : str, optional
        Data encoding.
    REDIS_PARAMS : dict, optional
        Additional client parameters.

    """
    params = defaults.REDIS_PARAMS.copy()
    params.update(settings.getdict('REDIS_PARAMS'))
    # XXX: Deprecate REDIS_* settings.
    for source, dest in SETTINGS_PARAMS_MAP.items():
        val = settings.get(source)
        if val:
            params[dest] = val

    # Allow ``redis_cls`` to be a path to a class.
    if isinstance(params.get('redis_cls'), six.string_types):
        params['redis_cls'] = load_object(params['redis_cls'])

    return get_redis(**params)

(3) get_redis instantiates the client class:

def get_redis(**kwargs):
    """Returns a redis client instance.

    Parameters
    ----------
    redis_cls : class, optional
        Defaults to ``redis.StrictRedis``.
    url : str, optional
        If given, ``redis_cls.from_url`` is used to instantiate the class.
    **kwargs
        Extra parameters to be passed to the ``redis_cls`` class.

    Returns
    -------
    server
        Redis client instance.

    """
    redis_cls = kwargs.pop('redis_cls', defaults.REDIS_CLS)
    url = kwargs.pop('url', None)
    if url:
        return redis_cls.from_url(url, **kwargs)
    else:
        return redis_cls(**kwargs)

(4) In redis_cls = kwargs.pop('redis_cls', defaults.REDIS_CLS), the default is REDIS_CLS = redis.StrictRedis.

The settings file therefore needs the following entries:

# ############### scrapy redis connection ####################

REDIS_HOST = '140.143.227.206'  # host name
REDIS_PORT = 8888  # port
REDIS_PARAMS = {'password': 'beta'}  # Redis connection parameters. Default: REDIS_PARAMS = {'socket_timeout': 30, 'socket_connect_timeout': 30, 'retry_on_timeout': True, 'encoding': REDIS_ENCODING}
REDIS_ENCODING = "utf-8"  # Redis encoding. Default: 'utf-8'

# REDIS_URL = 'redis://user:pass@hostname:9001'  # connection URL (takes precedence over the settings above)
DUPEFILTER_KEY = 'dupefilter:%(timestamp)s'

self.key = key, where key = defaults.DUPEFILTER_KEY % {'timestamp': int(time.time())}

3. return added == 0: sadd returns 0 when the fingerprint is already in the set, meaning the request has been visited, so request_seen returns True. If sadd returns 1, the request has not been visited yet, so request_seen returns False.
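The add-and-check logic of request_seen can be tried without a Redis server. The following is a minimal sketch, not the scrapy_redis implementation: FakeRedis is a hypothetical in-memory stand-in whose sadd mimics the return value of Redis SADD for a single member, and the fingerprint is a simplified SHA-1 of method and URL rather than Scrapy's real request_fingerprint.

```python
import hashlib


class FakeRedis:
    """Hypothetical in-memory stand-in for a Redis client (illustration only)."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, value):
        # Like SADD with one member: 1 if the value was added, 0 if it already existed.
        members = self._sets.setdefault(key, set())
        if value in members:
            return 0
        members.add(value)
        return 1


class SimpleDupeFilter:
    """Simplified dupefilter that follows the same add-and-check pattern."""

    def __init__(self, server, key):
        self.server = server
        self.key = key

    def request_fingerprint(self, method, url):
        # Assumption: fingerprint built from method and URL only.
        return hashlib.sha1(f"{method} {url}".encode("utf-8")).hexdigest()

    def request_seen(self, method, url):
        fp = self.request_fingerprint(method, url)
        added = self.server.sadd(self.key, fp)
        return added == 0


df = SimpleDupeFilter(FakeRedis(), "dupefilter:demo")
first = df.request_seen("GET", "https://dig.chouti.com/")
second = df.request_seen("GET", "https://dig.chouti.com/")
print(first, second)
```

The first call adds the fingerprint and reports the request as unseen; the second call finds it already in the set and reports it as seen.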

Code example:

from scrapy_redis.dupefilter import RFPDupeFilter
from scrapy_redis.connection import get_redis_from_settings
from scrapy_redis import defaults

class RedisDupeFilter(RFPDupeFilter):
    @classmethod
    def from_settings(cls, settings):
        """Returns an instance from given settings.

        This uses by default the key ``dupefilter:<timestamp>``. When using the
        ``scrapy_redis.scheduler.Scheduler`` class, this method is not used as
        it needs to pass the spider name in the key.

        Parameters
        ----------
        settings : scrapy.settings.Settings

        Returns
        -------
        RFPDupeFilter
            A RFPDupeFilter instance.

        """
        server = get_redis_from_settings(settings)
        # XXX: This creates one-time key. needed to support to use this
        # class as standalone dupefilter with scrapy's default scheduler
        # if scrapy passes spider on open() method this wouldn't be needed
        # TODO: Use SCRAPY_JOB env as default and fallback to timestamp.
        key = defaults.DUPEFILTER_KEY % {'timestamp': 'chenjunkan'}
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(server, key=key, debug=debug)

By default the key is built from a timestamp, but we can hard-code it instead: key = defaults.DUPEFILTER_KEY % {'timestamp': 'chenjunkan'}
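To make the difference concrete, here is a small sketch of how the key template expands. It only uses Python string formatting and assumes the template value 'dupefilter:%(timestamp)s' shown in the settings of this article.

```python
import time

# Key template as configured in the settings in this article.
DUPEFILTER_KEY = "dupefilter:%(timestamp)s"

# Default behaviour: a new key on every run, so earlier fingerprints are not reused.
default_key = DUPEFILTER_KEY % {"timestamp": int(time.time())}

# Hard-coded value: the same key on every run, so fingerprints are shared between runs.
fixed_key = DUPEFILTER_KEY % {"timestamp": "chenjunkan"}

print(default_key)
print(fixed_key)
```

With the fixed key, every run reads and writes the same Redis set, which is what allows deduplication to carry over between runs.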

Finally, in the settings file:

# ############### scrapy redis connection ####################

REDIS_HOST = '140.143.227.206'  # host name
REDIS_PORT = 8888  # port
REDIS_PARAMS = {'password': 'beta'}  # Redis connection parameters. Default: REDIS_PARAMS = {'socket_timeout': 30, 'socket_connect_timeout': 30, 'retry_on_timeout': True, 'encoding': REDIS_ENCODING}
REDIS_ENCODING = "utf-8"  # Redis encoding. Default: 'utf-8'

# REDIS_URL = 'redis://user:pass@hostname:9001'  # connection URL (takes precedence over the settings above)
DUPEFILTER_KEY = 'dupefilter:%(timestamp)s'

# DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
DUPEFILTER_CLASS = 'cjkscrapy.xxx.RedisDupeFilter'

(2). scrapy_redis queues

Implementing a queue and a stack with Redis:

To read the source, first import scrapy_redis

Path: /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy_redis/queue.py

queue.py:

from scrapy.utils.reqser import request_to_dict, request_from_dict

from . import picklecompat


class Base(object):
    """Per-spider base queue class"""

    def __init__(self, server, spider, key, serializer=None):
        """Initialize per-spider redis queue.

        Parameters
        ----------
        server : StrictRedis
            Redis client instance.
        spider : Spider
            Scrapy spider instance.
        key: str
            Redis key where to put and get messages.
        serializer : object
            Serializer object with ``loads`` and ``dumps`` methods.

        """
        if serializer is None:
            # Backward compatibility.
            # TODO: deprecate pickle.
            serializer = picklecompat
        if not hasattr(serializer, 'loads'):
            raise TypeError("serializer does not implement 'loads' function: %r"
                            % serializer)
        if not hasattr(serializer, 'dumps'):
            raise TypeError("serializer '%s' does not implement 'dumps' function: %r"
                            % serializer)

        self.server = server
        self.spider = spider
        self.key = key % {'spider': spider.name}
        self.serializer = serializer

    def _encode_request(self, request):
        """Encode a request object"""
        obj = request_to_dict(request, self.spider)
        return self.serializer.dumps(obj)

    def _decode_request(self, encoded_request):
        """Decode an request previously encoded"""
        obj = self.serializer.loads(encoded_request)
        return request_from_dict(obj, self.spider)

    def __len__(self):
        """Return the length of the queue"""
        raise NotImplementedError

    def push(self, request):
        """Push a request"""
        raise NotImplementedError

    def pop(self, timeout=0):
        """Pop a request"""
        raise NotImplementedError

    def clear(self):
        """Clear queue/stack"""
        self.server.delete(self.key)


class FifoQueue(Base):
    """Per-spider FIFO queue"""

    def __len__(self):
        """Return the length of the queue"""
        return self.server.llen(self.key)

    def push(self, request):
        """Push a request"""
        self.server.lpush(self.key, self._encode_request(request))

    def pop(self, timeout=0):
        """Pop a request"""
        if timeout > 0:
            data = self.server.brpop(self.key, timeout)
            if isinstance(data, tuple):
                data = data[1]
        else:
            data = self.server.rpop(self.key)
        if data:
            return self._decode_request(data)


class PriorityQueue(Base):
    """Per-spider priority queue abstraction using redis' sorted set"""

    def __len__(self):
        """Return the length of the queue"""
        return self.server.zcard(self.key)

    def push(self, request):
        """Push a request"""
        data = self._encode_request(request)
        score = -request.priority
        # We don't use zadd method as the order of arguments change depending on
        # whether the class is Redis or StrictRedis, and the option of using
        # kwargs only accepts strings, not bytes.
        self.server.execute_command('ZADD', self.key, score, data)

    def pop(self, timeout=0):
        """
        Pop a request
        timeout not support in this queue class
        """
        # use atomic range/remove using multi/exec
        pipe = self.server.pipeline()
        pipe.multi()
        pipe.zrange(self.key, 0, 0).zremrangebyrank(self.key, 0, 0)
        results, count = pipe.execute()
        if results:
            return self._decode_request(results[0])


class LifoQueue(Base):
    """Per-spider LIFO queue."""

    def __len__(self):
        """Return the length of the stack"""
        return self.server.llen(self.key)

    def push(self, request):
        """Push a request"""
        self.server.lpush(self.key, self._encode_request(request))

    def pop(self, timeout=0):
        """Pop a request"""
        if timeout > 0:
            data = self.server.blpop(self.key, timeout)
            if isinstance(data, tuple):
                data = data[1]
        else:
            data = self.server.lpop(self.key)

        if data:
            return self._decode_request(data)


# TODO: Deprecate the use of these names.
SpiderQueue = FifoQueue
SpiderStack = LifoQueue
SpiderPriorityQueue = PriorityQueue

Queue: first in, first out (FIFO)

Code example:

import redis


class FifoQueue(object):
    def __init__(self):
        self.server = redis.Redis(host='140.143.227.206', port=8888, password='beta')

    def push(self, request):
        """Push a request"""
        self.server.lpush('USERS', request)

    def pop(self, timeout=0):
        """Pop a request"""
        data = self.server.rpop('USERS')
        return data


# List contents after the three pushes: [33, 22, 11]
q = FifoQueue()
q.push(11)
q.push(22)
q.push(33)

print(q.pop())
print(q.pop())
print(q.pop())

Note: lpush adds on the left and rpop takes from the right, so items leave in the order they arrived (first in, first out). For a crawler this gives breadth-first traversal.

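Stack: last in, first out (LIFO)

import redis


class LifoQueue(object):
    """Per-spider LIFO queue."""

    def __init__(self):
        self.server = redis.Redis(host='140.143.227.206', port=8888, password='beta')

    def push(self, request):
        """Push a request"""
        self.server.lpush("USERS", request)

    def pop(self, timeout=0):
        """Pop a request"""
        data = self.server.lpop('USERS')
        return data  # return the popped value (the bare "return" always yielded None)


# List contents after pushing 11, 22, 33: [33, 22, 11]

Note: lpush adds on the left and lpop also takes from the left, so the most recently added item leaves first (last in, first out). For a crawler this gives depth-first traversal.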
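The difference between the two pop directions can be checked without a Redis server. The sketch below models a single Redis list with collections.deque; ListStore and its method names are hypothetical stand-ins that only mirror the semantics of LPUSH, RPOP and LPOP.

```python
from collections import deque


class ListStore:
    """Hypothetical in-memory model of one Redis list (illustration only)."""

    def __init__(self):
        self.items = deque()

    def lpush(self, value):
        # LPUSH: insert at the left end.
        self.items.appendleft(value)

    def rpop(self):
        # RPOP: remove from the right end.
        return self.items.pop() if self.items else None

    def lpop(self):
        # LPOP: remove from the left end.
        return self.items.popleft() if self.items else None


fifo = ListStore()
for value in (11, 22, 33):
    fifo.lpush(value)
fifo_order = [fifo.rpop() for _ in range(3)]  # lpush + rpop

lifo = ListStore()
for value in (11, 22, 33):
    lifo.lpush(value)
lifo_order = [lifo.lpop() for _ in range(3)]  # lpush + lpop

print(fifo_order)
print(lifo_order)
```

Pushing on the left and popping on the right returns items in arrival order, while popping on the same side as the push returns the newest item first.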

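The zadd and zrange functions:

import redis

conn = redis.Redis(host='140.143.227.206', port=8888, password='beta')
# Keyword form as used with redis-py 2.x; redis-py 3.0+ expects a mapping instead:
# conn.zadd('score', {'cjk': 79, 'pyy': 33, 'cc': 73})
conn.zadd('score', cjk=79, pyy=33, cc=73)

print(conn.keys())

v = conn.zrange('score', 0, 8, desc=True)
print(v)

pipe = conn.pipeline()  # batch several commands and execute them in one go
pipe.multi()
# zrange sorts ascending by default and here returns only the first member;
# zremrangebyrank then removes that first member by rank
pipe.zrange("score", 0, 0).zremrangebyrank('score', 0, 0)
results, count = pipe.execute()
print(results, count)

zadd: stores the three members cjk=79, pyy=33 and cc=73 in a Redis sorted set named score.

zrange: returns the members of score within an index range, ordered by score from low to high by default. In v = conn.zrange('score', 0, 8, desc=True), setting desc=True returns them ordered by score from high to low.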

Priority queue:

import redis


class PriorityQueue(object):
    """Per-spider priority queue abstraction using redis' sorted set"""

    def __init__(self):
        self.server = redis.Redis(host='140.143.227.206', port=8888, password='beta')

    def push(self, request, score):
        """Push a request"""
        # data = self._encode_request(request)
        # score = -request.priority
        # We don't use zadd method as the order of arguments change depending on
        # whether the class is Redis or StrictRedis, and the option of using
        # kwargs only accepts strings, not bytes.
        self.server.execute_command('ZADD', 'xxxxxx', score, request)

    def pop(self, timeout=0):
        """
        Pop a request
        timeout not support in this queue class
        """
        # use atomic range/remove using multi/exec
        pipe = self.server.pipeline()
        pipe.multi()
        pipe.zrange('xxxxxx', 0, 0).zremrangebyrank('xxxxxx', 0, 0)
        results, count = pipe.execute()
        if results:
            return results[0]


q = PriorityQueue()

q.push('alex', 99)
q.push('oldboy', 56)
q.push('eric', 77)

v1 = q.pop()
print(v1)
v2 = q.pop()
print(v2)
v3 = q.pop()
print(v3)

Note: when two members have the same score, they are ordered by name (lexicographically). The pop above always takes the member with the lowest score. Serving requests from the lowest score to the highest gives breadth-first crawling; serving them from the highest to the lowest gives depth-first crawling.
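The ordering rule can be reproduced in memory. The following sketch is not how Redis stores sorted sets; SortedSetQueue is a hypothetical class that only imitates the pop order (lowest score first, ties broken by member name), and the member 'bob' is added purely to show the tie-break.

```python
class SortedSetQueue:
    """Hypothetical in-memory imitation of the sorted-set pop order."""

    def __init__(self):
        self._scores = {}

    def push(self, member, score):
        # Like ZADD: adding an existing member overwrites its score.
        self._scores[member] = score

    def pop(self):
        if not self._scores:
            return None
        # Lowest score first; equal scores are ordered by member name.
        member = min(self._scores, key=lambda m: (self._scores[m], m))
        del self._scores[member]
        return member


q = SortedSetQueue()
q.push("alex", 99)
q.push("oldboy", 56)
q.push("eric", 77)
q.push("bob", 56)  # same score as "oldboy", added to show the tie-break

order = [q.pop() for _ in range(4)]
print(order)
```

The two members with score 56 come out first, in name order, followed by the members with scores 77 and 99.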

(3). The scrapy_redis scheduler

Scheduler settings:

# ###################### Scheduler ######################
from scrapy_redis.scheduler import Scheduler
# Scheduling is handled by the scrapy_redis scheduler
# enqueue_request: add a task to the scheduler
# next_request: fetch a task from the scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Order in which tasks are stored
# Priority
DEPTH_PRIORITY = 1  # breadth-first
# DEPTH_PRIORITY = -1 # depth-first
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'  # the priority queue is the default; options: PriorityQueue (sorted set), FifoQueue (list), LifoQueue (list)

# Breadth-first
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
# Depth-first
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'

"""
redis = {
    chouti:requests:[
        pickle.dumps(Request(url='Http://wwwww',callback=self.parse)),
        pickle.dumps(Request(url='Http://wwwww',callback=self.parse)),
        pickle.dumps(Request(url='Http://wwwww',callback=self.parse)),
    ],
    cnblogs:requests:[

    ]
}
"""
SCHEDULER_QUEUE_KEY = '%(spider)s:requests'  # Redis key under which the scheduler stores requests

SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"  # serializer for data saved to Redis; pickle by default

SCHEDULER_PERSIST = False  # keep the queue and dedup records on close? True = keep, False = clear
SCHEDULER_FLUSH_ON_START = True  # clear the queue and dedup records on start? True = clear, False = keep
# SCHEDULER_IDLE_BEFORE_CLOSE = 10  # maximum time to wait when the queue is empty (nothing is returned if no data arrives)

SCHEDULER_DUPEFILTER_KEY = '%(spider)s:dupefilter'  # Redis key under which dedup records are stored
SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'  # class that implements the dedup rule

Notes:

a. from scrapy_redis.scheduler import Scheduler: scheduling is handled by the scrapy_redis scheduler

import importlib
import six

from scrapy.utils.misc import load_object

from . import connection, defaults


# TODO: add SCRAPY_JOB support.
class Scheduler(object):
    """Redis-based scheduler

    Settings
    --------
    SCHEDULER_PERSIST : bool (default: False)
        Whether to persist or clear redis queue.
    SCHEDULER_FLUSH_ON_START : bool (default: False)
        Whether to flush redis queue on start.
    SCHEDULER_IDLE_BEFORE_CLOSE : int (default: 0)
        How many seconds to wait before closing if no message is received.
    SCHEDULER_QUEUE_KEY : str
        Scheduler redis key.
    SCHEDULER_QUEUE_CLASS : str
        Scheduler queue class.
    SCHEDULER_DUPEFILTER_KEY : str
        Scheduler dupefilter redis key.
    SCHEDULER_DUPEFILTER_CLASS : str
        Scheduler dupefilter class.
    SCHEDULER_SERIALIZER : str
        Scheduler serializer.

    """

    def __init__(self, server,
                 persist=False,
                 flush_on_start=False,
                 queue_key=defaults.SCHEDULER_QUEUE_KEY,
                 queue_cls=defaults.SCHEDULER_QUEUE_CLASS,
                 dupefilter_key=defaults.SCHEDULER_DUPEFILTER_KEY,
                 dupefilter_cls=defaults.SCHEDULER_DUPEFILTER_CLASS,
                 idle_before_close=0,
                 serializer=None):
        """Initialize scheduler.

        Parameters
        ----------
        server : Redis
            The redis server instance.
        persist : bool
            Whether to flush requests when closing. Default is False.
        flush_on_start : bool
            Whether to flush requests on start. Default is False.
        queue_key : str
            Requests queue key.
        queue_cls : str
            Importable path to the queue class.
        dupefilter_key : str
            Duplicates filter key.
        dupefilter_cls : str
            Importable path to the dupefilter class.
        idle_before_close : int
            Timeout before giving up.

        """
        if idle_before_close < 0:
            raise TypeError("idle_before_close cannot be negative")

        self.server = server
        self.persist = persist
        self.flush_on_start = flush_on_start
        self.queue_key = queue_key
        self.queue_cls = queue_cls
        self.dupefilter_cls = dupefilter_cls
        self.dupefilter_key = dupefilter_key
        self.idle_before_close = idle_before_close
        self.serializer = serializer
        self.stats = None

    def __len__(self):
        return len(self.queue)

    @classmethod
    def from_settings(cls, settings):
        kwargs = {
            'persist': settings.getbool('SCHEDULER_PERSIST'),
            'flush_on_start': settings.getbool('SCHEDULER_FLUSH_ON_START'),
            'idle_before_close': settings.getint('SCHEDULER_IDLE_BEFORE_CLOSE'),
        }

        # If these values are missing, it means we want to use the defaults.
        optional = {
            # TODO: Use custom prefixes for this settings to note that are
            # specific to scrapy-redis.
            'queue_key': 'SCHEDULER_QUEUE_KEY',
            'queue_cls': 'SCHEDULER_QUEUE_CLASS',
            'dupefilter_key': 'SCHEDULER_DUPEFILTER_KEY',
            # We use the default setting name to keep compatibility.
            'dupefilter_cls': 'DUPEFILTER_CLASS',
            'serializer': 'SCHEDULER_SERIALIZER',
        }
        for name, setting_name in optional.items():
            val = settings.get(setting_name)
            if val:
                kwargs[name] = val

        # Support serializer as a path to a module.
        if isinstance(kwargs.get('serializer'), six.string_types):
            kwargs['serializer'] = importlib.import_module(kwargs['serializer'])

        server = connection.from_settings(settings)
        # Ensure the connection is working.
        server.ping()

        return cls(server=server, **kwargs)

    @classmethod
    def from_crawler(cls, crawler):
        instance = cls.from_settings(crawler.settings)
        # FIXME: for now, stats are only supported from this constructor
        instance.stats = crawler.stats
        return instance

    def open(self, spider):
        self.spider = spider

        try:
            self.queue = load_object(self.queue_cls)(
                server=self.server,
                spider=spider,
                key=self.queue_key % {'spider': spider.name},
                serializer=self.serializer,
            )
        except TypeError as e:
            raise ValueError("Failed to instantiate queue class '%s': %s",
                             self.queue_cls, e)

        try:
            self.df = load_object(self.dupefilter_cls)(
                server=self.server,
                key=self.dupefilter_key % {'spider': spider.name},
                debug=spider.settings.getbool('DUPEFILTER_DEBUG'),
            )
        except TypeError as e:
            raise ValueError("Failed to instantiate dupefilter class '%s': %s",
                             self.dupefilter_cls, e)

        if self.flush_on_start:
            self.flush()
        # notice if there are requests already in the queue to resume the crawl
        if len(self.queue):
            spider.log("Resuming crawl (%d requests scheduled)" % len(self.queue))

    def close(self, reason):
        if not self.persist:
            self.flush()

    def flush(self):
        self.df.clear()
        self.queue.clear()

    def enqueue_request(self, request):
        if not request.dont_filter and self.df.request_seen(request):
            self.df.log(request, self.spider)
            return False
        if self.stats:
            self.stats.inc_value('scheduler/enqueued/redis', spider=self.spider)
        self.queue.push(request)
        return True

    def next_request(self):
        block_pop_timeout = self.idle_before_close
        request = self.queue.pop(block_pop_timeout)
        if request and self.stats:
            self.stats.inc_value('scheduler/dequeued/redis', spider=self.spider)
        return request

    def has_pending_requests(self):
        return len(self) > 0

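The two most important methods in this class are enqueue_request and next_request:

While the crawler is running there is only one scheduler. Adding a task means calling enqueue_request, which puts a request into the queue. When the downloader comes to fetch a task, next_request is called. Both methods therefore go through the queue.

b. SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue': the priority queue is used by default

The queue is essentially a structure stored in Redis (a list, or a sorted set in the case of PriorityQueue), and it needs a key: SCHEDULER_QUEUE_KEY = '%(spider)s:requests', where %(spider)s is replaced with the name of the current spider. Because Redis can only store strings, requests have to be serialized before they are saved. This is done by SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat", which uses pickle by default: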

"""A pickle wrapper module with protocol=-1 by default."""

try:
    import cPickle as pickle  # PY2
except ImportError:
    import pickle


def loads(s):
    return pickle.loads(s)


def dumps(obj):
    return pickle.dumps(obj, protocol=-1)
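As a quick illustration of what this wrapper does, the sketch below round-trips a plain dictionary through the same two calls. The dictionary is a hypothetical, simplified stand-in for the structure that request_to_dict produces; the real one contains more fields.

```python
import pickle


def dumps(obj):
    # Same call as in picklecompat: protocol=-1 selects the highest available protocol.
    return pickle.dumps(obj, protocol=-1)


def loads(s):
    return pickle.loads(s)


# Hypothetical, simplified request dictionary (illustration only).
request_dict = {"url": "https://dig.chouti.com/", "method": "GET", "priority": 0}

encoded = dumps(request_dict)   # bytes that can be stored in Redis
decoded = loads(encoded)        # back to an equal dictionary

print(type(encoded).__name__)
print(decoded == request_dict)
```

The encoded value is a bytes object, which is what gets pushed into the Redis list or sorted set.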

c. SCHEDULER_IDLE_BEFORE_CLOSE = 10: the maximum time to wait when fetching from the scheduler and the queue is empty (if no data arrives in that time, nothing is returned). In other words, tasks are fetched with a blocking pop.

import redis

conn = redis.Redis(host='140.143.227.206', port=8888, password='beta')

# conn.flushall()
print(conn.keys())
# chouti:dupefilter/chouti:request

# conn.lpush('xxx:request', 'http://wwww.xxx.com')
# conn.lpush('xxx:request', 'http://wwww.xxx1.com')

# print(conn.lpop('xxx:request'))
# print(conn.blpop('xxx:request', timeout=10))

The overall execution flow:

1. Run scrapy crawl chouti --nolog

2. Scrapy finds the SCHEDULER = "scrapy_redis.scheduler.Scheduler" setting and instantiates the scheduler object
    - Scheduler.from_crawler is executed

    @classmethod
    def from_crawler(cls, crawler):
        instance = cls.from_settings(crawler.settings)
        # FIXME: for now, stats are only supported from this constructor
        instance.stats = crawler.stats
        return instance

    - Scheduler.from_settings is executed

    @classmethod
    def from_settings(cls, settings):
        kwargs = {
            'persist': settings.getbool('SCHEDULER_PERSIST'),
            'flush_on_start': settings.getbool('SCHEDULER_FLUSH_ON_START'),
            'idle_before_close': settings.getint('SCHEDULER_IDLE_BEFORE_CLOSE'),
        }

        # If these values are missing, it means we want to use the defaults.
        optional = {
            # TODO: Use custom prefixes for this settings to note that are
            # specific to scrapy-redis.
            'queue_key': 'SCHEDULER_QUEUE_KEY',
            'queue_cls': 'SCHEDULER_QUEUE_CLASS',
            'dupefilter_key': 'SCHEDULER_DUPEFILTER_KEY',
            # We use the default setting name to keep compatibility.
            'dupefilter_cls': 'DUPEFILTER_CLASS',
            'serializer': 'SCHEDULER_SERIALIZER',
        }
        for name, setting_name in optional.items():
            val = settings.get(setting_name)
            if val:
                kwargs[name] = val

        # Support serializer as a path to a module.
        if isinstance(kwargs.get('serializer'), six.string_types):
            kwargs['serializer'] = importlib.import_module(kwargs['serializer'])

        server = connection.from_settings(settings)
        # Ensure the connection is working.
        server.ping()

        return cls(server=server, **kwargs)

        - Settings read:
            SCHEDULER_PERSIST             # keep the queue and dedup records on close? True = keep, False = clear
            SCHEDULER_FLUSH_ON_START      # clear the queue and dedup records on start? True = clear, False = keep
            SCHEDULER_IDLE_BEFORE_CLOSE   # maximum time to wait when the queue is empty (nothing is returned if no data arrives)
        - Settings read:
            SCHEDULER_QUEUE_KEY           # %(spider)s:requests
            SCHEDULER_QUEUE_CLASS         # scrapy_redis.queue.FifoQueue
            SCHEDULER_DUPEFILTER_KEY      # '%(spider)s:dupefilter'
            DUPEFILTER_CLASS              # 'scrapy_redis.dupefilter.RFPDupeFilter'
            SCHEDULER_SERIALIZER          # "scrapy_redis.picklecompat"

        - Settings read:
            REDIS_HOST = '140.143.227.206'  # host name
            REDIS_PORT = 8888  # port
            REDIS_PARAMS = {'password': 'beta'}  # Redis connection parameters. Default: REDIS_PARAMS = {'socket_timeout': 30, 'socket_connect_timeout': 30, 'retry_on_timeout': True, 'encoding': REDIS_ENCODING}
            REDIS_ENCODING = "utf-8"
    - The Scheduler object is instantiated

3. The spider starts processing the start URLs
    - scheduler.enqueue_request() is called
        def enqueue_request(self, request):
            # Does the request need to be filtered?
            # Is it already in the dedup records? (If it has not been visited, it is added to the records.)
            if not request.dont_filter and self.df.request_seen(request):
                self.df.log(request, self.spider)
                # Already visited, so do not visit it again
                return False

            if self.stats:
                self.stats.inc_value('scheduler/enqueued/redis', spider=self.spider)
            # print('Not visited yet, adding to the scheduler', request)
            self.queue.push(request)
            return True

4. The downloader fetches tasks from the scheduler and downloads them

    - scheduler.next_request() is called
        def next_request(self):
            block_pop_timeout = self.idle_before_close
            request = self.queue.pop(block_pop_timeout)
            if request and self.stats:
                self.stats.inc_value('scheduler/dequeued/redis', spider=self.spider)
            return request

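A second way to choose between depth-first and breadth-first:

# Order in which tasks are stored
# Priority
DEPTH_PRIORITY = 1  # breadth-first
# DEPTH_PRIORITY = -1 # depth-first

Note: while crawling, Scrapy adjusts each request's priority by its depth: request.priority -= depth * self.prio. With DEPTH_PRIORITY = 1 the priority gets lower the deeper the request is. scrapy_redis then computes score = -request.priority, so the score gets larger with depth. Because the priority queue pops the lowest score first, shallow requests are served before deep ones, which is breadth-first. With DEPTH_PRIORITY = -1 the effect is reversed, which is depth-first.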
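The arithmetic can be worked through with a few numbers. The sketch below simply applies the two formulas quoted above; redis_score is a hypothetical helper name, and the starting priority of 0 is an assumption (the default for a request with no explicit priority).

```python
def redis_score(depth, depth_priority, base_priority=0):
    """Apply the two formulas from the text to one request (illustration only)."""
    # Scrapy side: request.priority -= depth * self.prio
    priority = base_priority - depth * depth_priority
    # scrapy_redis side: score = -request.priority
    return -priority


depths = (1, 2, 3)
bfs_scores = [redis_score(d, 1) for d in depths]    # DEPTH_PRIORITY = 1
dfs_scores = [redis_score(d, -1) for d in depths]   # DEPTH_PRIORITY = -1

print(bfs_scores)
print(dfs_scores)
```

With DEPTH_PRIORITY = 1 the shallowest request has the lowest score and is popped first. With DEPTH_PRIORITY = -1 the deepest request has the lowest score, so it is popped first.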

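How do the scheduler, the queue and the dupefilter relate to each other in Scrapy?

    Scheduler: decides which request is added or fetched.
    Queue: stores the requests.
    Dupefilter: keeps the record of visited requests.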
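The division of work between the three parts can be sketched in a few lines. This is a simplified in-memory model, not the scrapy_redis code: MiniScheduler is a hypothetical class, plain URL strings stand in for Request objects, a Python set plays the dupefilter and a deque plays the FIFO queue.

```python
from collections import deque


class MiniScheduler:
    """Hypothetical in-memory scheduler: dupefilter plus queue (illustration only)."""

    def __init__(self):
        self.seen = set()      # dupefilter: record of visited URLs
        self.queue = deque()   # queue: pending requests

    def enqueue_request(self, url, dont_filter=False):
        # The dupefilter is only consulted when filtering is enabled.
        if not dont_filter:
            if url in self.seen:
                return False
            self.seen.add(url)
        self.queue.appendleft(url)
        return True

    def next_request(self):
        # FIFO order: push on the left, pop on the right.
        return self.queue.pop() if self.queue else None


scheduler = MiniScheduler()
results = [
    scheduler.enqueue_request("https://dig.chouti.com/"),
    scheduler.enqueue_request("https://dig.chouti.com/r/pic/hot/1"),
    scheduler.enqueue_request("https://dig.chouti.com/"),                    # duplicate
    scheduler.enqueue_request("https://dig.chouti.com/", dont_filter=True),  # forced
]
drained = [scheduler.next_request() for _ in range(4)]

print(results)
print(drained)
```

The duplicate is rejected by the dupefilter before it reaches the queue, while the request marked dont_filter bypasses the check and is queued again.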

Note:

# DUPEFILTER_CLASS takes precedence; if it is not set, SCHEDULER_DUPEFILTER_CLASS is used
  SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'  # class that implements the dedup rule

(4). The Redis spider (RedisSpider)

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
import scrapy_redis
from scrapy_redis.spiders import RedisSpider


class ChoutiSpider(RedisSpider):
    name = 'chouti'
    allowed_domains = ['chouti.com']

    def parse(self, response):
        print(response)

Note: here we import RedisSpider with from scrapy_redis.spiders import RedisSpider and make our spider inherit from it:

class RedisSpider(RedisMixin, Spider):
    """Spider that reads urls from redis queue when idle.

    Attributes
    ----------
    redis_key : str (default: REDIS_START_URLS_KEY)
        Redis key where to fetch start URLs from..
    redis_batch_size : int (default: CONCURRENT_REQUESTS)
        Number of messages to fetch from redis on each attempt.
    redis_encoding : str (default: REDIS_ENCODING)
        Encoding to use when decoding messages from redis queue.

    Settings
    --------
    REDIS_START_URLS_KEY : str (default: "<spider.name>:start_urls")
        Default Redis key where to fetch start URLs from..
    REDIS_START_URLS_BATCH_SIZE : int (deprecated by CONCURRENT_REQUESTS)
        Default number of messages to fetch from redis on each attempt.
    REDIS_START_URLS_AS_SET : bool (default: False)
        Use SET operations to retrieve messages from the redis queue. If False,
        the messages are retrieve using the LPOP command.
    REDIS_ENCODING : str (default: "utf-8")
        Default encoding to use when decoding messages from redis queue.

    """

    @classmethod
    def from_crawler(self, crawler, *args, **kwargs):
        obj = super(RedisSpider, self).from_crawler(crawler, *args, **kwargs)
        obj.setup_redis(crawler)
        return obj

可以控制起始的url是去列表里面取还是去集合里面取

fetch_one = self.server.spop if use_set else self.server.lpop

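In the settings file:

REDIS_START_URLS_AS_SET = False

Then push a start URL from a separate script:

import redis

conn = redis.Redis(host='140.143.227.206', port=8888, password='beta')

# Pushing onto the list lets the spider handle each link as soon as it arrives, so the
# start URLs can be customized and requests keep flowing for as long as links are pushed
conn.lpush('chouti:start_urls', 'https://dig.chouti.com/r/pic/hot/1')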
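
Because the spider pops with SPOP or LPOP depending on REDIS_START_URLS_AS_SET, the pushing side has to use the matching structure. The sketch below shows one way to keep the two in step. The helpers start_urls_key and push_start_url are hypothetical names introduced for illustration; conn is assumed to be a redis.Redis client (only its sadd and lpush methods are used), and the key follows the default "<spider.name>:start_urls" pattern quoted in the docstring above.

```python
def start_urls_key(spider_name):
    """Build the default key, "<spider.name>:start_urls"."""
    return "%s:start_urls" % spider_name


def push_start_url(conn, spider_name, url, use_set=False):
    """Push a start URL with the command that matches REDIS_START_URLS_AS_SET.

    Hypothetical helper: conn is assumed to behave like redis.Redis.
    """
    key = start_urls_key(spider_name)
    if use_set:
        # REDIS_START_URLS_AS_SET = True: the spider reads with SPOP
        return conn.sadd(key, url)
    # REDIS_START_URLS_AS_SET = False: the spider reads with LPOP
    return conn.lpush(key, url)
```

With the connection from the script above, push_start_url(conn, 'chouti', 'https://dig.chouti.com/r/pic/hot/1') would issue the same lpush call as before.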

(5). The settings file

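1.BOT_NAME = 'cjk' # name of the bot (the project)
2.SPIDER_MODULES = ['cjk.spiders'] # modules where scrapy looks for spiders
3.NEWSPIDER_MODULE = 'cjk.spiders' # module where newly generated spiders are created
4.USER_AGENT = 'dbd (+http://www.yourdomain.com)' # User-Agent request header
5.ROBOTSTXT_OBEY = False # True means the crawler obeys the site's robots.txt rules; False means it does not
6.CONCURRENT_REQUESTS = 32 # total number of concurrent requests across all spiders; with two spiders, one might get 15 and the other 17
7.DOWNLOAD_DELAY = 3 # wait 3 seconds between page downloads
8.CONCURRENT_REQUESTS_PER_DOMAIN = 16 # at most 16 concurrent requests per domain; for a spider that crawls a single domain, that is 16 per spider
9.CONCURRENT_REQUESTS_PER_IP = 16 # at most 16 concurrent requests per IP; a domain may resolve to several IPs, so with 2 IPs the total is 16*2=32
10.COOKIES_ENABLED = False # whether scrapy handles cookies for us internally; False turns this off
11.TELNETCONSOLE_ENABLED = True # enables the telnet console, which lets us pause a running crawler and resume it
#TELNETCONSOLE_HOST = '127.0.0.1'
#TELNETCONSOLE_PORT = [6023,]
12.DEFAULT_REQUEST_HEADERS # default request headers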
13.#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
from scrapy.extensions.throttle import AutoThrottle
"""
Auto-throttle algorithm
    from scrapy.extensions.throttle import AutoThrottle
    Auto-throttle settings
    1. Read the minimum delay, DOWNLOAD_DELAY
    2. Read the maximum delay, AUTOTHROTTLE_MAX_DELAY
    3. Set the initial download delay, AUTOTHROTTLE_START_DELAY
    4. When a request finishes downloading, take its latency, i.e. the time between opening the connection and receiving the response headers
    5. AUTOTHROTTLE_TARGET_CONCURRENCY is used in the calculation:
    target_delay = latency / self.target_concurrency
    new_delay = (slot.delay + target_delay) / 2.0 # slot.delay is the previous delay
    new_delay = max(target_delay, new_delay)
    new_delay = min(max(self.mindelay, new_delay), self.maxdelay)
    slot.delay = new_delay
"""
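14.# HTTPCACHE_ENABLED = True # whether to enable the HTTP cache
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
# For example, there is no network on the subway, so before heading out we can download the pages to disk and read them locally later; a directory named httpcache then appears on the local machine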
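
The auto-throttle formula in item 13 is easier to follow with concrete numbers. The function below is a minimal sketch that reproduces only the four formula lines quoted there; next_delay is a hypothetical name, not part of scrapy, and the arguments stand in for slot.delay, latency, AUTOTHROTTLE_TARGET_CONCURRENCY, DOWNLOAD_DELAY and AUTOTHROTTLE_MAX_DELAY.

```python
def next_delay(prev_delay, latency, target_concurrency, min_delay, max_delay):
    """Compute the next download delay from the previous one (hypothetical helper)."""
    # Delay that would keep target_concurrency requests in flight at this latency
    target_delay = latency / target_concurrency
    # Move halfway from the previous delay towards the target
    new_delay = (prev_delay + target_delay) / 2.0
    # Never go below the target delay
    new_delay = max(target_delay, new_delay)
    # Clamp between the minimum (DOWNLOAD_DELAY) and the maximum delay
    return min(max(min_delay, new_delay), max_delay)


# Values taken from the settings above: start delay 5, minimum 3, maximum 60
fast_site = next_delay(5.0, 1.0, 1.0, 3.0, 60.0)    # low latency: falls to the minimum
slow_site = next_delay(3.0, 100.0, 1.0, 3.0, 60.0)  # high latency: capped at the maximum
two_slots = next_delay(3.0, 10.0, 2.0, 3.0, 60.0)   # target concurrency 2 halves the target delay
```

With these inputs the delay for a responsive site settles on DOWNLOAD_DELAY, while a very slow site is pushed up to AUTOTHROTTLE_MAX_DELAY and no further.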

Original article: https://www.cnblogs.com/junkanchen/p/10225600.html

Time: 2024-11-05 21:58:12
