Scrapy Framework: middlewares.py Settings

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

class DownloadtestSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

class DownloadtestDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

import random

# Set the request header (random User-Agent)
class RandomUA():
    def __init__(self):
        self.user_agent=[
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.1 Safari/605.1.16',
            'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0',
            'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)'
        ]

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agent)

    def process_response(self, request, response, spider):
        # Overwrite the status code of every response with 201,
        # then pass the response on.
        response.status=201
        return response
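
The effect of `RandomUA.process_request` can be checked without running a crawl. The sketch below repeats the same logic and feeds it a stand-in request; `FakeRequest`, `USER_AGENTS` and `pick_user_agent` are hypothetical names introduced only for this illustration and are not part of Scrapy.

```python
import random

# Same User-Agent strings as in RandomUA above.
USER_AGENTS = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.1 Safari/605.1.16',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)',
]


class FakeRequest:
    """Hypothetical stand-in for a Scrapy request: it only carries headers."""

    def __init__(self):
        self.headers = {}


class RandomUA:
    """Same logic as the middleware above, with the list passed in."""

    def __init__(self, user_agents):
        self.user_agent = list(user_agents)

    def process_request(self, request, spider):
        # Pick one User-Agent at random for this request.
        request.headers['User-Agent'] = random.choice(self.user_agent)
        # Returning None tells Scrapy to keep processing the request.
        return None


def pick_user_agent(middleware):
    """Run one stand-in request through the middleware and return its UA."""
    request = FakeRequest()
    middleware.process_request(request, spider=None)
    return request.headers['User-Agent']


if __name__ == '__main__':
    print(pick_user_agent(RandomUA(USER_AGENTS)))
```

Each call may print a different entry from the list, which is the behaviour a real request would see once the middleware is enabled.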

# Set the proxy
class ProxyMiddleware:
    proxy_list=[
        "http://127.0.0.1:8080",
        # "http://183.91.33.41:80",
        # "http://111.29.3.190:80",
        # "http://124.156.108.71:82"
    ]

    def process_request(self, request, spider):
        ip=random.choice(self.proxy_list)
        request.meta['proxy']=ip
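
Scrapy only calls these classes once they are listed in the project's `settings.py`. A minimal sketch follows, assuming the project package is named `downloadtest` (inferred from the generated class names, not stated in the original) and using illustrative priority numbers.

```python
# settings.py (fragment)
# Assumption: the project package is called "downloadtest".
# Lower numbers sit closer to the engine, so their process_request()
# runs earlier. Both values are below 750, the default priority of
# Scrapy's built-in HttpProxyMiddleware, so request.meta['proxy'] is
# already set when the built-in middleware sees the request.
DOWNLOADER_MIDDLEWARES = {
    "downloadtest.middlewares.RandomUA": 543,
    "downloadtest.middlewares.ProxyMiddleware": 544,
}
```

The exact numbers are a judgment call; what matters is where they fall relative to the built-in middlewares.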

Original article: https://www.cnblogs.com/hankleo/p/11829704.html

Time: 2024-08-30 08:55:35
