Web Scraping: Using selenium in the scrapy Framework

1. Scraping NetEase military news with selenium

Workflow:

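'''
Coding workflow for using selenium in scrapy:
    1. Create a browser object in the spider's constructor (as an attribute of the current spider)
    2. Override the spider's closed(self, reason) method and close the browser there
    3. In the downloader middleware's process_response method, get the browser object through the spider argument
    4. In process_response, write the browser automation code (to obtain the page source that contains the dynamically loaded data)
    5. Instantiate a response object and wrap the page source returned by page_source in it
    6. Return the new response object
'''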

First, import HtmlResponse in the middleware file (middlewares.py):

from scrapy.http import HtmlResponse

The process_response method of the downloader middleware class:

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest

        # Load the page in the spider's browser so that dynamically loaded data is rendered
        print("About to return a new response object")
        bw = spider.bw
        bw.get(url=request.url)
        import time  # local import kept from the original; usually placed at module top
        # Give slow pages time to finish loading
        time.sleep(3)
        # page_source now includes the dynamically loaded data
        page_text = bw.page_source
        return HtmlResponse(url=bw.current_url, body=page_text, encoding="utf-8", request=request)
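The fixed time.sleep(3) above waits the full three seconds even when the page is ready sooner, and may still be too short on a slow connection. A minimal sketch of the alternative, polling the page source until an expected piece of markup shows up, is given below. The helper name wait_for_marker and its parameters are hypothetical and not part of the original post; it uses only the standard library and relies on nothing but the driver's page_source attribute. Selenium's own WebDriverWait provides the same kind of explicit wait.

```python
import time


def wait_for_marker(driver, marker, timeout=10.0, poll=0.5):
    """Poll driver.page_source until `marker` appears or `timeout` elapses.

    `driver` is any object with a `page_source` attribute, for example the
    selenium browser the spider stores as `spider.bw`. (Hypothetical helper,
    not from the original post.)
    """
    deadline = time.monotonic() + timeout
    while True:
        source = driver.page_source
        if marker in source:
            # The expected markup is present, so the dynamic data has loaded
            return source
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{marker!r} not found within {timeout} s")
        time.sleep(poll)
```

Inside process_response, page_text = wait_for_marker(spider.bw, "news_title") could then stand in for the sleep followed by bw.page_source, assuming "news_title" only appears once the news list has been rendered.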

spider.py

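# -*- coding: utf-8 -*-
import scrapy
from selenium import webdriver

class ScrapySeleniumSpider(scrapy.Spider):
    name = 'scrapy_selenium'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://war.163.com/']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # One browser shared by the whole spider; the middleware reads it as spider.bw.
        # executable_path is the Selenium 3 style; Selenium 4 takes the driver path
        # through selenium.webdriver.chrome.service.Service instead.
        self.bw = webdriver.Chrome(executable_path=r"F:\爬虫+数据\chromedriver.exe")

    def parse(self, response):
        div_list = response.xpath('//div[@class="data_row news_article clearfix "]')
        for div in div_list:
            title = div.xpath('.//div[@class="news_title"]/h3/a/text()').extract_first()
            print(title)

    def closed(self, reason):
        # Scrapy calls closed() with the close reason when the spider finishes
        print('Closing the browser object!')
        self.bw.quit()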

Also note that for the middleware to take effect, the DOWNLOADER_MIDDLEWARES setting in settings.py must be uncommented.
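As a rough illustration, the enabled setting in settings.py might look like the sketch below. The project name myproject and the class name MyprojectDownloaderMiddleware are assumptions made for this example; the real key must be the dotted path to the middleware class in your own middlewares.py. The value 543 is the order that the scrapy project template uses for this entry.

```python
# settings.py (fragment)
# "myproject" and "MyprojectDownloaderMiddleware" are placeholder names;
# replace them with your own project and middleware class.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.MyprojectDownloaderMiddleware": 543,
}
```

While the entry stays commented out, scrapy never calls the custom process_response, so the spider only receives the original HTML without the dynamically loaded data.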

If the news titles are printed when the spider runs, it worked.

Original post: https://www.cnblogs.com/qq631243523/p/10472189.html

Time: 2024-10-10 18:05:21
