1.首先创建爬虫项目
2.进入爬虫
class TaobaoSpider(scrapy.Spider):
    """Spider that searches Taobao for laptops; the actual page download is
    delegated to a Selenium-driven browser via the download middleware below."""

    name = 'taobao'
    allowed_domains = ['taobao.com']
    # 拿一个笔记本键盘做示例 (a laptop search is used as the example)
    start_urls = ['https://s.taobao.com/search?initiative_id=tbindexz_20170306&ie=utf8&spm=a21bo.2017.201856-taobao-item.2&sourceId=tb.index&search_type=item&ssid=s5-e&commend=all&imgfile=&q=%E7%AC%94%E8%AE%B0%E6%9C%AC%E7%94%B5%E8%84%91&suggest=0_1&_input_charset=utf-8&wq=%E7%AC%94%E8%AE%B0%E6%9C%AC&suggest_query=%E7%AC%94%E8%AE%B0%E6%9C%AC&source=suggest']

    def __init__(self):
        """Start the headless browser the middleware reuses for every request."""
        super(TaobaoSpider, self).__init__()
        # PhantomJS is headless; Firefox() or Chrome() would also work.
        # NOTE(review): PhantomJS support was removed from recent Selenium
        # releases — prefer headless Chrome/Firefox going forward.
        self.driver = webdriver.PhantomJS()

    def parse(self, response):
        """Extract the title and price of each item on the rendered page."""
        div_info = response.xpath('//div[@class="info-cont"]')
        for div in div_info:
            title = div.xpath('div[@class="title-row"]/a/text()').extract_first('')
            price = div.xpath('div[contains(@class, "sale-row")]/div/span[contains(@class, "price")]/strong/text()').extract_first('')
            print('名称:', title, '价格:', price)

    def closed(self, reason):
        """Called by Scrapy when the spider closes; also quit the browser."""
        print('爬虫关闭了, 原因:', reason)
        self.driver.quit()

写到这,爬虫类函数写完了,然后需要去设置middlewares中间件
import time
from selenium import webdriver
from scrapy.http.response.html import HtmlResponse
from scrapy.http.response import Response

需要这几个模块,然后重写 DownloadMiddleware 这个类
class SeleniumRequestDownloadMiddleWare(object):
    """Download middleware that renders Taobao pages with a real browser,
    so JavaScript-loaded content is present in the response body."""

    def __init__(self):
        super(SeleniumRequestDownloadMiddleWare, self).__init__()
        # Own browser instance; note process_request below actually reuses
        # the spider's driver, matching the original article's behavior.
        self.driver = webdriver.PhantomJS()

    def process_request(self, request, spider):
        """Fetch the URL with Selenium for the taobao spider only.

        Returning a Response short-circuits Scrapy's default downloader;
        returning None lets every other spider download normally.
        """
        if spider.name == 'taobao':
            spider.driver.get(request.url)
            # 设置滚动条,往下拉页面获取源码 — scroll down in steps so that
            # lazily loaded items are rendered before we grab the source.
            for step in range(1, 11, 2):
                fraction = float(step) / 10
                js = "document.body.scrollTop=document.body.scrollHeight * %f" % fraction
                spider.driver.execute_script(js)
                time.sleep(1)  # 需要设置等待时间1秒,不然加载缓慢的话,不出数据
            # HtmlResponse handles text encoding; passing the page source
            # through bytes() without an encoding breaks on Python 3.
            return HtmlResponse(url=request.url,
                                body=spider.driver.page_source,
                                encoding='utf-8',
                                request=request)
        # Any other spider falls through to the default downloader.
        return None
原文地址:https://www.cnblogs.com/star-god/p/8379790.html
时间: 2024-10-28 11:38:15