使用scrapy爬取知乎图片

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for zhihutupian project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = ‘zhihutupian‘

SPIDER_MODULES = [‘zhihutupian.spiders‘]
NEWSPIDER_MODULE = ‘zhihutupian.spiders‘

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = ‘zhihutupian (+http://www.yourdomain.com)‘
USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
#LOG_LEVEL = "ERROR"

IMAGES_STORE = ‘./imgsLib‘  #自动创建文件夹

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   ‘Accept‘: ‘text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8‘,
#   ‘Accept-Language‘: ‘en‘,
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    ‘zhihutupian.middlewares.ZhihutupianSpiderMiddleware‘: 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    ‘zhihutupian.middlewares.ZhihutupianDownloaderMiddleware‘: 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    ‘scrapy.extensions.telnet.TelnetConsole‘: None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   ‘zhihutupian.pipelines.ImgproPipeline‘: 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = ‘httpcache‘
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = ‘scrapy.extensions.httpcache.FilesystemCacheStorage‘

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don‘t forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy.pipelines.images import ImagesPipeline
import scrapy
class ImgproPipeline(ImagesPipeline):
    #ImgproPipeline
    def get_media_requests(self,item,info):   #注意此处使用的是get_media_requests,注意末尾的s
        img_src = item[‘img_src‘]
        print(img_src)
        yield scrapy.Request(url=img_src,meta={‘item‘:item})

    def file_path(self,request,response=None,info=None):
        img_name=request.meta[‘item‘][‘img_name‘]

        return img_name

    def item_completed(self,request,item,info):
        return item

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class ZhihutupianItem(scrapy.Item):

    # define the fields for your item here like:
    img_name = scrapy.Field()
    img_src = scrapy.Field()

zhihu.py

# -*- coding: utf-8 -*-
import scrapy
from zhihutupian.items import ZhihutupianItem

class ZhihuSpider(scrapy.Spider):
    name = ‘zhihu‘
    #allowed_domains = [‘www.zhihu.com‘]
    start_urls = [‘https://www.zhihu.com/question/xxxxxx‘]
    i = 0
    def parse(self, response):
        div_list = response.xpath("//figure")    #解析到所有图片所在标签
        for img_src in div_list:

            img_name = str(self.i) + ‘.jpg‘
            src_div = img_src.xpath("./img/@data-original").extract_first()
            # print(src_div)
            item = ZhihutupianItem()
            item[‘img_name‘] = img_name
            item[‘img_src‘] = src_div
            #print(item[‘img_name‘])
            #print(item[‘img_src‘])
            self.i += 1
            yield item

imddlewares.py未做修改

注:利用此

原文地址：https://www.cnblogs.com/cuirenlao/p/12534273.html

时间： 2024-10-05 05:50:22

使用scrapy爬取知乎图片的相关文章

scrapy 爬取知乎问题、答案，并异步写入数据库（mysql）

python版本 python2.7 爬取知乎流程: 一 .分析在访问知乎首页的时候(https://www.zhihu.com),在没有登录的情况下,会进行重定向到(https://www.zhihu.com/signup?next=%2F)这个页面, 爬取知乎,首先要完成登录操作,登陆的时候观察往那个页面发送了post或者get请求.可以利用抓包工具来获取登录时密码表单等数据的提交地址. 1.利用抓包工具,查看用户名密码数据的提交地址页就是post请求,将表单数据提交的网址,经过查看

利用 Scrapy 爬取知乎用户信息

思路:通过获取知乎某个大V的关注列表和被关注列表,查看该大V和其关注用户和被关注用户的详细信息,然后通过层层递归调用,实现获取关注用户和被关注用户的关注列表和被关注列表,最终实现获取大量用户信息. 一.新建一个scrapy项目 scrapy startproject zhihuuser 移动到新建目录下: cd zhihuuser 新建spider项目: scrapy genspider zhihu 二.这里以爬取知乎大V轮子哥的用户信息来实现爬取知乎大量用户信息. a) 定义 spdier.p

scrapy爬取知乎问答

登陆参考 https://github.com/zkqiang/Zhihu-Login # -*- coding: utf-8 -*- import scrapy import time import re import base64 import hmac import hashlib import json import matplotlib.pyplot as plt from PIL import Image class ZhihuSpider(scrapy.Spider): name

运维学python之爬虫高级篇（七）scrapy爬取知乎关注用户存入mongodb

首先,祝大家开工大吉!本篇将要介绍的是从一个用户开始,通过抓关注列表和粉丝列表,实现用户的详细信息抓取并将抓取到的结果存储到 MongoDB. 1 环境需求基础环境沿用之前的环境,只是增加了MongoDB(非关系型数据库)和PyMongo(Python 的 MongoDB 连接库),默认我认为大家都已经安装好并启动了MongoDB 服务. 项目创建.爬虫创建.禁用ROBOTSTXT_OBEY设置略(可以参考上一篇) 2 测试爬虫效果我这里先写一个简单的爬虫,爬取用户的关注人数和粉丝数,代码

用scrapy爬取搜狗Lofter图片

# -*- coding: utf-8 -*- import json import scrapy from scrapy.http import Request from urllib import parse from scrapy.loader import ItemLoader from tutorial.items import LofterSpiderItem class LofterSpider(scrapy.Spider): name = "lofter" allowe

python 爬取知乎图片

SyntaxError: Non-UTF-8 code starting with '\xbf' in file python-zhihu -v1.2.py on line 34, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details 安装需要的模块 pip install requestspip install PyQuery pip show 命令检查模块是否安装成功(如图所示是成功的)

Scrapy爬取美女图片 (原创)

有半个月没有更新了,最近确实有点忙.先是华为的比赛,接着实验室又有项目,然后又学习了一些新的知识,所以没有更新文章.为了表达我的歉意,我给大家来一波福利... 今天咱们说的是爬虫框架.之前我使用python爬取慕课网的视频,是根据爬虫的机制,自己手工定制的,感觉没有那么高大上,所以我最近玩了玩 python中强大的爬虫框架Scrapy. Scrapy是一个用 Python 写的 Crawler Framework ,简单轻巧,并且非常方便.Scrapy 使用 Twisted 这个异步网络库来处理

Scrapy爬取美女图片续集 (原创)

上一篇咱们讲解了Scrapy的工作机制和如何使用Scrapy爬取美女图片,而今天接着讲解Scrapy爬取美女图片,不过采取了不同的方式和代码实现,对Scrapy的功能进行更深入的运用. 在学习Scrapy官方文档的过程中,发现Scrapy自身实现了图片和文件的下载功能,不需要咱们之前自己实现图片的下载(不过原理都一样). 在官方文档中,我们可以看到下面一些话:Scrapy为下载item中包含的文件(比如在爬取到产品时,同时也想保存对应的图片)提供了一个可重用的 item pipelines .

Scrapy爬取美女图片第三集代理ip(上) (原创)

首先说一声,让大家久等了.本来打算520那天进行更新的,可是一细想,也只有我这样的单身狗还在做科研,大家可能没心思看更新的文章,所以就拖到了今天.不过忙了521,522这一天半,我把数据库也添加进来了,修复了一些bug(现在肯定有人会说果然是单身狗). 好了,废话不多说,咱们进入今天的主题.上两篇 Scrapy爬取美女图片的文章,咱们讲解了scrapy的用法.可是就在最近,有热心的朋友对我说之前的程序无法爬取到图片,我猜应该是煎蛋网加入了反爬虫机制.所以今天讲解的就是突破反爬虫机制的上篇代理