```shell
scrapy startproject bmw
cd bmw
scrapy genspider bmw5 autohome.com.cn
```
Approach 1: without using ImagesPipeline
bmw5.py:
```python
import scrapy
from bmw.items import BmwItem


class Bmw5Spider(scrapy.Spider):
    name = 'bmw5'
    allowed_domains = ['autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/pic/series/65.html']

    def parse(self, response):
        uiboxs = response.xpath('//div[@class = "uibox"]')[1:]
        for uibox in uiboxs:
            category = uibox.xpath('.//div[@class = "uibox-title"]/a/text()').get()
            urls = uibox.xpath('.//ul/li/a/img/@src').getall()
            urls = list(map(lambda url: response.urljoin(url), urls))
            item = BmwItem(category=category, urls=urls)
            yield item
```
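The thumbnail `src` attributes on the page are scheme-relative (they start with `//`), which is why the spider maps every URL through `response.urljoin()`. A quick sketch of what that resolution does, using `urllib.parse.urljoin` (which Scrapy's `response.urljoin` essentially delegates to); the image host and path below are made-up placeholders:

```python
from urllib.parse import urljoin

# A scheme-relative src ("//host/path") joined with the page URL
# inherits the page's https scheme and becomes a full URL.
page_url = 'https://car.autohome.com.cn/pic/series/65.html'
src = '//img.example.com/pic/sample_100.jpg'  # illustrative src value
full_url = urljoin(page_url, src)
print(full_url)  # https://img.example.com/pic/sample_100.jpg
```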
items.py:
```python
import scrapy


class BmwItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    category = scrapy.Field()
    urls = scrapy.Field()
```
settings.py (relevant part):
```python
ITEM_PIPELINES = {
    'bmw.pipelines.BmwPipeline': 300,
}
```
pipelines.py:
```python
import os
from urllib import request


class BmwPipeline(object):
    def __init__(self):
        # Create an images/ folder next to this file.
        self.path = os.path.join(os.path.dirname(__file__), 'images')
        if not os.path.exists(self.path):
            os.mkdir(self.path)

    def process_item(self, item, spider):
        category = item['category']
        urls = item['urls']
        # One sub-folder per category.
        category_path = os.path.join(self.path, category)
        if not os.path.exists(category_path):
            os.mkdir(category_path)
        for url in urls:
            # Name the file after whatever follows the last underscore.
            image_name = url.split('_')[-1]
            request.urlretrieve(url, os.path.join(category_path, image_name))
        return item
```
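The pipeline derives each file name by taking whatever follows the last underscore in the image URL. A small illustration with a made-up URL in the same shape as the site's thumbnail links:

```python
# Made-up URL shaped like the site's thumbnail links.
url = 'https://img.example.com/cardfs/product/autohomecar_ChsEe123.jpg'
image_name = url.split('_')[-1]
print(image_name)  # ChsEe123.jpg
```

Note that a URL with no underscore would yield the whole URL as the "name", so this scheme relies on the site's URL format staying stable.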
Approach 2: saving images with ImagesPipeline
Steps:
1. Define an Item with two fields: image_urls and images. image_urls holds the URLs of the images to download and must be a list.
2. When a file finishes downloading, information about the download (the file path, the source URL, the image checksum, etc.) is stored in the item's images field.
3. In settings.py, set IMAGES_STORE to the directory where downloaded images should be saved, and set IMAGES_URLS_FIELD to the name of the item field that holds the image URLs (note: this is essential; without it the image folder stays empty).
4. Enable the pipeline: add 'scrapy.pipelines.images.ImagesPipeline': 1 to ITEM_PIPELINES.
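With the default field names, the steps above need no custom pipeline code at all. A minimal sketch (the item class name is hypothetical; image_urls, images, IMAGES_STORE, and ITEM_PIPELINES are the names the stock pipeline expects):

```python
# items.py -- the stock ImagesPipeline looks for these exact field names
import scrapy

class CarItem(scrapy.Item):      # hypothetical item class
    image_urls = scrapy.Field()  # list of image URLs to download
    images = scrapy.Field()      # populated with path/url/checksum per image

# settings.py
IMAGES_STORE = 'imgs'            # where downloaded images land
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
```

Because this project's item uses a field called urls rather than image_urls, the IMAGES_URLS_FIELD setting shown later remaps the field name instead.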
Rewritten pipelines.py:
```python
import os
from scrapy.pipelines.images import ImagesPipeline
from bmw import settings


class BMWImagesPipeline(ImagesPipeline):  # subclass ImagesPipeline
    # Called before the download requests are sent; it is what issues them.
    def get_media_requests(self, item, info):
        request_objects = super(BMWImagesPipeline, self).get_media_requests(item, info)
        for request_object in request_objects:
            request_object.item = item  # attach the item so file_path() can read it
        return request_objects

    # Called when an image is about to be stored; returns its storage path.
    def file_path(self, request, response=None, info=None):
        path = super(BMWImagesPipeline, self).file_path(request, response, info)
        category = request.item.get('category')
        images_stores = settings.IMAGES_STORE  # base directory for images
        category_path = os.path.join(images_stores, category)
        if not os.path.exists(category_path):  # create the category folder if missing
            os.mkdir(category_path)
        image_name = path.replace('full/', '')
        image_path = os.path.join(category_path, image_name)
        return image_path
```
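The effect of the file_path override can be sketched without running Scrapy: the parent class returns a path of the form 'full/&lt;hash&gt;.jpg', and the override strips the 'full/' prefix and re-roots the file under IMAGES_STORE/&lt;category&gt;. The hash and folder names below are placeholders:

```python
import os

default_path = 'full/0a1b2c3d4e5f.jpg'  # shape of ImagesPipeline's default result
images_store = 'imgs'                   # stands in for settings.IMAGES_STORE
category = 'exterior'                   # stands in for item['category']

image_name = default_path.replace('full/', '')
image_path = os.path.join(images_store, category, image_name)
print(image_path)  # e.g. imgs/exterior/0a1b2c3d4e5f.jpg on POSIX
```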
Rewritten settings.py:
```python
import os

IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'imgs')
IMAGES_URLS_FIELD = 'urls'
ITEM_PIPELINES = {
    'bmw.pipelines.BMWImagesPipeline': 1,
}
```
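Since settings.py lives at &lt;project&gt;/bmw/settings.py, the double dirname() call climbs from the file up to the outer project directory, so the imgs folder ends up next to scrapy.cfg. A sketch with an illustrative path:

```python
import os

settings_file = '/home/user/bmw/bmw/settings.py'  # illustrative location
project_root = os.path.dirname(os.path.dirname(settings_file))
images_store = os.path.join(project_root, 'imgs')
print(images_store)  # /home/user/bmw/imgs
```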
To run Scrapy from PyCharm, create a start.py in the project folder:
```python
from scrapy import cmdline

cmdline.execute(['scrapy', 'crawl', 'bmw5'])
```
Original article: https://www.cnblogs.com/min-R/p/10545408.html