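Project name: qidian
Project description: Use Scrapy to scrape the 500 novels on the overall "Completed Works" ranking (完本榜) of Qidian (起点中文网). For each novel, collect the title, author, and category, then save the results to a CSV file.
Target URL: https://www.qidian.com/rank/fin?style=1
Project requirements:
1. Novel title
2. Author
3. Novel category
Step 1: Create the project in the shell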
scrapy startproject qidian
Step 2: Edit items.py according to the project requirements
# -*- coding: utf-8 -*-
import scrapy


class QidianItem(scrapy.Item):
    name = scrapy.Field()
    author = scrapy.Field()
    category = scrapy.Field()
Step 3: Analyze the page, extract the data with XPath or CSS selectors, then create and edit spider.py
# -*- coding: utf-8 -*-
import scrapy
from ..items import QidianItem


class QidianSpider(scrapy.Spider):
    name = 'qidian'
    start_urls = ['https://www.qidian.com/rank/fin?style=1&dateType=3']

    def parse(self, response):
        # Each book entry on the ranking page is a div.book-mid-info block.
        sel = response.xpath('//div[@class="book-mid-info"]')
        for i in sel:
            name = i.xpath('./h4/a/text()').extract_first()
            # Inside p.author, a[1] is read as the author and a[last()] as the category.
            author = i.xpath('./p[@class="author"]/a[1]/text()').extract_first()
            category = i.xpath('./p[@class="author"]/a[last()]/text()').extract_first()
            item = QidianItem()
            item['name'] = name
            item['author'] = author
            item['category'] = category
            yield item
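To see what the three relative paths pick out, here is a small stand-alone sketch that applies them to a hand-written fragment. It is only an illustration under stated assumptions: the markup and the book data in SAMPLE are invented and much simpler than the real page, the helper name extract_books is made up, and it uses the standard-library xml.etree.ElementTree (which supports only a subset of XPath) instead of Scrapy's selectors, so the text is read through .text rather than text().

```python
import xml.etree.ElementTree as ET

# Invented, simplified markup; not copied from qidian.com.
SAMPLE = """
<div id="rank">
  <div class="book-mid-info">
    <h4><a href="#">Example Novel A</a></h4>
    <p class="author"><a href="#">Author A</a><em>|</em><a href="#">Fantasy</a></p>
  </div>
  <div class="book-mid-info">
    <h4><a href="#">Example Novel B</a></h4>
    <p class="author"><a href="#">Author B</a><em>|</em><a href="#">History</a></p>
  </div>
</div>
"""


def extract_books(markup):
    """Hypothetical helper mirroring the loop in parse()."""
    root = ET.fromstring(markup.strip())
    books = []
    # Same block selection as //div[@class="book-mid-info"] in the spider.
    for block in root.findall(".//div[@class='book-mid-info']"):
        books.append({
            "name": block.find("./h4/a").text,
            # a[1] is the first link inside p.author, a[last()] the final one.
            "author": block.find("./p[@class='author']/a[1]").text,
            "category": block.find("./p[@class='author']/a[last()]").text,
        })
    return books


books = extract_books(SAMPLE)
for book in books:
    print(book)
```

Against the live site, the same expressions are easier to try out interactively with scrapy shell and the target URL.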
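The spider so far handles only a single page of data. Next, we follow the link to the next page. Because the project is so small, I don't think anything fancy is needed: looking at how the URL is constructed is enough to write the code. The complete spider.py is below.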
# -*- coding: utf-8 -*-
import scrapy
from ..items import QidianItem


class QidianSpider(scrapy.Spider):
    name = 'qidian'
    start_urls = ['https://www.qidian.com/rank/fin?style=1&dateType=3']
    n = 1  # current page number, starting at page 1

    def parse(self, response):
        sel = response.xpath('//div[@class="book-mid-info"]')
        for i in sel:
            name = i.xpath('./h4/a/text()').extract_first()
            author = i.xpath('./p[@class="author"]/a[1]/text()').extract_first()
            category = i.xpath('./p[@class="author"]/a[last()]/text()').extract_first()
            item = QidianItem()
            item['name'] = name
            item['author'] = author
            item['category'] = category
            yield item

        # Once the current page is parsed, request the next one, up to page 25.
        if self.n < 25:
            self.n += 1  # n is the page number
            next_url = 'https://www.qidian.com/rank/fin?style=1&dateType=3&page=%d' % self.n
            # The callback must be the bound method self.parse; a bare parse is undefined here.
            yield scrapy.Request(next_url, callback=self.parse)
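The URL pattern the spider relies on can be checked without running a crawl. The sketch below is an illustration, not part of the original project: the helper name page_urls is made up, and it assumes, as the spider does, that page 1 is the start URL and that pages 2 to 25 differ only by an added page parameter. If the 500 novels are spread evenly, that is 20 novels per page.

```python
BASE_URL = 'https://www.qidian.com/rank/fin?style=1&dateType=3'
LAST_PAGE = 25  # same upper bound as in the spider


def page_urls(last_page=LAST_PAGE):
    """Hypothetical helper: list every URL the spider ends up requesting."""
    urls = [BASE_URL]  # page 1 is the start URL, without a page parameter
    for n in range(2, last_page + 1):
        urls.append('%s&page=%d' % (BASE_URL, n))
    return urls


urls = page_urls()
print(len(urls))
print(urls[1])
print(urls[-1])
```

A list like this could also be assigned to start_urls, in which case Scrapy schedules all pages itself and the counter n is no longer needed; whether that is preferable is a matter of taste for a project this small.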
Step 4: Run the spider and save the data
scrapy crawl qidian -o qidian.csv
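Once the crawl finishes, the exported file can be sanity-checked with the standard csv module. This is a sketch under assumptions: the two rows in SAMPLE_CSV are invented stand-ins for the real export, the function name summarize is made up, and the only thing taken from the project is that the header line carries the item's field names name, author, and category.

```python
import csv
import io

# Invented sample standing in for qidian.csv; real rows come from the crawl.
SAMPLE_CSV = (
    "name,author,category\n"
    "Example Novel A,Author A,Fantasy\n"
    "Example Novel B,Author B,History\n"
)


def summarize(csv_file):
    """Return the row count and the rows missing any of the three fields."""
    rows = list(csv.DictReader(csv_file))
    incomplete = [
        row for row in rows
        if not all(row.get(key) for key in ("name", "author", "category"))
    ]
    return len(rows), incomplete


count, incomplete = summarize(io.StringIO(SAMPLE_CSV))
print(count, len(incomplete))
```

To check the real file, open qidian.csv (for example with encoding='utf-8' and newline='') and pass the file object to summarize. For the crawl described here one would expect a count of 500; a smaller number suggests that some pages were not fetched or that the page structure differs from what the XPath expressions assume.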
Original article: https://www.cnblogs.com/Alfred-ou/p/9326296.html
Time: 2024-11-05 22:40:14