1. Project preparation. Target site: http://quanzhou.tianqi.com/
2. Create and edit the Scrapy spider:
scrapy startproject weather
scrapy genspider HQUSpider quanzhou.tianqi.com
The project file structure is shown below:
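A reference layout, assuming the default structure that scrapy startproject and scrapy genspider generate:

weather/
    scrapy.cfg
    weather/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            HQUSpider.py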
3. Edit items.py:
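The items file only needs one field per value the spider extracts. A minimal sketch, assuming the six fields used by HQUSpider below (cityDate, week, img, temperature, weather, wind):

# -*- coding: utf-8 -*-
import scrapy


class WeatherItem(scrapy.Item):
    # One Field per value filled in by HQUSpider.parse()
    cityDate = scrapy.Field()      # city name plus date
    week = scrapy.Field()          # day of the week
    img = scrapy.Field()           # URL of the weather icon
    temperature = scrapy.Field()   # temperature range
    weather = scrapy.Field()       # weather description
    wind = scrapy.Field()          # wind direction and force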
4. Edit the spider file HQUSpider.py:
(1) First run scrapy shell http://quanzhou.tianqi.com/ to test and work out the selectors:
(2) Experiment with the selectors: open the page in Chrome and inspect the page source:
(3) Run commands in the shell to inspect the response:
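A sketch of such a session, assuming the div[@class="tqshow1"] forecast blocks targeted later in the spider:

scrapy shell http://quanzhou.tianqi.com/
>>> sub = response.xpath('//div[@class="tqshow1"]')[0]    # first forecast block
>>> sub.xpath('./h3//text()').extract()                   # city name and date
>>> sub.xpath('./ul/li[2]//text()').extract()             # temperature range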
(4) Write HQUSpider.py:
# -*- coding: utf-8 -*-
import scrapy
from weather.items import WeatherItem


class HquspiderSpider(scrapy.Spider):
    name = 'HQUSpider'
    allowed_domains = ['tianqi.com']
    # Crawl the weather pages of several cities
    citys = ['quanzhou', 'datong']
    start_urls = []
    for city in citys:
        start_urls.append('http://' + city + '.tianqi.com/')

    def parse(self, response):
        # Each <div class="tqshow1"> block holds one day's forecast
        subSelector = response.xpath('//div[@class="tqshow1"]')
        items = []
        for sub in subSelector:
            item = WeatherItem()
            # The city name and date are split across several text nodes in <h3>
            cityDates = ''
            for cityDate in sub.xpath('./h3//text()').extract():
                cityDates += cityDate
            item['cityDate'] = cityDates
            item['week'] = sub.xpath('./p//text()').extract()[0]
            item['img'] = sub.xpath('./ul/li[1]/img/@src').extract()[0]
            # The temperature range is also split across several text nodes
            temps = ''
            for temp in sub.xpath('./ul/li[2]//text()').extract():
                temps += temp
            item['temperature'] = temps
            item['weather'] = sub.xpath('./ul/li[3]//text()').extract()[0]
            item['wind'] = sub.xpath('./ul/li[4]//text()').extract()[0]
            items.append(item)
        return items
(5) Edit pipelines.py to process the items produced by the spider:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import time
import os.path
import urllib2
import sys
# Python 2 idiom: allow UTF-8 strings to be written directly
reload(sys)
sys.setdefaultencoding('utf8')


class WeatherPipeline(object):
    def process_item(self, item, spider):
        # Append each item to a text file named after today's date
        today = time.strftime('%Y%m%d', time.localtime())
        fileName = today + '.txt'
        with open(fileName, 'a') as fp:
            fp.write(item['cityDate'].encode('utf-8') + '\t')
            fp.write(item['week'].encode('utf-8') + '\t')
            # Download the weather icon once, keyed by its file name
            imgName = os.path.basename(item['img'])
            fp.write(imgName + '\t')
            if not os.path.exists(imgName):
                # Use a separate handle so the outer fp is not shadowed and closed
                with open(imgName, 'wb') as imgFp:
                    response = urllib2.urlopen(item['img'])
                    imgFp.write(response.read())
            fp.write(item['temperature'].encode('utf-8') + '\t')
            fp.write(item['weather'].encode('utf-8') + '\t')
            fp.write(item['wind'].encode('utf-8') + '\n\n')
            time.sleep(1)
        return item
(6) Edit settings.py to tell Scrapy which pipeline should handle the scraped data:
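Registering the pipeline normally means adding an ITEM_PIPELINES entry like the following to settings.py (a sketch; older Scrapy versions accept a plain list instead of a dict):

# settings.py
ITEM_PIPELINES = {
    'weather.pipelines.WeatherPipeline': 300,  # 300 is an arbitrary mid-range priority
}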
(7) Run the command: scrapy crawl HQUSpider
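If the run succeeds, the pipeline above should leave a text file named after the current date (YYYYMMDD.txt) in the working directory, with one tab-separated record per forecast day, alongside the downloaded weather-icon images.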
With that, a complete Scrapy crawler is finished.