1. Task 1: fetch the content of the following two URLs and write it to files
http://www.dmoz.org/Computers/Programming/Languages/Python/Books/
http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/
Project screenshot
Unlike the previous project, this spider defines no rules attribute; instead it defines a parse method, which tells Scrapy what to do with the content fetched from the start URLs. For the first task we simply write the content to files; a sketch is shown below.
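A minimal sketch of such a spider, following the standard Scrapy tutorial pattern (the output-file naming scheme is an assumption here; the spider name and URLs match the later sections):

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # Assumption: name each output file after the last URL path
        # segment ("Books" or "Resources") and dump the raw page body.
        filename = response.url.split("/")[-2]
        with open(filename, "wb") as f:
            f.write(response.body)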
2. Task 2: practice XPath in the scrapy shell
From the project's top-level directory, run:
scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
The shell provides a local variable named response, which holds the content fetched when the shell started up.
Practice some simple XPath expressions.
The xpath method returns a list of selectors; you can call xpath again on each selector to drill deeper, as in the session below.
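An illustrative shell session (the //ul/li structure matches the spider in Task 3; actual results depend on the page markup):

>>> response.xpath('//title/text()').extract()   # page title, as a list of strings
>>> sites = response.xpath('//ul/li')            # a list of Selector objects
>>> for sel in sites:
...     print(sel.xpath('a/text()').extract())   # chain xpath() on each Selector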
Finally, exit the shell with Ctrl-Z or by typing quit().
3. Task 3: select title, link, and desc from the response and print them to the console
To do this we rewrite our spider:
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # Each <li> under a <ul> holds one listing entry
        for sel in response.xpath('//ul/li'):
            title = sel.xpath('a/text()').extract()
            link = sel.xpath('a/@href').extract()
            desc = sel.xpath('text()').extract()
            print(title, link, desc)
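To try it, run the spider from the project's top-level directory and the extracted values are printed to the console:

scrapy crawl dmoz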
4. Task 4: write title, link, and desc to a file as JSON
Edit items.py in the project's top-level directory:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class DmozItem(scrapy.Item):
    # One Field per value we want to export
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
Then rewrite the spider to populate and yield DmozItem objects:
__author__ = 'DB'
import scrapy
from project002.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            # Yielding the item hands it to Scrapy's feed export
            yield item
Run the spider again, this time exporting to a JSON file:
scrapy crawl dmoz -o items.json
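Each scraped item is serialized as one JSON object in items.json; a hypothetical record (the values are illustrative, not real output) might look like:

{"title": ["Example Book Title"], "link": ["http://www.dmoz.org/some/path/"], "desc": ["\r\n\tA short description of the entry.\r\n"]}

Note that older Scrapy versions append to an existing output file when using -o, so it may be worth deleting items.json before re-running the crawl.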