Implementing Persistent Storage with the Scrapy Framework

Disk storage

(1) Using a terminal command

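* Make sure the parse method returns an iterable object (holding the content parsed from the page)
* Use a terminal command to write the data to a file on disk
scrapy crawl <spider name> -o <output file>.<extension>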

def parse(self, response):
    # XPath is recommended for parsing (the framework has a built-in XPath interface)
    div_list = response.xpath('//div[@id="content-left"]/div')
    # Collect the parsed page data
    data_list = []
    for div in div_list:
        # xpath() returns Selector objects that wrap the matched content;
        # extract() pulls the stored values out of them as a list of strings
        author = div.xpath('./div/a[2]/h2/text()').extract_first()
        # extract_first() returns the first extracted value, or None if nothing matched
        content = div.xpath('.//div[@class="content"]/span/text()').extract_first()
        data_dict = {
            'author': author,
            'content': content
        }
        data_list.append(data_dict)
    return data_list
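
As a rough illustration of what the export step does with the list that parse returns, the sketch below writes a list of dicts to JSON or CSV depending on the file extension. It uses only the standard library and is not Scrapy's own implementation; the export_items helper and the sample rows are hypothetical.

```python
import csv
import json
import os
import tempfile


def export_items(items, path):
    """Write a list of dicts to `path`; the format is chosen from the extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext == '.json':
        with open(path, 'w', encoding='utf-8') as fp:
            json.dump(items, fp, ensure_ascii=False)
    elif ext == '.csv':
        with open(path, 'w', encoding='utf-8', newline='') as fp:
            writer = csv.DictWriter(fp, fieldnames=['author', 'content'])
            writer.writeheader()
            writer.writerows(items)
    else:
        raise ValueError('unsupported extension: ' + ext)


# Hypothetical sample data in the same shape that parse() returns
sample = [{'author': 'alice', 'content': 'first post'},
          {'author': 'bob', 'content': 'second post'}]
out_dir = tempfile.mkdtemp()
json_path = os.path.join(out_dir, 'qiubai.json')
csv_path = os.path.join(out_dir, 'qiubai.csv')
export_items(sample, json_path)
export_items(sample, csv_path)
```

With Scrapy itself none of this is written by hand: running scrapy crawl qiubai -o qiubai.json (or qiubai.csv) picks the output format from the extension.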

(2) Using a pipeline

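* items: holds the parsed page data
* pipelines: handles the persistence operations
* Code flow:

Store the parsed page data in an item object
Use the yield keyword to submit the item to the pipeline file for processing
Write the storage code in the pipeline file
Enable the pipeline in the settings file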
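
The flow above can be pictured with a toy driver standing in for Scrapy's engine. This is only a sketch: RecordingPipeline and run_spider are hypothetical names, and the real engine does far more, but the order in which the three pipeline hooks are called is the point.

```python
class RecordingPipeline:
    """Hypothetical pipeline that records which hooks were called."""

    def __init__(self):
        self.calls = []

    def open_spider(self, spider):
        self.calls.append('open_spider')

    def process_item(self, item, spider):
        self.calls.append('process_item')
        return item

    def close_spider(self, spider):
        self.calls.append('close_spider')


def run_spider(items, pipeline, spider=None):
    """Toy stand-in for the engine: drive the pipeline hooks in order."""
    pipeline.open_spider(spider)
    for item in items:
        # Each item the spider yields triggers process_item once
        pipeline.process_item(item, spider)
    pipeline.close_spider(spider)


pipeline = RecordingPipeline()
run_spider([{'author': 'a', 'content': 'x'},
            {'author': 'b', 'content': 'y'}], pipeline)
print(pipeline.calls)
```

open_spider and close_spider each run once, while process_item runs once per item, which is why the file or connection is opened and closed in the outer hooks.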

import scrapy
from spiderqiubai.items import SpiderqiubaiItem  # project package name as used in settings.py


class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains = ['www.qiushibaike.com/text']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        # XPath is recommended for parsing (the framework has a built-in XPath interface)
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            # xpath() returns Selector objects that wrap the matched content;
            # extract() pulls the stored values out of them as a list of strings
            author = div.xpath('./div/a[2]/h2/text()').extract_first()
            # extract_first() returns the first extracted value, or None if nothing matched
            content = div.xpath('.//div[@class="content"]/span/text()').extract_first()

            # Store the parsed values (author and content) in an item object
            item = SpiderqiubaiItem()
            item['author'] = author
            item['content'] = content
            # Submit the item to the pipeline
            yield item

qiubai.py

import scrapy
class SpiderqiubaiItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()

items.py

class SpiderqiubaiPipeline(object):
    fp = None

    # Called exactly once, when the spider starts
    def open_spider(self, spider):
        print('Spider started')
        self.fp = open('./qiubai_pipe.txt', 'w', encoding='utf-8')

    # Receives each item submitted by the spider and persists the page data it holds
    # item: the item object that was submitted
    # Runs once for every item the spider yields
    def process_item(self, item, spider):
        # Read the values stored in the item; extract_first() may have returned None
        author = item['author'] or ''
        content = item['content'] or ''
        # Persist them
        self.fp.write(author + ":" + content + "\n\n\n")
        return item

    # Called exactly once, when the spider finishes
    def close_spider(self, spider):
        print("Spider finished")
        self.fp.close()

pipelines.py
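
The last step of the flow, enabling the pipeline, is a single entry in settings.py. A minimal sketch, assuming the project package is named spiderqiubai (the name used in the settings shown later in this article):

```python
# settings.py (sketch): map the pipeline's import path to its order value
ITEM_PIPELINES = {
    # 'package.module.ClassName': order
    'spiderqiubai.pipelines.SpiderqiubaiPipeline': 300,
}
```

After that, scrapy crawl qiubai runs the spider and the pipeline writes qiubai_pipe.txt into the directory the command was started from, since the path in open() is relative.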

Database storage

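* Code flow:

Store the parsed page data in an item object
Use the yield keyword to submit the item to the pipeline file for processing
Write the storage code (inserting into the database) in the pipeline file
Enable the pipeline in the settings file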

import pymysql


class SpiderqiubaiPipeline(object):
    conn = None

    # Called exactly once, when the spider starts
    def open_spider(self, spider):
        # Connect to the database
        self.conn = pymysql.Connect(host='192.168.1.10', port=3306, user='root', password='cs1993413', db='qiubai')

    # Receives each item submitted by the spider and persists the page data it holds
    # item: the item object that was submitted
    # Runs once for every item the spider yields
    def process_item(self, item, spider):
        # Parameterized query: the driver escapes the values, so quotes in the
        # scraped text cannot break the statement
        sql = 'insert into qiubai values(%s, %s)'
        self.cursor = self.conn.cursor()
        try:
            # 1. Execute the SQL statement
            self.cursor.execute(sql, (item['author'], item['content']))
            # 2. Commit the transaction
            self.conn.commit()
        except Exception as e:
            # Report the failure instead of silently discarding it, then roll back
            print(e)
            self.conn.rollback()
        return item

    # Called exactly once, when the spider finishes
    def close_spider(self, spider):
        self.conn.close()
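
The insert, commit, and rollback pattern can be tried without a MySQL server. The sketch below swaps in the standard library's sqlite3 with an in-memory database, so it is an illustration rather than the pipeline's actual driver; the two-column table layout and the save helper are assumptions. Note that sqlite3 uses ? placeholders where pymysql uses %s.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL server
conn = sqlite3.connect(':memory:')
# Assumed schema: two text columns, matching the two inserted values
conn.execute('create table qiubai (author text, content text)')


def save(conn, author, content):
    """Insert one row; commit on success, roll back on a database error."""
    try:
        # Values are passed as parameters, not formatted into the SQL string
        conn.execute('insert into qiubai values (?, ?)', (author, content))
        conn.commit()
        return True
    except sqlite3.Error:
        conn.rollback()
        return False


# Text containing double quotes would break a string-formatted statement,
# but is stored unchanged when passed as a parameter
ok = save(conn, 'alice', 'she said "hello"')
rows = conn.execute('select author, content from qiubai').fetchall()
print(ok, rows)
```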

Redis storage

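import json

import redis


class SpiderqiubaiPipeline(object):
    conn = None

    # Called exactly once, when the spider starts
    def open_spider(self, spider):
        # Connect to Redis
        self.conn = redis.Redis(host='192.168.1.10', port=6379)

    def process_item(self, item, spider):
        data_dict = {
            'author': item['author'],
            'content': item['content']
        }
        # redis-py 3.0+ only accepts bytes, str, int, or float values,
        # so serialize the dict to JSON before pushing it
        self.conn.lpush('data', json.dumps(data_dict, ensure_ascii=False))
        return item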
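
Redis list elements are strings, so structured data is typically serialized before LPUSH and parsed again after reading. The sketch below mimics that round trip with a collections.deque standing in for the Redis list under the key 'data'; no Redis server is involved, and the lpush helper and sample items are hypothetical.

```python
import json
from collections import deque

# Stand-in for the Redis list stored under the key 'data'
data = deque()


def lpush(value):
    """Mimic LPUSH: insert the value at the head of the list."""
    data.appendleft(value)


for item in ({'author': 'alice', 'content': 'first'},
             {'author': 'bob', 'content': 'second'}):
    lpush(json.dumps(item, ensure_ascii=False))

# Reading back: the most recently pushed element sits at index 0
restored = [json.loads(raw) for raw in data]
print(restored[0]['author'])
```

Against a real server, conn.lrange('data', 0, -1) returns the stored elements (as bytes by default), and each one can be passed to json.loads in the same way.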

Advanced pipeline usage

Store the data on local disk, in the database, and in Redis at the same time

# Store the data on local disk
class QiubaiByFiels(object):
    fp = None

    def open_spider(self, spider):
        print('Spider started')
        self.fp = open('./qiubai_pipe.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # extract_first() may have returned None, so fall back to an empty string
        author = item['author'] or ''
        content = item['content'] or ''
        self.fp.write(author + ":" + content + "\n\n\n")
        return item

    def close_spider(self, spider):
        print("Spider finished")
        self.fp.close()

# Store the data in a MySQL database

class QiubaiByMysql(object):
    conn = None

    def open_spider(self, spider):
        self.conn = pymysql.Connect(host='192.168.1.10', port=3306, user='root', password='cs1993413', db='qiubai')

    def process_item(self, item, spider):
        # Parameterized query: the driver escapes the values
        sql = 'insert into qiubai values(%s, %s)'
        self.cursor = self.conn.cursor()
        try:
            self.cursor.execute(sql, (item['author'], item['content']))
            self.conn.commit()
        except Exception as e:
            # Report the failure, then roll back
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.conn.close()

settings.py

ITEM_PIPELINES = {
    # Items pass through the pipelines from the lowest value to the highest
    'spiderqiubai.pipelines.SpiderqiubaiPipeline': 300,
    'spiderqiubai.pipelines.QiubaiByMysql': 200,
    'spiderqiubai.pipelines.QiubaiByFiels': 100,
}
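
The numbers decide the order in which each item visits the pipelines, and every process_item has to return the item so the next pipeline receives it. The sketch below reproduces that chaining in plain Python; the Stage class and the loop are hypothetical stand-ins for Scrapy's pipeline manager, not its actual code.

```python
ITEM_PIPELINES = {
    'spiderqiubai.pipelines.SpiderqiubaiPipeline': 300,
    'spiderqiubai.pipelines.QiubaiByMysql': 200,
    'spiderqiubai.pipelines.QiubaiByFiels': 100,
}

# Sort by value: lower numbers run first
order = [path.rsplit('.', 1)[1]
         for path, _ in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]
print(order)


class Stage:
    """Hypothetical pipeline stage that tags the item and passes it on."""

    def __init__(self, name):
        self.name = name

    def process_item(self, item, spider):
        item.setdefault('seen_by', []).append(self.name)
        # In this toy loop, omitting the return would hand None to the next stage
        return item


item = {'author': 'alice', 'content': 'hi'}
for stage in [Stage(name) for name in order]:
    item = stage.process_item(item, spider=None)
```

With the values above, the file pipeline (100) sees each item first, then the MySQL pipeline (200), then SpiderqiubaiPipeline (300).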

Original article: https://www.cnblogs.com/harryblog/p/11356354.html

Date: 2024-10-08 12:21:40
