Scrapy crawler example: storing scraped data in MongoDB

The spider .py file

# -*- coding: utf-8 -*-
import scrapy
from ..items import RtysItem

class RtSpider(scrapy.Spider):
    name = 'rt'      # spider name, used when launching the crawl (scrapy crawl rt)
    # allowed_domains = ['www.baidu.com']     # restricts which domains may be crawled; commented out here
    start_urls = ['https://www.woyaogexing.com/touxiang/']    # start URL; requested automatically when the spider starts

    def parse(self, response):  # response is the downloaded page
        div_list = response.xpath('//div[@class="list-left z"]/div[2]/div')  # select the blocks to parse
        for i in div_list:
            name = i.xpath('./a/text()').extract_first()  # field names must match those declared in items.py
            img_url = i.xpath('./a/img/@src').extract_first()
            lianjie_url = i.xpath('./a/@href').extract_first()
            items = RtysItem()  # instantiate the item
            items['name'] = name  # store each extracted value on the item
            items['img_url'] = img_url
            items['lianjie_url'] = lianjie_url
            yield items  # hand the item over to the pipeline
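
To launch the crawl you normally run scrapy crawl rt from the project root. As an illustration, the spider can also be started from a small Python script; this is only a sketch, and it assumes the script sits next to scrapy.cfg so that get_project_settings() can locate the project:

# run.py -- illustrative runner; assumed to live next to scrapy.cfg
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # loads settings.py (pipelines, USER_AGENT, ...)
process.crawl('rt')  # 'rt' is the spider name defined on RtSpider
process.start()      # blocks until the crawl finishes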

The pipelines.py file

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo

class RtysPipeline(object):
    def process_item(self, item, spider):
        conn = pymongo.MongoClient('localhost', 27017)  # connect to the local MongoDB server
        db = conn.rtys  # the database is used if it exists, otherwise created on first write
        table = db.rt   # the collection is used if it exists, otherwise created on first write
        table.insert_one(dict(item))  # convert the item to a dict and insert one document
        return item
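
Opening a new MongoClient for every single item works, but it is wasteful because process_item runs once per scraped item. A common refinement, shown here only as a sketch of the same pipeline, is to open the connection once when the spider starts and close it when the spider finishes:

import pymongo

class RtysPipeline(object):
    def open_spider(self, spider):
        # connect once per crawl instead of once per item
        self.conn = pymongo.MongoClient('localhost', 27017)
        self.table = self.conn.rtys.rt  # database rtys, collection rt

    def process_item(self, item, spider):
        self.table.insert_one(dict(item))  # convert the item to a dict and insert it
        return item

    def close_spider(self, spider):
        self.conn.close()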

When storing into MongoDB, make sure settings.py is configured correctly: the relevant commented-out sections have to be uncommented.

The settings.py file


# -*- coding: utf-8 -*-

# Scrapy settings for rtys project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'rtys'

SPIDER_MODULES = ['rtys.spiders']
NEWSPIDER_MODULE = 'rtys.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False   # False: ignore the site's robots.txt restrictions; True: obey them

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'rtys.middlewares.RtysSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html

# Enable the downloader middleware when you need to route requests through a proxy to change your IP (see the proxy middleware sketch after this settings file):
#DOWNLOADER_MIDDLEWARES = {
#    'rtys.middlewares.RtysDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {     # must be uncommented so that items reach the MongoDB pipeline
   'rtys.pipelines.RtysPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
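
If you do uncomment DOWNLOADER_MIDDLEWARES to change your outgoing IP, the middleware itself has to attach a proxy to each request. This is only a minimal sketch of what middlewares.py could contain; the proxy address is a placeholder, not a working proxy:

# middlewares.py -- minimal proxy sketch; replace the placeholder address with a real proxy
class RtysDownloaderMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://127.0.0.1:8888'  # route the request through this proxy
        return None  # returning None lets Scrapy continue downloading normally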

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class RtysItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()  # declare one Field for every value you scrape
    img_url = scrapy.Field()
    lianjie_url = scrapy.Field()
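
After the crawl has run, the stored documents can be checked directly with pymongo. A quick verification snippet, assuming the database (rtys) and collection (rt) names used in the pipeline above:

import pymongo

conn = pymongo.MongoClient('localhost', 27017)
for doc in conn.rtys.rt.find().limit(5):  # look at the first few stored records
    print(doc['name'], doc['img_url'], doc['lianjie_url'])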

Original article: https://www.cnblogs.com/pp8080/p/12191213.html
