Scrapy crawler example: storing scraped data in MongoDB

Spider .py file

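# -*- coding: utf-8 -*-
import scrapy
from ..items import RtysItem


class RtSpider(scrapy.Spider):
    name = 'rt'  # spider name, used to start the crawl: scrapy crawl rt
    # allowed_domains = ['www.baidu.com']  # limits the crawl to these domains; optional, so left commented out
    start_urls = ['https://www.woyaogexing.com/touxiang/']  # start URL; Scrapy requests it automatically when the spider starts

    def parse(self, response):  # response is the downloaded page for each start URL
        div_list = response.xpath('//div[@class="list-left z"]/div[2]/div')  # select the entry nodes
        for div in div_list:
            # The keys assigned below must match the fields declared in items.py
            name = div.xpath('./a/text()').extract_first()
            img_url = div.xpath('./a/img/@src').extract_first()
            lianjie_url = div.xpath('./a/@href').extract_first()  # link URL ("lianjie" = link)
            item = RtysItem()  # instantiate the item
            item['name'] = name  # store each extracted value under its field
            item['img_url'] = img_url
            item['lianjie_url'] = lianjie_url
            yield item  # hand the item to the pipeline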
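
The `src` and `href` values extracted above are stored exactly as they appear in the page, so they may be relative or protocol-relative (an assumption about the markup, since the post does not show it). Inside a Scrapy callback, `response.urljoin(url)` resolves such values against the page URL. The sketch below shows the same resolution with the standard library's `urljoin`, so it runs without Scrapy; the helper name `absolutize` and the sample paths are made up for illustration.

```python
from urllib.parse import urljoin

PAGE_URL = 'https://www.woyaogexing.com/touxiang/'  # start URL from the spider


def absolutize(url, base=PAGE_URL):
    """Resolve a possibly relative URL against the page it was found on."""
    if url is None:  # extract_first() returns None when nothing matched
        return None
    return urljoin(base, url)


# Hypothetical extracted values, not taken from the real site
print(absolutize('/touxiang/qinglv/example.html'))
print(absolutize('//img.example.com/a.jpg'))
print(absolutize(None))
```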

pipelines.py file

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo

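class RtysPipeline(object):
    def open_spider(self, spider):
        # Connect once when the spider starts instead of once per item
        self.client = pymongo.MongoClient('localhost', 27017)
        db = self.client.rtys  # database; used if it exists, otherwise created on first write
        self.table = db.rt     # collection; used if it exists, otherwise created on first write

    def close_spider(self, spider):
        self.client.close()  # release the connection when the spider finishes

    def process_item(self, item, spider):
        self.table.insert_one(dict(item))  # convert the item to a dict and insert one document
        return item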

To store data in MongoDB, remember to uncomment the ITEM_PIPELINES setting in settings.py; otherwise the pipeline is never called.
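
As a rough illustration of what the pipeline setting means: `ITEM_PIPELINES` maps each pipeline's import path to an integer, conventionally in the 0 to 1000 range, and Scrapy passes items through the enabled pipelines from the lowest number to the highest. The sketch below only mimics that ordering in plain Python; `rtys.pipelines.CleanPipeline` is a hypothetical second pipeline added for the example, and `pipeline_order` is not a Scrapy function.

```python
ITEM_PIPELINES = {
    'rtys.pipelines.RtysPipeline': 300,
    'rtys.pipelines.CleanPipeline': 100,  # hypothetical, for illustration only
}


def pipeline_order(pipelines):
    """Return pipeline paths in the order items would pass through them."""
    ordered = sorted(pipelines.items(), key=lambda pair: pair[1])
    return [path for path, _priority in ordered]


print(pipeline_order(ITEM_PIPELINES))
```

With only one pipeline, as in this project, the number has no practical effect; it matters once several pipelines are enabled.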

settings.py file


# -*- coding: utf-8 -*-

# Scrapy settings for rtys project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'rtys'

SPIDER_MODULES = ['rtys.spiders']
NEWSPIDER_MODULE = 'rtys.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False   # False: ignore robots.txt, so the crawl is not restricted; True: obey the site's robots.txt rules

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'rtys.middlewares.RtysSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html

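# Enable the downloader middleware below when you need to change the outgoing IP (for example, to route requests through a proxy)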
#DOWNLOADER_MIDDLEWARES = {
#    'rtys.middlewares.RtysDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {     # must be uncommented, otherwise the pipeline does not run
   'rtys.pipelines.RtysPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
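
The settings above note that the downloader middleware has to be enabled when the outgoing IP should be changed. One common way to do that, sketched here as an assumption rather than the author's code, is a downloader middleware whose `process_request` sets `request.meta['proxy']`, the key Scrapy's built-in HttpProxyMiddleware reads. The class name `RandomProxyMiddleware` and the proxy addresses are placeholders, and the small `FakeRequest` class exists only so the sketch runs without Scrapy.

```python
import random

# Placeholder proxy addresses; replace with proxies you are allowed to use
PROXIES = ['http://127.0.0.1:8888', 'http://127.0.0.1:8889']


class RandomProxyMiddleware:
    """Downloader middleware sketch: route each request through a random proxy."""

    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(PROXIES)
        return None  # None tells Scrapy to keep processing the request normally


class FakeRequest:
    """Stand-in for scrapy.Request, only so this sketch runs without Scrapy."""

    def __init__(self, url):
        self.url = url
        self.meta = {}


request = FakeRequest('https://www.woyaogexing.com/touxiang/')
result = RandomProxyMiddleware().process_request(request, spider=None)
print(request.meta['proxy'])
```

Assuming the class were placed in rtys/middlewares.py, it would be switched on by adding 'rtys.middlewares.RandomProxyMiddleware': 543 to DOWNLOADER_MIDDLEWARES.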

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class RtysItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()  # one Field per value to scrape; declare as many as you need
    img_url = scrapy.Field()
    lianjie_url = scrapy.Field()
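
A `scrapy.Item` only accepts the fields declared on it: assigning to an undeclared key raises `KeyError`, which is why the keys used in the spider have to match the field names here. The following is a dependency-free stand-in that mimics that check for illustration. It is not Scrapy's implementation, and `StrictItem` and `RtysItemSketch` are made-up names.

```python
class StrictItem(dict):
    """Dict that rejects keys not listed in `fields` (mimics the scrapy.Item check)."""
    fields = ()

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError('%s does not support field: %s' % (type(self).__name__, key))
        super().__setitem__(key, value)


class RtysItemSketch(StrictItem):
    fields = ('name', 'img_url', 'lianjie_url')


item = RtysItemSketch()
item['name'] = 'example'  # declared field: accepted
try:
    item['image_url'] = 'x.jpg'  # typo, not a declared field
    rejected = False
except KeyError as exc:
    rejected = True
    print('rejected:', exc)
print(dict(item))
```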

Original article: https://www.cnblogs.com/pp8080/p/12191213.html

Date: 2024-10-09 21:12:27
