Scraping Technical Articles from a Well-Known Community_pipelines_4

Storing the scraped fields and obtaining the path of the downloaded image

#!/usr/bin/python3
# -*- coding: utf-8 -*-

import gevent
import pymysql
from scrapy.pipelines.images import ImagesPipeline

class JobboleImagerPipeline(ImagesPipeline):
    """
    Record the local path of the downloaded image on the item.
    """
    def item_completed(self, results, item, info):
        if 'img_url' in item:
            # results is a list of (success, info) tuples; info is a dict
            # with a 'path' key only when the download succeeded
            for ok, value in results:
                if ok:
                    item['img_path'] = value['path']
        return item

# class SqlSave(object):
#     """Plain synchronous insert into the database."""
#     def __init__(self):
#         SQL_DBA = {
#             'host': 'localhost',
#             'db': 'jobole',
#             'user': 'root',
#             'password': 'jiayuan95814',
#             'use_unicode': True,
#             'charset': 'utf8'
#         }
#         self.conn = pymysql.connect(**SQL_DBA)
#         self.cursor = self.conn.cursor()
#
#     def process_item(self, item, spider):
#         sql = self.get_sql(item)
#         print(sql)
#         self.cursor.execute(sql)
#         self.conn.commit()
#
#         return item
#
#     def get_sql(self, item):
#         # Note: the values are formatted straight into the SQL string here,
#         # so quotes inside the scraped text will break the statement
#         sql = """insert into article(cont_id, cont_url, title, publish_time, cont, img_url, img_path, like_num, collection_num, comment_num) value ('%s','%s','%s','%s','%s','%s','%s', %d, %d, %d)
#         """ % (item['cont_id'], item['cont_url'],item['title'],item['publish_time'],item['cont'],item['img_url'][0],item['img_path'],item['link_num'],item['collection_num'],item['comment_num'],)
#         return sql

class SqlSave(object):
    """
    Insert data into the database from a gevent greenlet (coroutine).
    """

    def __init__(self):
        # Set up the database connection and its parameters. SQL_DBA could
        # instead be defined in settings.py and read through from_settings:
        # @classmethod
        # def from_settings(cls, settings):
        #     sql_dba = settings['SQL_DBA']
        #     return cls(sql_dba)    # __init__ then needs an extra parameter to receive this value
        SQL_DBA = {
            'host': 'localhost',
            'db': 'jobole',
            'user': 'root',
            'password': 'jiayuan95814',
            'use_unicode': True,
            'charset': 'utf8'
        }
        self.conn = pymysql.connect(**SQL_DBA)
        self.cursor = self.conn.cursor()

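    def process_item(self, item, spider):
        sql = self.__get_sql(item)
        # Run the database insert in a greenlet. joinall waits for the
        # greenlet to finish, so this call returns only after the insert is done.
        gevent.joinall([
            gevent.spawn(self.__go_sql, self.cursor, self.conn, sql, item),
        ])
        return item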

    def __go_sql(self, cursor, conn, sql, item):
        try:
            # Insert the row; pymysql escapes the values bound to the %s placeholders
            cursor.execute(sql,
                           (item['cont_id'], item['cont_url'], item['title'], item['publish_time'],
                            item['cont'], item['img_url'][0], item['img_path'], item['link_num'],
                            item['collection_num'], item['comment_num']))
            conn.commit()
        except Exception as e:
            print(e)

    def __get_sql(self, item):
        # Build the parameterized SQL statement
        sql = """insert into
                  article(cont_id, cont_url, title, publish_time,
                  cont, img_url, img_path, like_num,
                  collection_num, comment_num)
                value
                  (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"""
        return sql
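
The comments in `SqlSave.__init__` mention moving `SQL_DBA` into settings.py and reading it through `from_settings`. A minimal sketch of that idea follows. The class name `SqlSavePipeline`, the `open_spider`/`close_spider` split, and the placeholder password are assumptions made for illustration and are not part of the original code. Newer Scrapy releases favor `from_crawler` over `from_settings`, so check which hook your version calls.

```python
# Hypothetical settings.py entry (the values are placeholders)
SQL_DBA = {
    'host': 'localhost',
    'db': 'jobole',
    'user': 'root',
    'password': 'change-me',
    'use_unicode': True,
    'charset': 'utf8',
}


class SqlSavePipeline(object):
    """Illustrative variant of SqlSave that receives its connection
    parameters from the settings instead of hard-coding them."""

    def __init__(self, sql_dba):
        # Keep a copy of the parameters; connect later, when the spider opens
        self.sql_dba = dict(sql_dba)
        self.conn = None
        self.cursor = None

    @classmethod
    def from_settings(cls, settings):
        # Works with anything that supports settings['SQL_DBA'],
        # including a plain dict
        return cls(settings['SQL_DBA'])

    def open_spider(self, spider):
        # Imported here so the class can be constructed without the driver
        import pymysql
        self.conn = pymysql.connect(**self.sql_dba)
        self.cursor = self.conn.cursor()

    def close_spider(self, spider):
        if self.conn is not None:
            self.conn.close()
```

Deferring the connection to `open_spider` keeps construction free of side effects, so the class can be built from a plain dict, for example `SqlSavePipeline.from_settings({'SQL_DBA': SQL_DBA})`, without a running MySQL server.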

Time: 2024-10-13 01:33:21
