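A Python Crawler Utility Class

I wrote a utility class for web crawlers. The helper functions live in utils.py, shown below.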

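# -*- coding: utf-8 -*-
# @Time    : 2018/8/7 16:29
# @Author  : cxa
# @File    : utils.py
# @Software: PyCharm
import datetime
import hashlib

from glom import glom
from retrying import retry

from config import headers
from decorators.decorators import decorator


# @decorator is placed outermost so that @retry still sees the exception and
# can retry; the error is only logged once the retries are exhausted.
@decorator
@retry(stop_max_attempt_number=3, wait_fixed=2000, stop_max_delay=10000)
def post_html(session, post_url: str, post_data: dict, headers=headers, timeout=30):
    '''
    Send a POST request and return the response.

    :param session: session object used to send the request
    :param post_url: URL for the POST request
    :param post_data: POST form data, as a dict
    :param headers: request headers; the config module provides the default
    :param timeout: request timeout in seconds
    :return: the response object if the status code is 200, otherwise None
    '''
    post_req = session.post(url=post_url, headers=headers, data=post_data, timeout=timeout)
    if post_req.status_code == 200:
        post_req.encoding = post_req.apparent_encoding
        return post_req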

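# @decorator is placed outermost so that @retry still sees the exception and
# can retry; the error is only logged once the retries are exhausted.
@decorator
@retry(stop_max_attempt_number=3, wait_fixed=2000, stop_max_delay=10000)
def get_response(session, url: str, headers=headers, timeout=30):
    '''
    Send a GET request and return the response.

    :param session: session object used to send the request
    :param url: URL to fetch
    :param headers: request headers; the config module provides the default
    :param timeout: request timeout in seconds
    :return: the response object if the status code is 200, otherwise None
    '''
    req = session.get(url=url, headers=headers, timeout=timeout)
    if req.status_code == 200:
        req.encoding = req.apparent_encoding
        return req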

@decorator
def get_html(req):
    source=req.text
    return source

@decorator
def get_json(req):
    jsonstr=req.json()
    return jsonstr

@decorator
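def get_xpath(req, xpathstr: str):
    '''
    :param req: response object exposing an `html` attribute
                (e.g. a response from requests_html's HTMLSession;
                a plain requests.Response has no `html`)
    :param xpathstr: XPath expression
    :return: the nodes matched by the expression
    '''
    node = req.html.xpath(xpathstr)
    return node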

@decorator
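def get_json_data(jsonstr, pat: str):
    '''
    Extract data from parsed JSON with the glom module.

    :param jsonstr: parsed JSON data (nested dict/list), e.g. the result of get_json
    :param pat: glom path spec, e.g. 'a.b.c'
    :return: the value found at that path
    '''
    item = glom(jsonstr, pat)
    return item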

@decorator
def get_hash_code(key):
    value = hashlib.md5(key.encode('utf-8')).hexdigest()
    return value

@decorator
def get_datetime_from_unix(unix_time):
    unix_time_value=unix_time
    if not isinstance(unix_time_value,int):
        unix_time_value=int(unix_time)
    new_datetime=datetime.datetime.fromtimestamp(unix_time_value)
    return new_datetime
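
One thing worth checking when a retry wrapper is stacked with a logging decorator that swallows exceptions: whichever decorator is outermost only sees what the inner one lets through. The sketch below does not use the `retrying` library; `simple_retry` and `swallow_errors` are hypothetical stand-ins written only for this illustration. It counts how many times a failing function actually runs under each stacking order.

```python
from functools import wraps


def simple_retry(attempts):
    """Hypothetical stand-in for a retry decorator: re-run on any exception."""
    def wrap(func):
        @wraps(func)
        def inner(*args, **kwargs):
            last_error = None
            for _ in range(attempts):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    last_error = exc
            raise last_error
        return inner
    return wrap


def swallow_errors(func):
    """Stand-in for a logging decorator that catches everything and returns None."""
    @wraps(func)
    def inner(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            return None
    return inner


calls = {"retry_outside": 0, "retry_inside": 0}


@simple_retry(3)      # outermost: never sees an exception, so never retries
@swallow_errors
def retry_outside():
    calls["retry_outside"] += 1
    raise ConnectionError("simulated network failure")


@swallow_errors       # outermost: only catches after the retries are used up
@simple_retry(3)
def retry_inside():
    calls["retry_inside"] += 1
    raise ConnectionError("simulated network failure")


result_a = retry_outside()
result_b = retry_inside()
print(calls)
```

Both calls end up returning None, but only the second order gives the function its three attempts. The same reasoning should apply to any retry wrapper that is triggered by exceptions.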

Below is the content of the decorators file, which defines the decorator used above.

# -*- coding: utf-8 -*-
# @Time    : 2018/03/28 15:35
# @Author  : cxa
# @File    : decorators.py
# @Software: PyCharm
from functools import wraps
from logger.log import get_logger
import traceback
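def decorator(func):
    # Wrap func so that any exception is logged with its traceback instead of
    # propagating; in that case the wrapped call returns None.
    @wraps(func)
    def log(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            get_logger().error("{} is error,here are details:{}".format(func.__name__, traceback.format_exc()))
    return log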
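
Because the wrapper catches every exception, a decorated function returns None on failure, so callers need to check for None. Here is a minimal sketch of that behaviour, assuming a plain stdlib logger named `demo` in place of `get_logger()`; `parse_int` is a hypothetical function used only for illustration.

```python
import logging
import traceback
from functools import wraps

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("demo")  # stand-in for get_logger()


def decorator(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # Log the full traceback; the wrapper then falls through and returns None.
            log.error("%s failed, details: %s", func.__name__, traceback.format_exc())
    return wrapper


@decorator
def parse_int(text):
    return int(text)


ok = parse_int("42")             # normal path: the result is passed through
bad = parse_int("not a number")  # ValueError is logged, the call returns None

if bad is None:
    print("parse_int failed, see the log for details")
```

Thanks to `functools.wraps`, `parse_int.__name__` is still `parse_int`, which keeps the log messages readable.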

Below is the content of the headers file.

import random

first_num = random.randint(55, 62)
third_num = random.randint(0, 3200)
fourth_num = random.randint(0, 140)


class FakeChromeUA:
    os_type = [
        '(Windows NT 6.1; WOW64)', '(Windows NT 10.0; WOW64)', '(X11; Linux x86_64)',
        '(Macintosh; Intel Mac OS X 10_12_6)'
    ]

    chrome_version = 'Chrome/{}.0.{}.{}'.format(first_num, third_num, fourth_num)

    @classmethod
    def get_ua(cls):
        return ' '.join(['Mozilla/5.0', random.choice(cls.os_type), 'AppleWebKit/537.36',
                         '(KHTML, like Gecko)', cls.chrome_version, 'Safari/537.36']
                        )


headers = {
    'User-Agent': FakeChromeUA.get_ua(),
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Connection': 'keep-alive'
}
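
Note that the version numbers and the `headers` dict are computed once, when the module is imported, so every request in the same process sends the same User-Agent. If a fresh User-Agent per request is wanted, one option is to build the headers inside a function. The sketch below is only an illustration; `random_chrome_ua` and `build_headers` are hypothetical names that do not appear in the original code.

```python
import random

OS_TYPES = [
    '(Windows NT 6.1; WOW64)', '(Windows NT 10.0; WOW64)', '(X11; Linux x86_64)',
    '(Macintosh; Intel Mac OS X 10_12_6)',
]


def random_chrome_ua():
    """Build a Chrome-like User-Agent with a freshly randomised version."""
    chrome_version = 'Chrome/{}.0.{}.{}'.format(
        random.randint(55, 62), random.randint(0, 3200), random.randint(0, 140)
    )
    return ' '.join(['Mozilla/5.0', random.choice(OS_TYPES), 'AppleWebKit/537.36',
                     '(KHTML, like Gecko)', chrome_version, 'Safari/537.36'])


def build_headers():
    """Return a new headers dict on every call, each with its own User-Agent."""
    return {
        'User-Agent': random_chrome_ua(),
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Connection': 'keep-alive',
    }


sample = build_headers()
print(sample['User-Agent'])
```

A caller would then pass `headers=build_headers()` explicitly instead of relying on the module-level default.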

Below is the content of the logger file.

# -*- coding: utf-8 -*-
import os
import time
import logging
import sys

# Logs are written to logs/<YYYYMMDD>/t.log under the parent of this file's directory
log_dir1 = os.path.join(os.path.dirname(os.path.dirname(__file__)), "logs")
today = time.strftime('%Y%m%d', time.localtime(time.time()))
full_path = os.path.join(log_dir1, today)
if not os.path.exists(full_path):
    os.makedirs(full_path)
log_path = os.path.join(full_path, "t.log")


def get_logger():
    # Get the logger instance named "t"; an empty name would return the root logger
    logger = logging.getLogger("t")
    if not logger.handlers:
        # Output format for log records
        formatter = logging.Formatter('%(asctime)s %(levelname)-8s: %(message)s')

        # File log
        file_handler = logging.FileHandler(log_path, encoding="utf8")
        file_handler.setFormatter(formatter)  # the format can be set with setFormatter

        # Console log
        console_handler = logging.StreamHandler(sys.stdout)
        console_handler.formatter = formatter  # or by assigning to formatter directly

        # Attach the handlers to the logger
        logger.addHandler(file_handler)
        logger.addHandler(console_handler)

        # Lowest level that gets logged; the default is WARN
        logger.setLevel(logging.INFO)
    # Original note: a line could be added here to remove the handlers after
    # logging; no such call is present in this version.
    return logger
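
The `if not logger.handlers` check is what keeps repeated calls from attaching duplicate handlers, which would otherwise print every message several times. The sketch below illustrates this with a variant of `get_logger` that takes the log path as a parameter and writes to a temporary directory. The parameterised signature and the logger name `t_demo` are assumptions made for the example, not part of the original code.

```python
import logging
import os
import sys
import tempfile


def get_logger(log_path, name="t_demo"):
    """Return a named logger, attaching handlers only on the first call."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        formatter = logging.Formatter('%(asctime)s %(levelname)-8s: %(message)s')

        file_handler = logging.FileHandler(log_path, encoding="utf8")
        file_handler.setFormatter(formatter)

        console_handler = logging.StreamHandler(sys.stdout)
        console_handler.setFormatter(formatter)

        logger.addHandler(file_handler)
        logger.addHandler(console_handler)
        logger.setLevel(logging.INFO)
    return logger


log_file = os.path.join(tempfile.mkdtemp(), "t.log")

first = get_logger(log_file)
second = get_logger(log_file)   # same logger object, no new handlers added
first.info("crawler started")

handler_count = len(second.handlers)
print(first is second, handler_count)
```

Since `logging.getLogger` returns the same object for the same name, the second call finds the handlers already in place and skips the setup.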

Original article: https://www.cnblogs.com/c-x-a/p/9438587.html

Time: 2024-08-16 23:01:46
