Python crawler optimization and error log analysis

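Identifying the problem

During downloads, the crawler always terminates abnormally after running for a while, and the process has to be killed and restarted by hand before the next run. The goal is to see whether the crawler can be optimized to reduce this manual work.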

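Error log analysis

I collected the nohup.out file and found that the main error was an array index out of range. The likely causes are:
1) The network is unstable and the HTTP request fails.
2) The request succeeds, but parsing the returned HTML fails.
3) The login cookie has expired.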

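Optimization approach

Every network request now checks whether the status code is 200, and the HTML parsing code checks array lengths before indexing and is wrapped in exception handling.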
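
The two guards described above might look roughly like the sketch below. It is a minimal illustration rather than the crawler's actual code: the helper names `first_match` and `is_ok` and the sample HTML rows are made up for the example. Only the `noticeId=(\d+)` pattern comes from the crawler itself.

```python
import re

NOTICE_ID = re.compile(r"noticeId=(\d+)")


def first_match(pattern, text):
    """Return the first capture of pattern in text, or None when nothing matches."""
    matches = pattern.findall(text)
    # Check the length before indexing, so an empty result cannot raise IndexError
    return matches[0] if len(matches) > 0 else None


def is_ok(status_code):
    """Treat only HTTP 200 as a usable response."""
    return status_code == 200


if __name__ == "__main__":
    # Hypothetical table rows, one with a notice link and one without
    with_link = '<tr><td><a href="detail?noticeId=123">notice</a></td></tr>'
    without_link = "<tr><td>no link here</td></tr>"
    print(first_match(NOTICE_ID, with_link))
    print(first_match(NOTICE_ID, without_link))
```

Returning None instead of raising lets the caller skip a bad row and continue with the rest of the page.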

Source code:

import socket
import time
import os
from datetime import datetime
import re
import yaml
import requests
from bs4 import BeautifulSoup

# Set the default socket timeout to 10 s
socket.setdefaulttimeout(10)
s = requests.Session()

# Log in
def login():
    url = host_url + "j_spring_security_check"

    data = {
        "username": bzh_host_usr,
        "password": bzh_host_pwd
    }

    try:
        response = s.post(url, data=data, headers=headers)
        if response.status_code == 200:
            cookie = response.cookies.get_dict()
            print("login success")
            return cookie
    except Exception as e:
        print("login fail:", e)

# Page count
def get_pages():
    try:
        response = s.get(noticeListUrl, data=paramsNotice, headers=headers, cookies=cookie)
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, "html.parser")
            pageCount = int(soup.find('span', id='PP_countPage').get_text())
            pageCount = pageCount if pageCount > 1 else 1
            return pageCount
    except Exception as e:
        print("get page_count fail:", e)

# Document ids
def get_ids(pageCount):
    ids = []
    for p in range(int(pageCount)):
        paramsNotice['pageIndex'] = p + 1

        try:
            response = s.get(noticeListUrl, data=paramsNotice, headers=headers, cookies=cookie)
            if response.status_code == 200:
                soup = BeautifulSoup(response.text, "html.parser")
                trs = soup.find("table", class_='ObjectListPage').tbody.find_all("tr")
                regex = re.compile(r"noticeId=(\d+)")

                for tr in trs:
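                    # "标准化文档更新" ("standardized document update") is the notice title to match.
                    # "in" also matches at index 0, which find() > 0 would miss.
                    if "标准化文档更新" in tr.text:
                        matches = regex.findall(str(tr))
                        if not matches:
                            # No noticeId in this row: skip it instead of raising IndexError
                            continue
                        id = matches[0]
                        ids.append(id)
                        print("bzh id:" + id)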

                        last_update = tr.find_all("td")[1].get_text().strip()
                        date_format = time.strftime("%Y%m%d", time.strptime(last_update, "%Y-%m-%d %H:%M:%S"))
                        file_name = "标准化文档-" + date_format + ".rar"

                        crawlFile(id, file_name)

        except Exception as e:
            print("get ids fail:", e)

    return ids

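# Download
def crawlFile(id, file_name):
    down_url = noticeURL + id
    metaFile = "./bzh/" + file_name

    response = s.get(down_url, headers=headers, cookies=cookie)
    if response.status_code != 200:
        print(file_name + " download fail: HTTP", response.status_code)
        return

    # The remote file name is only logged; guard against a missing header
    content = response.headers.get('Content-Disposition')
    if content:
        filename = content[content.find('=') + 1:]
        filename = filename.encode('iso-8859-1').decode('GBK')
        print("remote:" + filename)

    try:
        # "with" closes the file even if write() raises
        with open(metaFile, 'wb') as f:
            f.write(response.content)

        print(file_name + " first download success")
        # Stops the whole run after the first successful download
        # (SystemExit is not caught by "except Exception")
        exit(0)
    except Exception as e:
        print(file_name + " download fail", e)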

if __name__ == "__main__":
    yaml_path = os.path.join('../', 'config.yaml')
    with open(yaml_path, 'r') as f:
        config = yaml.load(f, Loader=yaml.FullLoader)

    host_url = config['host_url']
    noticeListUrl = host_url + config['noticeListUrl']
    noticeDetailUrl = host_url + config['noticeDetailUrl']
    noticeURL = host_url + config['noticeURL']

    bzh_host_usr = config['bzh_host_usr']
    bzh_host_pwd = config['bzh_host_pwd']
    table_meta_bg_date = config['table_meta_bg_date']

    # Request headers
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3",
        "Referer": host_url + "login.jsp"
    }

    paramsNotice = {
        "queryStartTime": table_meta_bg_date
    }

    task_begin = datetime.now()
    print("Crawler begin time:" + str(task_begin))

    cookie = login()
    # login() returns None on failure, so the original comparison with "" never matched
    if cookie is None:
        print("cookie is null")
        exit(0)

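    pageCount = get_pages()

    # NOTE: this hard-coded value overrides the result of get_pages(), so only the
    # first 2 pages are crawled. It looks like a debugging leftover.
    pageCount = 2
    if pageCount < 1:
        print("page < 1")
        exit(0)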

    ids = get_ids(pageCount)

    task_end = datetime.now()
    print("Crawler end time:" + str(task_end))
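
One further hardening step, which is not part of the script above, would be to retry a failed request a few times before giving up, since the first suspected cause is an unstable network. The helper below is a sketch under that assumption. The name `retry` and its parameters are hypothetical, and it uses only the standard library so that it can wrap any callable.

```python
import time


def retry(func, attempts=3, delay=1.0, backoff=2.0, sleep=time.sleep):
    """Call func() until it succeeds or the attempts are used up.

    Waits delay seconds after the first failure and multiplies the wait
    by backoff after each further failure. Re-raises the last exception
    if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    wait = delay
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as e:
            if attempt == attempts:
                # Out of attempts: let the caller see the real error
                raise
            print("attempt %d failed: %s; retrying in %.1fs" % (attempt, e, wait))
            sleep(wait)
            wait *= backoff
```

In the crawler, a call could be wrapped as `retry(lambda: s.get(noticeListUrl, headers=headers, cookies=cookie, timeout=10))`. The `timeout` argument is a real `requests` parameter. The `requests` documentation states that requests do not time out unless a timeout value is set explicitly, so passing it at the call site makes the limit visible.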

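Analysis of the results

The optimized crawler runs normally. The earlier exceptions are now caught and written to the error log.
The updated code ran in production for 4 days. I collected those 4 days of error logs to look at when the errors occur and see whether further optimization is possible.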

Source code:

import os
import matplotlib.pyplot as plt

if __name__ == "__main__":
    print("analyze error of bzh crawler")

    error_con = {}
    error_html = {}
    for i in range(0, 24):
        key = "0" + str(i) if i < 10 else str(i)
        error_con[key] = 0
        error_html[key] = 0

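    # os.listdir avoids shelling out to "ls" and handles file names containing spaces
    error_file = os.listdir("./input")
    for name in error_file:
        # "with" closes each log file; log_file avoids shadowing the built-in input()
        with open('./input/' + name, 'r') as log_file:
            for line in log_file:
                lines = line.split()
                if len(lines) < 3:
                    # Blank or malformed line: there is no time field to read
                    continue
                error_msg = line[line.find("-", 50) + 2:]
                hour = lines[2][0:2]
                if hour not in error_con:
                    continue

                if error_msg.find("get html failed") > -1:
                    error_con[hour] += 1
                elif error_msg.find("parse detail html failed") > -1:
                    # Each matching line counts as half an error
                    # (presumably each parse failure is logged twice)
                    error_html[hour] += 1 / 2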

    # Line chart
    plt.title("Plot of Error Hour Analyze(20190507-20190510)")
    plt.xlabel("Hour")
    plt.ylabel("Error Count")

    # Convert the dict views to lists before plotting
    hours = list(error_con.keys())
    plt.plot(hours, list(error_con.values()), color="r", linestyle="-", marker="^", linewidth=1, label="connect")
    plt.plot(hours, list(error_html.values()), color="b", linestyle="-", marker="s", linewidth=1,
             label="parse html")
    plt.legend(loc='upper left', bbox_to_anchor=(0.55, 0.95))

    plt.show()
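
The per-hour counting step can also be tried on its own, without log files or matplotlib. The sketch below assumes a log line whose third whitespace-separated field is a time starting with `HH`, which is what `lines[2][0:2]` in the script implies. The sample lines and the function name `count_errors_by_hour` are made up for illustration, and the real log format may differ.

```python
from collections import Counter


def count_errors_by_hour(lines, keyword):
    """Count the lines that contain keyword, grouped by two-digit hour."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Assumed layout: level, date, time (HH:MM:SS), then the message
        if len(fields) < 3 or keyword not in line:
            continue
        hour = fields[2][0:2]
        # Ignore anything that is not a valid hour of the day
        if len(hour) == 2 and hour.isdigit() and int(hour) < 24:
            counts[hour] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical log lines in the assumed layout
    sample = [
        "ERROR 2019-05-07 04:12:33 crawler - get html failed",
        "ERROR 2019-05-07 04:40:01 crawler - get html failed",
        "ERROR 2019-05-07 08:03:10 crawler - parse detail html failed",
    ]
    print(count_errors_by_hour(sample, "get html failed"))
```

A `Counter` only holds hours that actually occur, so the 24 zero-filled keys from the script are still needed if every hour should appear on the x-axis.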

Output line chart

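Connection failures clearly outnumber parse failures: when the connection fails, the first page cannot be reached, so no HTML parsing happens afterwards.
When the connection is fine, parse failures are rare: across the 4 days of logs combined, at most 2 occur in any one hour.
The two lines follow roughly the same trend, as expected.
The chart shows 3 peaks, at 4 a.m., 8 a.m. and 9 p.m. My guess is that the remote server restarts periodically. Later I may add a time filter to the crawler so that it does not run at night, to flatten the peaks.
There are only 4 days of logs so far. After the crawler has run for longer, I will collect logs over a longer period and check whether the errors correlate with the day of the week, the day of the month, or the month.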
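
The time filter mentioned above could be as simple as the check below. This is only a sketch: the function name `in_quiet_window` is hypothetical, and the 21:00 to 05:00 window in the example is an assumption chosen to cover the evening and early-morning peaks, not a value taken from the logs.

```python
from datetime import datetime, time as dtime


def in_quiet_window(now, start, end):
    """Return True if now falls inside [start, end).

    Windows that cross midnight (start later than end) are supported.
    """
    t = now.time()
    if start <= end:
        return start <= t < end
    # Window wraps past midnight, e.g. 21:00 to 05:00
    return t >= start or t < end


if __name__ == "__main__":
    # Assumed quiet window: 21:00 to 05:00
    if in_quiet_window(datetime.now(), dtime(21, 0), dtime(5, 0)):
        print("inside the quiet window, skipping this run")
    else:
        print("outside the quiet window, crawling")
```

Calling this at the top of the crawler's main block and exiting early keeps the scheduling in whatever tool already launches the script.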

Original article: https://www.cnblogs.com/wanli002/p/10850384.html

Time: 2024-10-12 04:32:44
