Scraping the Boruto short reviews on Bilibili

Each page holds 20 short reviews, and there are more than 1,000 pages.

The code is as follows:
```python
import re
import csv
import json
import requests

def main(start_url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/73.0.3683.75 Safari/537.36',
    }
    res = requests.get(url=start_url, headers=headers).content.decode()
    data = json.loads(res)
    try:
        data = data['result']['list']
    except KeyError:
        print('-----------')
        return
    # The API response embeds the cursor of the next page.
    cursor = re.findall(r'"cursor":"(\d+)",', res)
    for i in data:
        mid = i['author']['mid']
        uname = i['author']['uname']
        content = i['content'].strip()
        try:
            last_index_show = i['user_season']['last_index_show']
        except KeyError:
            last_index_show = None
        print(mid, uname, content, last_index_show)
        print('------------------------')
        with open('borenzhuan_duanping.csv', 'a', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow([mid, uname, content, last_index_show])
    if cursor:
        next_url = ('https://bangumi.bilibili.com/review/web_api/short/list'
                    '?media_id={}&folded=0&page_size=20&sort=0'
                    '&cursor='.format(media_id) + cursor[0])
        main(next_url)  # recurse into the next page
    else:
        print('Done scraping')

if __name__ == '__main__':
    zhuye_url = 'https://www.bilibili.com/bangumi/media/md5978/'
    media_id = re.findall(r'md(\d+)', zhuye_url)[0]
    start_url = ('https://bangumi.bilibili.com/review/web_api/short/list'
                 '?media_id={}&folded=0&page_size=20&sort=0'
                 '&cursor='.format(media_id))
    main(start_url)
```

(Compared with the original listing, the missing `import re` has been added, the curly quotes replaced with straight quotes, the duplicated `&sort=0` parameter in the next-page URL removed, the bare `except` clauses narrowed to `KeyError`, and the module-level `id` renamed to `media_id` so it no longer shadows the built-in.)
While scraping, an exception is raised every time the recursion reaches a depth of about 999:
RecursionError: maximum recursion depth exceeded in comparison
The exception occurs when the function recurses into itself: each page fetched adds one frame to the call stack, and CPython's default recursion limit is 1000.
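A minimal toy reproduction of the crash (the `recurse` function below is illustrative only, not part of the crawler; each call stands in for one page fetch):

```python
import sys

def recurse(n):
    # Like main() calling main(next_url), the call depth grows
    # by one on every step until the interpreter's limit is hit.
    return recurse(n + 1)

print(sys.getrecursionlimit())  # 1000 by default in CPython

try:
    recurse(0)
except RecursionError as e:
    print(e)  # "maximum recursion depth exceeded ..."
```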
The quick fix is to add the following at the top of the program:
```python
import sys
sys.setrecursionlimit(100000)
```
This raises Python's recursion limit so the crawl can continue past page 999. Note that a very high limit only trades the Python-level `RecursionError` for the risk of overflowing the C stack; rewriting the recursion as a loop avoids the problem entirely.
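Since each page only needs the cursor of the next one, the page-to-page recursion can be flattened into a `while` loop. A minimal sketch, where `fetch_page` and the `pages` dict are stand-ins that simulate the paginated API (they are assumptions for illustration, not Bilibili's real interface):

```python
def fetch_page(cursor, pages):
    """Stand-in for the HTTP request: returns (reviews, next_cursor)."""
    reviews, next_cursor = pages[cursor]
    return reviews, next_cursor

def crawl(pages, start_cursor=0):
    all_reviews = []
    cursor = start_cursor
    # Loop instead of main(next_url) calling itself: constant stack depth,
    # no matter how many pages there are.
    while cursor is not None:
        reviews, cursor = fetch_page(cursor, pages)
        all_reviews.extend(reviews)
    return all_reviews

# Simulate 5,000 pages of 20 reviews each; a recursive version would
# exceed the default recursion limit long before the last page.
pages = {i: (list(range(i * 20, i * 20 + 20)),
             i + 1 if i + 1 < 5000 else None)
         for i in range(5000)}
print(len(crawl(pages)))  # 100000
```

The same restructuring applies to the real crawler: move the `cursor` extraction and URL construction inside the loop body and replace the recursive `main(next_url)` call with a cursor update.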
Original post: https://www.cnblogs.com/zengxm/p/10972537.html
Date: 2024-10-22 07:19:18