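The images on the page are loaded by a JS onload event, so the HTML source contains only a placeholder image and an encoded hash: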
<p><img src="//img.jandan.net/img/blank.gif" /><span class="img-hash">Ly93eDEuc2luYWltZy5jbi9tdzYwMC8wMDd1ejNLN2x5MWZ6NmVub3ExdHhqMzB1MDB1MGFkMC5qcGc=</span></p>
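As a minimal sketch of how the hash can be pulled out of markup like the snippet above, the following uses only the standard library's html.parser. The class name ImgHashParser and the helper extract_img_hashes are hypothetical names introduced for illustration; the post itself uses lxml and XPath for this step.

```python
from html.parser import HTMLParser


class ImgHashParser(HTMLParser):
    """Collects the text of every <span class="img-hash"> element."""

    def __init__(self):
        super().__init__()
        self.hashes = []
        self._in_hash_span = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "span" and ("class", "img-hash") in attrs:
            self._in_hash_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_hash_span = False

    def handle_data(self, data):
        if self._in_hash_span:
            self.hashes.append(data.strip())


def extract_img_hashes(html):
    """Return the list of img-hash strings found in an HTML fragment."""
    parser = ImgHashParser()
    parser.feed(html)
    return parser.hashes


sample = (
    '<p><img src="//img.jandan.net/img/blank.gif" />'
    '<span class="img-hash">Ly93eDEuc2luYWltZy5jbi9tdzYwMC8wMDd1ejNLN2x5MWZ6'
    'NmVub3ExdHhqMzB1MDB1MGFkMC5qcGc=</span></p>'
)
print(extract_img_hashes(sample))
```

The placeholder blank.gif is ignored here, since the real image location only exists inside the span text.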
In the browser's Sources panel, find the corresponding JS function, jandan_load_img.
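Stepping through it with the JS debugger shows that the hash Ly93eDEuc2luYWltZy5jbi9tdzYwMC8wMDd1ejNLN2x5MWZ6NmVub3ExdHhqMzB1MDB1MGFkMC5qcGc= is passed to the function jdugRtgCtw78dflFjGXBvN6TBHAoKvZ7xu, which runs base64_decode on it to produce the img path.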
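The decoding step can be reproduced outside the browser. This is a small sketch, assuming (as the debugger session suggests) that the hash is plain Base64 with no extra transformation; decode_img_hash is a hypothetical helper name.

```python
import base64


def decode_img_hash(img_hash):
    """Base64-decode an img-hash string into a protocol-relative image path."""
    return base64.b64decode(img_hash).decode("utf-8")


img_hash = (
    "Ly93eDEuc2luYWltZy5jbi9tdzYwMC8wMDd1ejNLN2x5MWZ6"
    "NmVub3ExdHhqMzB1MDB1MGFkMC5qcGc="
)
img_path = decode_img_hash(img_hash)
print(img_path)           # path starts with // and has no scheme
print("http:" + img_path)  # prepend a scheme before requesting it
```

The decoded path begins with // (protocol-relative), which is why a scheme has to be prepended before the image can be fetched.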
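Then use a regular expression to replace the size segment of the img path (mw\d+, for example mw600) with large, which points to the full-size image.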
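The size replacement can be sketched with re.sub. The helper name to_large is hypothetical, and the pattern here is anchored to the surrounding slashes as a slightly stricter variant of the mw\d+ pattern used in the scraping code; that anchoring is an assumption about the URL layout, not something the site documents.

```python
import re


def to_large(img_path):
    """Replace the /mw<digits>/ size segment of an image path with /large/."""
    # count=1 so only the first size segment is touched
    return re.sub(r"/mw\d+/", "/large/", img_path, count=1)


print(to_large("//wx1.sinaimg.cn/mw600/example.jpg"))
```

Paths that contain no mw segment are returned unchanged, so the helper can be applied to every decoded path.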
The scraping code is as follows:
import base64
import os
import re
from concurrent.futures import ThreadPoolExecutor
from random import choice

import requests
from lxml import etree

# Local module (not shown in the post) that provides a list of User-Agent strings
from user_agent_list import USER_AGENTS

headers = {'user-agent': choice(USER_AGENTS)}


def fetch_url(url):
    '''
    :param url: page URL
    :return: html text, or None if the request fails
    '''
    try:
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        if r.status_code in [200, 201]:
            return r.text
    except Exception as e:
        print(e)


def downloadone(url):
    html = fetch_url(url)
    if not html:  # skip pages that could not be fetched
        return
    data = etree.HTML(html)
    img_hash_list = data.xpath('//*[@class="img-hash"]/text()')
    for img_hash in img_hash_list:
        # Base64-decode the hash into a protocol-relative path and add a scheme
        img_path = 'http:' + bytes.decode(base64.b64decode(img_hash))
        # Swap the size segment (e.g. mw600) for the full-size version
        img_path = re.sub(r'mw\d+', 'large', img_path)
        img_name = img_path.rsplit('/', 1)[1]
        with open('jiandan/' + img_name, 'wb') as f:
            r = requests.get(img_path)
            f.write(r.content)


def main():
    os.makedirs('jiandan', exist_ok=True)  # make sure the output directory exists
    url_list = []
    for page in range(1, 44):
        url = 'http://jandan.net/ooxx/page-{}'.format(page)
        url_list.append(url)
    # Download the pages with 4 worker threads
    with ThreadPoolExecutor(4) as executor:
        executor.map(downloadone, url_list)


if __name__ == '__main__':
    main()
Original article: https://www.cnblogs.com/frank-shen/p/10269363.html
Time: 2024-10-19 13:55:33