My very first scraper ran into a tough nut right away. Maybe I didn't set up the HTTP headers properly, or maybe I was hitting the site too many times too quickly, but either way I couldn't loop through and scrape the girl pics on jandan.net....
Still, I got to be a little naughty for once... technology itself is innocent~~~
The few pages of photos I did manage to pull down were quite pleasing to the eye~
import re
import urllib.request

k = 1  # page counter

def read_url(url, k):
    # Spoof a desktop browser User-Agent and a Referer so jandan.net serves the page
    user_agent = ('Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36')
    headers = {
        'User-Agent': user_agent,
        'Referer': 'http://jandan.net/ooxx/page-2406',
        # 'Accept': 'image/webp,image/*,*/*;q=0.8',
        # 'Accept-Encoding': 'gzip, deflate, sdch',
        # 'Accept-Language': 'zh-CN,zh;q=0.8',
        # 'Connection': 'close',
    }
    r = urllib.request.Request(url, headers=headers)
    req = urllib.request.urlopen(r)
    image_d(req.read().decode('utf-8'), k)

def image_d(data, k):
    print('Scraping images from page %d' % k)
    dirct = r'C:\Users\eexf\Desktop\jiandan'  # raw string; folder must already exist
    # The page embeds images as <img src="//..." /></p> with protocol-relative URLs
    pattern = re.compile('<img src="(.*?)" /></p>')
    res = re.findall(pattern, data)
    for i in res:
        j = 'http:' + i                 # prepend the missing scheme
        data1 = urllib.request.urlopen(j).read()
        fname = j.split('/')[-1]        # last path segment as the file name
        path = dirct + '/' + fname
        with open(path, 'wb') as f:
            f.write(data1)
    print('Done scraping')

if __name__ == '__main__':
    url = 'http://jandan.net/ooxx/page-2406'  # + str(k) to page through
    read_url(url, k)
    k += 1
    # time.sleep(3)
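The commented-out bits (# + str(k), # time.sleep(3)) hint at the loop over pages that never quite worked. A minimal sketch of how the main block might have looked, reusing read_url from above (the page range here is made up for illustration):

import time

if __name__ == '__main__':
    # Hypothetical page range; jandan.net numbers its ooxx pages sequentially
    for k in range(2406, 2400, -1):
        url = 'http://jandan.net/ooxx/page-' + str(k)
        read_url(url, k)
        time.sleep(3)  # pause between pages to avoid tripping rate limiting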
The basic structure is: request the page's HTML source --> match the image URLs with a regular expression, which returns a list --> download the content behind every URL in the list and write it all into one folder.
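To illustrate the middle step, here is what the regex from the script does to a fragment of page source. The HTML snippet below is made up, but it mirrors the <img ... /></p> markup the pattern targets:

import re

html = ('<p><img src="//ww3.sinaimg.cn/mw600/abc123.jpg" /></p>'
        '<p><img src="//ww2.sinaimg.cn/mw600/def456.jpg" /></p>')

pattern = re.compile('<img src="(.*?)" /></p>')
urls = re.findall(pattern, html)
print(urls)
# ['//ww3.sinaimg.cn/mw600/abc123.jpg', '//ww2.sinaimg.cn/mw600/def456.jpg']

Each captured URL is protocol-relative, which is why the script prepends 'http:' before downloading.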
The code is messy; this first scraper is kept here purely as a keepsake~~~
And the promised bonus pics: