Examples of a crawler simulating a browser:
Method 1: modify the request headers with build_opener()
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# __author__ = "life"
import urllib.request

url = 'http://blog.csdn.net/justloveyou_/article/details/69611328'
# file = urllib.request.urlopen(url)
headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) "
           "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36")
opener = urllib.request.build_opener()
opener.addheaders = [headers]  # a list of (header-name, value) tuples
data = opener.open(url).read()
with open('3.html', 'wb') as fhandle:
    fhandle.write(data)
Method 2: add headers with add_header()
import urllib.request

url = 'http://blog.csdn.net/justloveyou_/article/details/69611328'
req = urllib.request.Request(url)
req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) "
               "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36")
data = urllib.request.urlopen(req).read()
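The header set with add_header() can be verified without any network access: a Request object stores it and hands it back through get_header(). A minimal sketch (the example.com URL is just a placeholder; note that urllib normalizes header-name capitalization internally):

```python
import urllib.request

req = urllib.request.Request('http://www.example.com')
req.add_header('User-Agent', 'Mozilla/5.0')

# urllib stores the header name capitalized ('User-agent') internally
print(req.has_header('User-agent'))  # True
print(req.get_header('User-agent'))  # Mozilla/5.0
```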
Timeout setting:
Set the timeout value to 1 second:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# __author__ = "life"
# Email: [email protected]
# Date: 17-4-8
import urllib.request

for i in range(1, 100):
    try:
        file = urllib.request.urlopen("http://yum.iqianyue.com", timeout=1)
        data = file.read()
        print(len(data))
    except Exception as e:
        print('Exception occurred --->> ' + str(e))
HTTP request methods
GET request: passes information to the server through the URL
POST request: submits data to the server for storage; a mainstream and relatively safe way to transfer data, commonly used for logins
PUT request: asks the server to store a resource, usually at a specified location
DELETE request: asks the server to delete a resource
HEAD request: requests only the corresponding HTTP response headers
OPTIONS request: queries which request methods the current URL supports
TRACE request: mainly used for testing and diagnostics
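The less common methods in this list can also be expressed with urllib: since Python 3.3, Request accepts a method argument. A minimal sketch (requests are only constructed here, nothing is sent; example.com is a placeholder):

```python
import urllib.request

# Build requests with explicit HTTP methods; no network traffic occurs.
head_req = urllib.request.Request('http://www.example.com', method='HEAD')
delete_req = urllib.request.Request('http://www.example.com', method='DELETE')

print(head_req.get_method())    # HEAD
print(delete_req.get_method())  # DELETE
```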
1. GET request example:
Baidu search uses GET requests, with URLs of the form https://www.baidu.com/s?wd=hello. Based on this structure we can construct a GET request so the crawler queries Baidu for a keyword automatically.
Querying Baidu for the keyword hello, for example:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# __author__ = "life"
# Email: [email protected]
# Date: 17-4-8
import urllib.request

keyword = 'hello'
url = 'http://www.baidu.com/s?wd=' + keyword
req = urllib.request.Request(url)
data = urllib.request.urlopen(req).read()
with open('r.html', 'wb') as fhandle:
    fhandle.write(data)

The code above still has a problem: it raises an error when the search keyword contains Chinese characters. An improved version:
import urllib.request

keyword = '刘东'
key_encode = urllib.request.quote(keyword)  # percent-encode the Chinese keyword
url = 'http://www.baidu.com/s?wd='
url_all = url + key_encode
req = urllib.request.Request(url_all)
data = urllib.request.urlopen(req).read()
with open('r.html', 'wb') as fhandle:
    fhandle.write(data)
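The encoding step can be checked without hitting Baidu at all: quote() (canonically urllib.parse.quote, re-exported by urllib.request) percent-encodes the UTF-8 bytes of the keyword, and unquote() reverses it:

```python
from urllib.parse import quote, unquote

keyword = '刘东'
encoded = quote(keyword)   # each UTF-8 byte becomes %XX
print(encoded)             # %E5%88%98%E4%B8%9C
print(unquote(encoded))    # 刘东
```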
2. POST request example:
Test URL: http://www.iqianyue.com/mypost
import urllib.request
import urllib.parse

url = "http://www.iqianyue.com/mypost/"
# Encode the form fields with urlencode(), then convert to utf-8 bytes with encode()
postdata = urllib.parse.urlencode({
    "name": "[email protected]",
    "pass": "aA123456"
}).encode('utf-8')
req = urllib.request.Request(url, postdata)
req.add_header('User-Agent', "Mozilla/5.0 (Windows NT 10.0; WOW64) "
               "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36")
data = urllib.request.urlopen(req).read()
with open('6.html', 'wb') as fhandle:
    fhandle.write(data)
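Two details of this snippet can be verified offline: what urlencode() actually produces, and the fact that passing a data argument is what turns the request into a POST. A sketch with hypothetical field values:

```python
import urllib.parse
import urllib.request

# urlencode() joins key=value pairs with '&'; encode() yields utf-8 bytes
postdata = urllib.parse.urlencode({'name': 'abc', 'pass': '123'}).encode('utf-8')
print(postdata)  # b'name=abc&pass=123'

# Supplying data switches the method from GET to POST automatically
req = urllib.request.Request('http://www.example.com', postdata)
print(req.get_method())  # POST
```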
3. Proxy server setup
Proxy server addresses can be found at: http://yum.iqianyue.com/proxy
Example of crawling a site through a proxy server:
import urllib.request

def use_proxy(proxy_addr, url):
    proxy = urllib.request.ProxyHandler({'http': proxy_addr})
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)  # install globally so urlopen() uses the proxy
    data = urllib.request.urlopen(url).read().decode('utf-8')
    return data

proxy_addr = "110.73.1.58:8123"
data = use_proxy(proxy_addr, 'http://www.baidu.com')
print(len(data))
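Note that install_opener() changes every subsequent urlopen() call in the process. An alternative sketch keeps the proxy local to one opener by calling opener.open() directly; the public sample proxy above may well be dead by now, so the actual network call is left commented out:

```python
import urllib.request

def make_proxy_opener(proxy_addr):
    # Route both http and https traffic through the proxy,
    # without installing the opener globally.
    proxy = urllib.request.ProxyHandler({'http': proxy_addr, 'https': proxy_addr})
    return urllib.request.build_opener(proxy)

opener = make_proxy_opener('110.73.1.58:8123')
# data = opener.open('http://www.baidu.com', timeout=5).read()  # needs a live proxy
print(type(opener).__name__)  # OpenerDirector
```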
Time: 2024-11-05 12:11:26