import urllib2
response = urllib2.urlopen('http://www.baidu.com')
url = 'http://www.baidu.com'  # the address to request
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
request = urllib2.Request(url, headers={'User-Agent': user_agent})
response = urllib2.urlopen(request)
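The Request class takes five parameters: url, data, headers, origin_req_host, and unverifiable.
- url needs little explanation: it is the address we want to request.
- data is the extra data to submit to the server. It can be None when there is nothing to send. If data is supplied, the request becomes a POST. The data must be encoded in the standard application/x-www-form-urlencoded format (for example with urllib.urlencode) before it is passed to the Request object.
- headers is a dictionary of request headers. It tells the server about the request: browser and operating system information, cookies, the accepted response format, caching, whether compression is supported, and so on. Some sites with anti-crawler measures inspect these headers, so we have to present ourselves as a browser instead of sending a bare request, as the User-Agent in the code above does.
- origin_req_host is the request-host of the origin transaction, as defined by RFC 2965. It defaults to cookielib.request_host(self). This is the host name or IP address of the original request that the user initiated. For example, if the request is for an image in an HTML document, this should be the request-host of the request for the page that contains the image.
- unverifiable indicates whether the request is unverifiable, which is also defined by RFC 2965. It defaults to False. An unverifiable request is one whose URL the user had no option to approve, for example an image that is fetched automatically while a page loads.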
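To make the data and headers parameters concrete, here is a minimal sketch that builds a POST request without sending it. The endpoint and the form fields are made up for illustration, and the `except ImportError` branch is my own addition so the snippet also runs on Python 3, where the same classes live in urllib.request and urllib.parse.

```python
try:
    # Python 2, as used in this post
    import urllib2
    from urllib import urlencode
except ImportError:
    # Python 3 fallback (an assumption, not part of the original post)
    import urllib.request as urllib2
    from urllib.parse import urlencode

# Hypothetical endpoint and form fields, for illustration only.
url = 'http://www.example.com/login'
form = {'username': 'alice', 'page': '1'}

# urlencode produces the application/x-www-form-urlencoded format.
# The body is encoded to bytes because Python 3 requires it.
data = urlencode(form).encode('ascii')

request = urllib2.Request(
    url,
    data=data,
    headers={'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'},
)

# Nothing has been sent yet; the Request object only describes the call.
method = request.get_method()             # 'POST', because data is not None
agent = request.get_header('User-agent')  # header names are stored capitalized
```

Passing this object to urllib2.urlopen(request) would actually send it. Leaving data out turns the same request back into a GET.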
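try:
    response = urllib2.urlopen('http://www.baidu.com')
except urllib2.HTTPError as e:
    # the server answered, but with an error status code
    print e.code
    print e.reason
except urllib2.URLError as e:
    # no usable response at all (DNS failure, refused connection, ...)
    print e.reason
else:
    html = response.read()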
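The order of the except clauses above matters, because HTTPError is a subclass of URLError. The sketch below wraps the same pattern in a helper so the two cases can be tried offline. The `fetch_status` name, its `opener` parameter, and the two fake openers are all hypothetical, and the Python 3 import fallback is my own addition.

```python
try:
    import urllib2                      # Python 2, as in this post
except ImportError:
    import urllib.request as urllib2    # Python 3 fallback (an assumption)


def fetch_status(url, opener=urllib2.urlopen):
    """Return (ok, detail). `opener` is a hypothetical hook for offline testing."""
    try:
        response = opener(url)
    except urllib2.HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError.
        return False, 'HTTP %s' % e.code
    except urllib2.URLError as e:
        return False, 'unreachable: %s' % e.reason
    else:
        return True, response.read()


# Stand-in openers that fail on purpose, so nothing touches the network.
def fake_404(url):
    raise urllib2.HTTPError(url, 404, 'Not Found', {}, None)


def fake_down(url):
    raise urllib2.URLError('no route to host')


http_result = fetch_status('http://www.baidu.com', fake_404)
url_result = fetch_status('http://www.baidu.com', fake_down)
```

If the URLError clause came first, it would also catch every HTTPError and the status code branch would never run.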
# Table mapping response codes to messages; entries have the
# form {code: (shortmessage, longmessage)}.
responses = {
    100: ('Continue', 'Request received, please continue'),
    101: ('Switching Protocols',
          'Switching to new protocol; obey Upgrade header'),

    200: ('OK', 'Request fulfilled, document follows'),
    201: ('Created', 'Document created, URL follows'),
    202: ('Accepted',
          'Request accepted, processing continues off-line'),
    203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
    204: ('No Content', 'Request fulfilled, nothing follows'),
    205: ('Reset Content', 'Clear input form for further input.'),
    206: ('Partial Content', 'Partial content follows.'),

    300: ('Multiple Choices',
          'Object has several resources -- see URI list'),
    301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
    302: ('Found', 'Object moved temporarily -- see URI list'),
    303: ('See Other', 'Object moved -- see Method and URL list'),
    304: ('Not Modified',
          'Document has not changed since given time'),
    305: ('Use Proxy',
          'You must use proxy specified in Location to access this '
          'resource.'),
    307: ('Temporary Redirect',
          'Object moved temporarily -- see URI list'),

    400: ('Bad Request',
          'Bad request syntax or unsupported method'),
    401: ('Unauthorized',
          'No permission -- see authorization schemes'),
    402: ('Payment Required',
          'No payment -- see charging schemes'),
    403: ('Forbidden',
          'Request forbidden -- authorization will not help'),
    404: ('Not Found', 'Nothing matches the given URI'),
    405: ('Method Not Allowed',
          'Specified method is invalid for this server.'),
    406: ('Not Acceptable', 'URI not available in preferred format.'),
    407: ('Proxy Authentication Required', 'You must authenticate with '
          'this proxy before proceeding.'),
    408: ('Request Timeout', 'Request timed out; try again later.'),
    409: ('Conflict', 'Request conflict.'),
    410: ('Gone',
          'URI no longer exists and has been permanently removed.'),
    411: ('Length Required', 'Client must specify Content-Length.'),
    412: ('Precondition Failed', 'Precondition in headers is false.'),
    413: ('Request Entity Too Large', 'Entity is too large.'),
    414: ('Request-URI Too Long', 'URI is too long.'),
    415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
    416: ('Requested Range Not Satisfiable',
          'Cannot satisfy request range.'),
    417: ('Expectation Failed',
          'Expect condition could not be satisfied.'),

    500: ('Internal Server Error', 'Server got itself in trouble'),
    501: ('Not Implemented',
          'Server does not support this operation'),
    502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
    503: ('Service Unavailable',
          'The server cannot process the request due to a high load'),
    504: ('Gateway Timeout',
          'The gateway server did not receive a timely response'),
    505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
}
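In Python 2 this table is available as BaseHTTPServer.BaseHTTPRequestHandler.responses, so it can be queried directly when an HTTPError hands you a bare status code. The helper below is a small sketch of that lookup. The `describe_status` name and the fallback text are my own, and the Python 3 import (http.server) is an added assumption so the snippet runs on either version.

```python
try:
    from BaseHTTPServer import BaseHTTPRequestHandler   # Python 2
except ImportError:
    from http.server import BaseHTTPRequestHandler      # Python 3 fallback


def describe_status(code):
    """Return 'code short: long' using the standard library's table."""
    short, long_msg = BaseHTTPRequestHandler.responses.get(
        code, ('Unknown', 'Code not in table'))  # fallback text is made up
    return '%d %s: %s' % (code, short, long_msg)


not_found = describe_status(404)
unknown = describe_status(799)
```

The exact wording of the long messages may differ between Python versions, so it is safer to rely on the code and the short message only.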
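requests is built on urllib3 and offers everything urllib2 does. On top of that, Requests supports HTTP keep-alive and connection pooling, sessions with persistent cookies, file uploads, automatic detection of the response encoding, internationalized URLs, and automatic encoding of POST data.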
import requests
response = requests.get('http://www.baidu.com')
print response.text
response = requests.post('http://api.baidu.com', data={'data': 'value'})
response = requests.get('http://www.baidu.com', headers={'User-Agent': user_agent})
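To see what requests would put on the wire for the POST and the custom header above, you can prepare a request without sending it. This is only a sketch: it assumes the requests package is installed and reuses the URL, form data, and User-Agent string from the snippets above.

```python
import requests

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'

# Describe the request, then prepare it; nothing is sent over the network.
req = requests.Request(
    'POST',
    'http://api.baidu.com',
    data={'data': 'value'},
    headers={'User-Agent': user_agent},
)
prepared = req.prepare()

method = prepared.method                        # the HTTP verb
body = prepared.body                            # the form-encoded payload
content_type = prepared.headers['Content-Type']
sent_agent = prepared.headers['User-Agent']
```

Unlike urllib2, there is no need to call urlencode yourself: passing a dictionary as data is enough, and requests sets the matching Content-Type header.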
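r.status_code  # HTTP status code of the response
r.raw  # the raw response body, i.e. the underlying urllib3 response object; send the request with stream=True and read it with r.raw.read()
r.content  # the response body as bytes; gzip and deflate compression are decoded for you automatically
r.text  # the response body as a string, decoded automatically using the character encoding from the response headers
r.headers  # the server's response headers as a dictionary; it is a special one, since its keys are case-insensitive, and r.headers.get(key) returns None for a missing key
r.json()  # the JSON decoder built into Requests
r.raise_for_status()  # raises an exception for a failed request (a 4xx or 5xx response)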
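Two of the points above are easy to check offline: the case-insensitive headers and the exact conditions under which raise_for_status() raises. The sketch below assumes the requests package is installed. It builds a Response object by hand purely for illustration (a real one comes back from requests.get), and the `raises_for` helper is a hypothetical name.

```python
import requests
from requests.structures import CaseInsensitiveDict

# r.headers is a CaseInsensitiveDict; the same class can be used directly.
headers = CaseInsensitiveDict({'Content-Type': 'text/html; charset=utf-8'})
same_value = headers['content-type'] == headers['CONTENT-TYPE']
missing = headers.get('X-Missing')   # .get() returns None for a missing key


def raises_for(status_code):
    """Return True if raise_for_status() raises for this status code."""
    response = requests.Response()       # hand-built, no network involved
    response.status_code = status_code
    try:
        response.raise_for_status()
    except requests.HTTPError:
        return True
    return False


results = {code: raises_for(code) for code in (200, 302, 404, 500)}
```

Only the 4xx and 5xx codes raise here, while a 302 redirect passes silently. Note also that indexing with headers['X-Missing'] raises KeyError; only .get() gives you None.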