A Brief Introduction to urllib2
Reference: http://www.voidspace.org.uk/python/articles/urllib2.shtml
Fetching URLs
The simplest way to use urllib2 is as follows:
1、
import urllib2
# urlopen accepts a URL string and returns a file-like response object
response = urllib2.urlopen('http://python.org/')
html = response.read()
2、
import urllib2
# Wrap the URL in a Request object, then pass that object to urlopen
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
the_page = response.read()
3、
# The same Request interface works for other URL schemes, such as FTP
req = urllib2.Request('ftp://example.com/')
4、
import urllib
import urllib2
url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
# Encode the form fields; passing data to Request turns it into a POST
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()
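To see that supplying data is what switches the request from GET to POST, here is a minimal sketch that inspects the Request object without sending anything. The try/except import is an assumption added so the snippet also runs on Python 3, where urllib2 became urllib.request; the reduced values dict is illustrative only.

```python
try:  # Python 2
    from urllib import urlencode
    from urllib2 import Request
except ImportError:  # Python 3 (assumption: same API under urllib.request)
    from urllib.parse import urlencode
    from urllib.request import Request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name': 'Michael Foord', 'language': 'Python'}

# Python 3 requires bytes for the request body; on Python 2 this is a no-op
data = urlencode(values).encode('ascii')

get_req = Request(url)         # no data: will be sent as GET
post_req = Request(url, data)  # data supplied: will be sent as POST

print(get_req.get_method())
print(post_req.get_method())
```

Nothing is sent over the network here; get_method() only reports which HTTP method urlopen would use.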
5、
Data can also be passed in an HTTP GET request by encoding it in the URL itself.
>>> import urllib2
>>> import urllib
>>> data = {}
>>> data['name'] = 'Somebody Here'
>>> data['location'] = 'Northampton'
>>> data['language'] = 'Python'
>>> url_values = urllib.urlencode(data)
>>> print url_values  # the order of the fields may differ
name=Somebody+Here&language=Python&location=Northampton
>>> url = 'http://www.example.com/example.cgi'
>>> full_url = url + '?' + url_values
>>> data = urllib2.urlopen(full_url)
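The encoding step can be checked by decoding the query string again. The sketch below is only an illustration: it passes a list of pairs to urlencode so the field order is fixed, and the parse_qs/urlsplit round trip is an addition that is not part of the original example. The fallback import for Python 3 is likewise an assumption.

```python
try:  # Python 2
    from urllib import urlencode
    from urlparse import urlsplit, parse_qs
except ImportError:  # Python 3
    from urllib.parse import urlencode, urlsplit, parse_qs

# A list of (key, value) pairs keeps the order of the fields stable
params = [('name', 'Somebody Here'),
          ('location', 'Northampton'),
          ('language', 'Python')]

query = urlencode(params)
full_url = 'http://www.example.com/example.cgi' + '?' + query
print(full_url)

# Decode the query string again to confirm nothing was lost
decoded = parse_qs(urlsplit(full_url).query)
print(decoded['name'])
```

Spaces are encoded as + and restored by parse_qs, which returns a list of values for each key.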
6、
import urllib
import urllib2
url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
# Extra HTTP headers are passed as a dict in the third argument
headers = {'User-Agent': user_agent}
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
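Headers attached to a Request can be inspected before anything is sent. A small sketch, assuming the Python 3 fallback import; the Accept-Language header is a hypothetical extra added only to show add_header. Note that Request stores header names capitalized, so User-Agent is looked up as User-agent.

```python
try:  # Python 2
    from urllib2 import Request
except ImportError:  # Python 3
    from urllib.request import Request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'

req = Request(url, headers={'User-Agent': user_agent})
req.add_header('Accept-Language', 'en')  # hypothetical extra header

# Header names are normalized with str.capitalize() when stored
print(req.has_header('User-agent'))
print(req.get_header('User-agent'))
print(sorted(req.header_items()))
```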
7、Handling Exceptions
1)URLError
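>>> import urllib2
>>> req = urllib2.Request('http://www.jianshu.com/p/5c7a1af4aa531')
>>> try:
...     urllib2.urlopen(req)
... except urllib2.URLError as e:
...     if hasattr(e, 'code'):
...         print e, e.code   # HTTPError: the error itself and its HTTP status code
...     else:
...         print e.reason    # plain URLError: why the server could not be reached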
2)HTTPError
HTTPError is the subclass of URLError raised in the specific case of HTTP URLs.
Every HTTP response from the server contains a numeric "status code". Sometimes the status code indicates that the server is unable to fulfil the request; in that case urlopen raises an HTTPError whose code attribute holds the status code.
3)Error Codes
100: ('Continue', 'Request received, please continue'),
101: ('Switching Protocols',
      'Switching to new protocol; obey Upgrade header'),
200: ('OK', 'Request fulfilled, document follows'),
201: ('Created', 'Document created, URL follows'),
202: ('Accepted',
      'Request accepted, processing continues off-line'),
203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
204: ('No Content', 'Request fulfilled, nothing follows'),
205: ('Reset Content', 'Clear input form for further input.'),
206: ('Partial Content', 'Partial content follows.'),
300: ('Multiple Choices',
      'Object has several resources -- see URI list'),
301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
302: ('Found', 'Object moved temporarily -- see URI list'),
303: ('See Other', 'Object moved -- see Method and URL list'),
304: ('Not Modified',
      'Document has not changed since given time'),
305: ('Use Proxy',
      'You must use proxy specified in Location to access this '
      'resource.'),
307: ('Temporary Redirect',
      'Object moved temporarily -- see URI list'),
400: ('Bad Request',
      'Bad request syntax or unsupported method'),
401: ('Unauthorized',
      'No permission -- see authorization schemes'),
402: ('Payment Required',
      'No payment -- see charging schemes'),
403: ('Forbidden',
      'Request forbidden -- authorization will not help'),
404: ('Not Found', 'Nothing matches the given URI'),
405: ('Method Not Allowed',
      'Specified method is invalid for this server.'),
406: ('Not Acceptable', 'URI not available in preferred format.'),
407: ('Proxy Authentication Required', 'You must authenticate with '
      'this proxy before proceeding.'),
408: ('Request Timeout', 'Request timed out; try again later.'),
409: ('Conflict', 'Request conflict.'),
410: ('Gone',
      'URI no longer exists and has been permanently removed.'),
411: ('Length Required', 'Client must specify Content-Length.'),
412: ('Precondition Failed', 'Precondition in headers is false.'),
413: ('Request Entity Too Large', 'Entity is too large.'),
414: ('Request-URI Too Long', 'URI is too long.'),
415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
416: ('Requested Range Not Satisfiable',
      'Cannot satisfy request range.'),
417: ('Expectation Failed',
      'Expect condition could not be satisfied.'),
500: ('Internal Server Error', 'Server got itself in trouble'),
501: ('Not Implemented',
      'Server does not support this operation'),
502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
503: ('Service Unavailable',
      'The server cannot process the request due to a high load'),
504: ('Gateway Timeout',
      'The gateway server did not receive a timely response'),
505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
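A table like this one is available at runtime as the responses mapping on BaseHTTPRequestHandler, so a status code can be turned into a readable phrase. The sketch below assumes that mapping is the source of the list above; describe_status is a hypothetical helper name, and the fallback import covers Python 3.

```python
try:  # Python 2
    from BaseHTTPServer import BaseHTTPRequestHandler
except ImportError:  # Python 3
    from http.server import BaseHTTPRequestHandler


def describe_status(code):
    """Return 'code phrase' for a known status code, else 'code Unknown'."""
    entry = BaseHTTPRequestHandler.responses.get(code)
    phrase = entry[0] if entry else 'Unknown'
    return '%d %s' % (code, phrase)


print(describe_status(404))
print(describe_status(503))
print(describe_status(799))  # not a registered status code
```

A helper like this is handy when logging e.code from an HTTPError.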
4)
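Example 1:
from urllib2 import Request, urlopen, URLError, HTTPError
req = Request(someurl)  # someurl is a placeholder for the URL to fetch
try:
    response = urlopen(req)
except HTTPError as e:
    print 'The server couldn\'t fulfill the request.'
    print 'Error code: ', e.code
except URLError as e:
    print 'We failed to reach a server.'
    print 'Reason: ', e.reason
else:
    # everything is fine
    pass
Note: HTTPError is a subclass of URLError, so except HTTPError must come first; otherwise except URLError would also catch every HTTPError.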
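The ordering rule can be checked without touching the network by raising the two exceptions by hand. This is a minimal sketch: describe is a hypothetical helper, the exception arguments are made up for illustration, and the fallback import is an assumption for Python 3.

```python
try:  # Python 2
    from urllib2 import HTTPError, URLError
except ImportError:  # Python 3
    from urllib.error import HTTPError, URLError


def describe(exc):
    """Raise exc and report which except clause handled it."""
    try:
        raise exc
    except HTTPError as e:   # the more specific class goes first
        return 'HTTP error %d' % e.code
    except URLError as e:    # also matches HTTPError, so it must come last
        return 'URL error: %s' % e.reason


print(issubclass(HTTPError, URLError))
# HTTPError(url, code, msg, hdrs, fp); hdrs and fp are left empty here
print(describe(HTTPError('http://example.com/missing', 404, 'Not Found', None, None)))
print(describe(URLError('connection refused')))
```

If the two except clauses were swapped, the 404 case would be reported as a URL error and the status code would be lost.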
Example 2:
from urllib2 import Request, urlopen, URLError
req = Request(someurl)  # someurl is a placeholder for the URL to fetch
try:
    response = urlopen(req)
except URLError as e:
    # Check for code first: an HTTPError may carry a reason attribute as well
    if hasattr(e, 'code'):
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
    elif hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
else:
    # everything is fine
    pass
Example 3:
from urllib2 import Request, urlopen
req = Request(someurl)  # someurl is a placeholder for the URL to fetch
try:
    response = urlopen(req)
except IOError as e:
    # Check for code first: an HTTPError may carry a reason attribute as well
    if hasattr(e, 'code'):
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
    elif hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
else:
    # everything is fine
    pass
Note: URLError is a subclass of IOError. In rare cases urlopen may raise socket.error instead.
8、info and geturl
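geturl:
Returns the real URL of the page fetched. This is useful because urlopen (or the opener object used) may have followed a redirect, so the URL of the page fetched may differ from the URL requested.
info:
Returns a dictionary-like object that describes the page fetched, particularly the headers sent by the server. It is currently an httplib.HTTPMessage instance.
Example:
from urllib2 import Request, urlopen
url = 'https://passport.baidu.com/center?_t=1510744860'
req = Request(url)
response = urlopen(req)
print response.info()    # the response headers
print response.geturl()  # the final URL, after any redirects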
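The effect of a redirect on geturl can be reproduced locally. The sketch below is illustrative only: it starts a throwaway HTTP server on 127.0.0.1 that redirects /old to /new, then fetches /old. The handler class, the paths and the response body are all hypothetical, and the fallback imports are an assumption for Python 3.

```python
import threading

try:  # Python 2
    from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer
    from urllib2 import build_opener, ProxyHandler
except ImportError:  # Python 3
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.request import build_opener, ProxyHandler


class RedirectingHandler(BaseHTTPRequestHandler):
    """Hypothetical test server: /old redirects to /new."""

    def do_GET(self):
        if self.path == '/old':
            self.send_response(302)
            self.send_header('Location', '/new')
            self.send_header('Content-Length', '0')
            self.end_headers()
        else:
            payload = b'hello'
            self.send_response(200)
            self.send_header('Content-Type', 'text/plain')
            self.send_header('Content-Length', str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep the example output quiet


# Port 0 lets the operating system pick a free port
server = HTTPServer(('127.0.0.1', 0), RedirectingHandler)
thread = threading.Thread(target=server.serve_forever)
thread.daemon = True
thread.start()

base = 'http://127.0.0.1:%d' % server.server_address[1]
opener = build_opener(ProxyHandler({}))  # ignore any proxy settings

response = opener.open(base + '/old')
final_url = response.geturl()                       # URL after the redirect
content_type = response.info().get('Content-Type')  # a header from info()
body = response.read()
response.close()

server.shutdown()
server.server_close()

print(final_url)
print(content_type)
```

The requested URL ends in /old, while geturl reports the /new address that was actually served.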
9、Openers and Handlers
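Openers:
When you fetch a URL you use an opener (an instance of urllib2.OpenerDirector). Normally we use the default opener, via urlopen, but you can create custom openers with build_opener. You will want to create openers if you want to fetch URLs with specific handlers installed, for example to get an opener that handles cookies, or to get an opener that does not handle redirections.
The following shows how the handlers and the opener are set up when simulating a login through a proxy IP (which requires cookie handling). The fragment comes from inside a class, hence the self. prefix, and it needs import urllib2 and import cookielib.
# Route http requests through the proxy stored in self.proxy_url
self.proxy = urllib2.ProxyHandler({'http': self.proxy_url})
# Cookie jar plus the handler that reads and writes it
self.cookie = cookielib.LWPCookieJar()
self.cookie_handler = urllib2.HTTPCookieProcessor(self.cookie)
# Combine the handlers into one opener
self.opener = urllib2.build_opener(self.cookie_handler, self.proxy, urllib2.HTTPHandler)
Handlers:
Openers use handlers, and all the "heavy lifting" is done by the handlers. Each handler knows how to open URLs for a particular protocol, or how to handle one aspect of URL opening, for example HTTP redirections or HTTP cookies.
For more information on openers and handlers, see http://www.voidspace.org.uk/python/articles/urllib2.shtml#openers-and-handlers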
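To see which handlers an opener ends up with, its handlers list can be inspected. A minimal sketch, assuming the Python 3 fallback imports; cookie_jar and handler_names are illustrative names.

```python
try:  # Python 2
    import urllib2
    import cookielib
except ImportError:  # Python 3
    import urllib.request as urllib2
    import http.cookiejar as cookielib

cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))

# build_opener adds the default handlers next to the ones passed in
handler_names = sorted(h.__class__.__name__ for h in opener.handlers)
for name in handler_names:
    print(name)

# urllib2.install_opener(opener) would make urlopen use this opener globally
```

The list contains the cookie processor that was passed in plus defaults such as HTTPHandler and HTTPRedirectHandler.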
10、Proxies
Create an opener that sends requests through a proxy IP.
Note: Currently urllib2 does not support fetching of https locations through a proxy. This can be a problem.
(http://www.voidspace.org.uk/python/articles/urllib2.shtml#proxies)
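Example:
import urllib2
proxy_handler = urllib2.ProxyHandler({'http': '54.186.78.110:3128'})  # make sure this proxy IP is actually usable
opener = urllib2.build_opener(proxy_handler)
# url, post_data and login_headers are assumed to be defined earlier;
# this example also submits POST data and header information
request = urllib2.Request(url, post_data, login_headers)
response = opener.open(request)
print response.read()  # read() already returns a byte string, so no encode() is needed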
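A proxy opener can be checked before any request is made. In this sketch the proxy address 127.0.0.1:3128 is a placeholder, not a working proxy, and the fallback import is an assumption for Python 3.

```python
try:  # Python 2
    import urllib2
except ImportError:  # Python 3
    import urllib.request as urllib2

# Placeholder address; replace it with a proxy that is actually reachable
proxy_handler = urllib2.ProxyHandler({'http': 'http://127.0.0.1:3128'})
opener = urllib2.build_opener(proxy_handler)

# The custom handler replaces the default ProxyHandler instead of adding a second one
proxy_handlers = [h for h in opener.handlers
                  if isinstance(h, urllib2.ProxyHandler)]
print(len(proxy_handlers))
print(proxy_handler.proxies)
```

Passing an empty dict, ProxyHandler({}), is the usual way to make an opener ignore proxy settings from the environment.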
11、Sockets and Layers
Example:
import socket
import urllib2
# timeout in seconds
timeout = 10
socket.setdefaulttimeout(timeout)
# this call to urllib2.urlopen now uses the default timeout
# we have set in the socket module
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
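Because socket.setdefaulttimeout changes a process-wide setting, it can be worth saving and restoring the previous value. A small sketch of that pattern; the variable names are illustrative.

```python
import socket

previous = socket.getdefaulttimeout()  # None means "no timeout"
socket.setdefaulttimeout(10)           # applies to sockets created from now on
current = socket.getdefaulttimeout()
print(current)

# ... requests made here would use the 10 second default ...

socket.setdefaulttimeout(previous)     # restore the earlier setting
restored = socket.getdefaulttimeout()

# From Python 2.6 on, urlopen also accepts a per-call timeout argument:
#     urllib2.urlopen(req, timeout=10)
```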
12、Cookie
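urllib2 can also handle cookies automatically once an HTTPCookieProcessor is installed in the opener. To read the value of a particular cookie, you can do this:
Example:
import urllib2
import cookielib
# The CookieJar collects every cookie the server sets
cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('http://www.baidu.com')
for item in cookie:
    print 'Name = ' + item.name
    print 'Value = ' + item.value
Running this prints the cookies set when visiting Baidu: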
Name = BAIDUID
Value = C664216C4F7BD6B98DB0B300292E0A23:FG=1
Name = BIDUPSID
Value = C664216C4F7BD6B98DB0B300292E0A23
Name = H_PS_PSSID
Value = 1464_21099_17001_24879_22159
Name = PSTM
Value = 1510747061
Name = BDSVRTM
Value = 0
Name = BD_HOME
Value = 0
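13、Dealing with anti-hotlinking
Some sites use so-called anti-hotlinking protection. The idea is simple:
the server checks whether the Referer header of your request points to its own site.
So all we need to do is set the Referer in headers to that site. Taking baidu as an example:
#...
headers = {
    'Referer': 'http://www.baidu.com/'
}
#...
headers is a dict, so you can put in any header you like to disguise the request.
For example, some sites read X-Forwarded-For from the headers to determine the client's real IP, so you can simply change X-Forwarded-For.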
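Both headers can be attached to a Request and checked before sending. This is a sketch under assumptions: the IP 203.0.113.7 comes from a range reserved for documentation, and the fallback import covers Python 3. Request stores header names capitalized, so X-Forwarded-For appears as X-forwarded-for in the stored items.

```python
try:  # Python 2
    from urllib2 import Request
except ImportError:  # Python 3
    from urllib.request import Request

req = Request('http://www.baidu.com/')
req.add_header('Referer', 'http://www.baidu.com/')
req.add_header('X-Forwarded-For', '203.0.113.7')  # placeholder IP

# header_items() lists every header that would be sent with the request
sent = dict(req.header_items())
for name in sorted(sent):
    print('%s: %s' % (name, sent[name]))
```

HTTP header names are case-insensitive, so the changed capitalization does not affect how the server reads them.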