A brief summary of the Python urllib2 library

A brief introduction to urllib2
Reference: http://www.voidspace.org.uk/python/articles/urllib2.shtml

Fetching URLs
The simplest way to use urllib2 is as follows:
1、
import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()

2、
import urllib2
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
the_page = response.read()
3、
req = urllib2.Request('ftp://example.com/')
4、
import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}

data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()
5、
Data can also be passed in an HTTP GET request by encoding it in the URL itself.

>>> import urllib2
>>> import urllib
>>> data = {}
>>> data['name'] = 'Somebody Here'
>>> data['location'] = 'Northampton'
>>> data['language'] = 'Python'
>>> url_values = urllib.urlencode(data)
>>> print url_values
name=Somebody+Here&language=Python&location=Northampton
>>> url = 'http://www.example.com/example.cgi'
>>> full_url = url + '?' + url_values
>>> data = urllib2.urlopen(full_url)
6、
import urllib
import urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python'}
headers = {'User-Agent': user_agent}

data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()

7、Handling Exceptions
1)URLError
>>> req = urllib2.Request('http://www.jianshu.com/p/5c7a1af4aa531')
>>> try:
...     urllib2.urlopen(req)
... except urllib2.URLError, e:
...     print e.reason        # why the request failed
...     print e, e.code       # the error itself and, for HTTP errors, the numeric status code

2)HTTPError
HTTPError is the subclass of URLError raised in the specific case of HTTP URLs.
Every HTTP response from the server contains a numeric "status code". Sometimes the status code indicates that the server is unable to fulfil the request. The HTTPError instance raised has an integer 'code' attribute corresponding to the status sent by the server, and it can also be read like a normal response object.
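A minimal sketch of working with an HTTPError directly (the URL is just a placeholder for a page that returns an error status):

import urllib2

try:
    urllib2.urlopen('http://www.example.com/no-such-page')
except urllib2.HTTPError, e:
    # an HTTPError is itself a (non-successful) response object
    print e.code          # the numeric status code, e.g. 404
    print e.read()[:200]  # the beginning of the error page the server sent back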
3)Error Codes
100: ('Continue', 'Request received, please continue'),
101: ('Switching Protocols', 'Switching to new protocol; obey Upgrade header'),

200: ('OK', 'Request fulfilled, document follows'),
201: ('Created', 'Document created, URL follows'),
202: ('Accepted', 'Request accepted, processing continues off-line'),
203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
204: ('No Content', 'Request fulfilled, nothing follows'),
205: ('Reset Content', 'Clear input form for further input.'),
206: ('Partial Content', 'Partial content follows.'),

300: ('Multiple Choices', 'Object has several resources -- see URI list'),
301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
302: ('Found', 'Object moved temporarily -- see URI list'),
303: ('See Other', 'Object moved -- see Method and URL list'),
304: ('Not Modified', 'Document has not changed since given time'),
305: ('Use Proxy', 'You must use proxy specified in Location to access this resource.'),
307: ('Temporary Redirect', 'Object moved temporarily -- see URI list'),

400: ('Bad Request', 'Bad request syntax or unsupported method'),
401: ('Unauthorized', 'No permission -- see authorization schemes'),
402: ('Payment Required', 'No payment -- see charging schemes'),
403: ('Forbidden', 'Request forbidden -- authorization will not help'),
404: ('Not Found', 'Nothing matches the given URI'),
405: ('Method Not Allowed', 'Specified method is invalid for this server.'),
406: ('Not Acceptable', 'URI not available in preferred format.'),
407: ('Proxy Authentication Required', 'You must authenticate with this proxy before proceeding.'),
408: ('Request Timeout', 'Request timed out; try again later.'),
409: ('Conflict', 'Request conflict.'),
410: ('Gone', 'URI no longer exists and has been permanently removed.'),
411: ('Length Required', 'Client must specify Content-Length.'),
412: ('Precondition Failed', 'Precondition in headers is false.'),
413: ('Request Entity Too Large', 'Entity is too large.'),
414: ('Request-URI Too Long', 'URI is too long.'),
415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
416: ('Requested Range Not Satisfiable', 'Cannot satisfy request range.'),
417: ('Expectation Failed', 'Expect condition could not be satisfied.'),

500: ('Internal Server Error', 'Server got itself in trouble'),
501: ('Not Implemented', 'Server does not support this operation'),
502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
503: ('Service Unavailable', 'The server cannot process the request due to a high load'),
504: ('Gateway Timeout', 'The gateway server did not receive a timely response'),
505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
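If you only need to turn a status code into its meaning at runtime, the standard library already ships this table; a small sketch (Python 2, where urllib2 lives):

import httplib
import BaseHTTPServer

# short reason phrase, e.g. 'Not Found'
print httplib.responses[404]

# the (short, long) explanation pair, i.e. the same table reproduced above
print BaseHTTPServer.BaseHTTPRequestHandler.responses[404]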
4)
Example 1:
from urllib2 import Request, urlopen, URLError, HTTPError
req = Request(someurl)
try:
    response = urlopen(req)
except HTTPError, e:
    print 'The server couldn\'t fulfill the request.'
    print 'Error code: ', e.code
except URLError, e:
    print 'We failed to reach a server.'
    print 'Reason: ', e.reason
else:
    pass  # everything is fine

Note: HTTPError is a subclass of URLError, so the HTTPError clause must come first.
Example 2:
from urllib2 import Request, urlopen, URLError
req = Request(someurl)
try:
    response = urlopen(req)
except URLError, e:
    if hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    elif hasattr(e, 'code'):
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
else:
    pass  # everything is fine

Example 3:
from urllib2 import Request, urlopen
req = Request(someurl)
try:
    response = urlopen(req)
except IOError, e:
    if hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    elif hasattr(e, 'code'):
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
else:
    pass  # everything is fine
Note: URLError is a subclass of IOError, so catching IOError covers both; in rare cases a socket.error may still be raised instead.

8、info and geturl
geturl:
This returns the real URL of the page that was fetched. It is useful because urlopen (or the opener object used) may have followed a redirect, so the URL of the page fetched may not be the same as the URL requested.
info:
This returns a dictionary-like object that describes the page fetched, in particular the headers sent by the server. It is currently an httplib.HTTPMessage instance.
Example:
from urllib2 import Request, urlopen

url = 'https://passport.baidu.com/center?_t=1510744860'
req = Request(url)
response = urlopen(req)
print response.info()
print response.geturl()

9、Openers and Handlers
Openers:

  When you fetch a URL you use an opener (an instance of urllib2.OpenerDirector). Normally we use the default opener via urlopen, but you can create customized openers with build_opener. This is typically useful when you need to handle cookies or when you do not want redirects to be followed (You will want to create openers if you want to fetch URLs with specific handlers installed, for example to get an opener that handles cookies, or to get an opener that does not handle redirections.)

  Below is the concrete handler/opener flow used when simulating a login through a proxy IP (where cookies need to be handled); the snippet comes from inside a crawler class, so the objects are stored on self.

self.proxy = urllib2.ProxyHandler({'http': self.proxy_url})    # proxy handler (self.proxy_url holds the proxy address)
self.cookie = cookielib.LWPCookieJar()                          # cookie jar that can be saved to / loaded from disk
self.cookie_handler = urllib2.HTTPCookieProcessor(self.cookie)  # handler that stores cookies in the jar
self.opener = urllib2.build_opener(self.cookie_handler, self.proxy, urllib2.HTTPHandler)
Handlers:

  Openers use handlers; all of the "heavy lifting" is done by the handlers. Each handler knows how to open URLs for a particular protocol, or how to handle some aspect of opening URLs, such as HTTP redirects or HTTP cookies.

More on Openers and Handlers: http://www.voidspace.org.uk/python/articles/urllib2.shtml#openers-and-handlers
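As a separate minimal sketch (not part of the login flow above), a custom opener can also be installed globally with install_opener, so that plain urllib2.urlopen uses it from then on:

import urllib2
import cookielib

cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))

# make this opener the default one used by urllib2.urlopen
urllib2.install_opener(opener)

response = urllib2.urlopen('http://www.baidu.com')
print response.geturl()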

10、Proxies
Creating an opener that uses a proxy IP

Note:Currently urllib2 does not support fetching of https locations through a proxy. This can be a problem.
(http://www.voidspace.org.uk/python/articles/urllib2.shtml#proxies)

Example:
import urllib2

proxy_handler = urllib2.ProxyHandler({'http': '54.186.78.110:3128'})  # make sure this proxy IP is actually usable
opener = urllib2.build_opener(proxy_handler)
request = urllib2.Request(url, post_data, login_headers)  # this example also submits post_data and header information
response = opener.open(request)
print response.read().decode('utf-8')  # decode (not encode): the body is a UTF-8 byte string

11、Sockets and Layers
Example:
import socket
import urllib2

# timeout in seconds
timeout = 10
socket.setdefaulttimeout(timeout)

# this call to urllib2.urlopen now uses the default timeout
# we have set in the socket module
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
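Since Python 2.6, urlopen also accepts a per-call timeout argument, so setting the global socket default is not the only option; a minimal sketch:

import urllib2

# this timeout (in seconds) applies only to this call
response = urllib2.urlopen('http://www.voidspace.org.uk', timeout=10)
print response.getcode()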

12、Cookie
urllib2 can also handle cookies automatically. If you need to get the value of a particular cookie item, you can do the following:
Example:
import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('http://www.baidu.com')
for item in cookie:
    print 'Name = ' + item.name
    print 'Value = ' + item.value

After running this, the cookies set by Baidu are printed:
Name = BAIDUID
Value = C664216C4F7BD6B98DB0B300292E0A23:FG=1
Name = BIDUPSID
Value = C664216C4F7BD6B98DB0B300292E0A23
Name = H_PS_PSSID
Value = 1464_21099_17001_24879_22159
Name = PSTM
Value = 1510747061
Name = BDSVRTM
Value = 0
Name = BD_HOME
Value = 0
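The LWPCookieJar used in the opener flow of section 9 can also persist cookies to disk between runs; a minimal sketch (the file name 'cookies.txt' is just an example):

import urllib2
import cookielib

cookie_jar = cookielib.LWPCookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
opener.open('http://www.baidu.com')
cookie_jar.save('cookies.txt', ignore_discard=True)   # write the cookies to disk

# in a later run, reload them before building a new opener
cookie_jar.load('cookies.txt', ignore_discard=True)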
13、Dealing with "anti-leech" (Referer) checks
Some sites have so-called anti-leech (hotlink protection) settings. In essence it is very simple:
the server checks whether the Referer in the headers of the request you send points to the site itself,
so all we need to do is set the Referer in headers to that site. Taking Baidu as an example:
#...
headers = {
    'Referer': 'http://www.baidu.com/'
}
#...
headers is a dict, so you can put in any header you want in order to disguise the request.
For example, some sites like to read the X-Forwarded-For header to find the client's real IP, so you can simply change X-Forwarded-For as well, as in the sketch below.
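Putting the two headers together, a minimal sketch (the image URL and the forged IP are placeholders, purely for illustration):

import urllib2

headers = {
    'Referer': 'http://www.baidu.com/',
    'X-Forwarded-For': '1.2.3.4',   # a forged client IP, purely illustrative
}
req = urllib2.Request('http://www.baidu.com/some-image.png', headers=headers)
response = urllib2.urlopen(req)
print response.getcode()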
