1 The urllib.parse module
The urllib.parse module is part of the urllib package.
Import
>>> from urllib import parse
Functions and other names defined in the urllib.parse module
>>> dir(parse)
['DefragResult', 'DefragResultBytes', 'MAX_CACHE_SIZE', 'ParseResult', 'ParseResultBytes', 'Quoter', 'ResultBase', 'SplitResult', 'SplitResultBytes', '_ALWAYS_SAFE', '_ALWAYS_SAFE_BYTES', '_DefragResultBase', '_NetlocResultMixinBase', '_NetlocResultMixinBytes', '_NetlocResultMixinStr', '_ParseResultBase', '_ResultMixinBytes', '_ResultMixinStr', '_SplitResultBase', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '_asciire', '_coerce_args', '_decode_args', '_encode_result', '_hexdig', '_hextobyte', '_hostprog', '_implicit_encoding', '_implicit_errors', '_noop', '_parse_cache', '_portprog', '_safe_quoters', '_splitnetloc', '_splitparams', '_typeprog', 'clear_cache', 'collections', 'namedtuple', 'non_hierarchical', 'parse_qs', 'parse_qsl', 'quote', 'quote_from_bytes', 'quote_plus', 're', 'scheme_chars', 'splitattr', 'splithost', 'splitnport', 'splitpasswd', 'splitport', 'splitquery', 'splittag', 'splittype', 'splituser', 'splitvalue', 'sys', 'to_bytes', 'unquote', 'unquote_plus', 'unquote_to_bytes', 'unwrap', 'urldefrag', 'urlencode', 'urljoin', 'urlparse', 'urlsplit', 'urlunparse', 'urlunsplit', 'uses_fragment', 'uses_netloc', 'uses_params', 'uses_query', 'uses_relative']
The urlparse function of urllib.parse splits a URL into six components:
>>> url='http://blog.csdn.net/xu_clark'
>>> parse.urlparse(url)
ParseResult(scheme='http', netloc='blog.csdn.net', path='/xu_clark', params='', query='', fragment='')
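As a small sketch of how the six components can be read back, the snippet below parses the URL used above together with a second URL that fills every field. The second URL (example.com with a port, params, query, and fragment) is a made-up illustration and does not come from the original post.

```python
from urllib import parse

# The URL from the post, plus a hypothetical URL that populates all six fields.
simple = parse.urlparse('http://blog.csdn.net/xu_clark')
full = parse.urlparse('http://example.com:8080/a/b;type=x?q=1&lang=en#top')

# ParseResult is a named tuple, so components are available by attribute name.
print(simple.scheme, simple.netloc, simple.path)
print(full.params, full.query, full.fragment)

# hostname and port are convenience attributes derived from netloc.
print(full.hostname, full.port)

# urlunparse reassembles the six components into a URL string.
print(parse.urlunparse(full))
```

Because the result is a tuple, it can also be unpacked positionally, for example `scheme, netloc, path, params, query, fragment = full`.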
The urljoin function of urllib.parse:
urljoin(base, url, allow_fragments=True)
Join a base URL and a possibly relative URL to form an absolute
interpretation of the latter.
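To make the help text above concrete, here is a minimal sketch of urljoin resolving several kinds of reference against one base. The base URL and the relative paths are hypothetical values chosen for illustration.

```python
from urllib.parse import urljoin

# Hypothetical base URL: a page several directories deep.
base = 'http://blog.csdn.net/xu_clark/article/details/1.html'

sibling = urljoin(base, '2.html')                  # same directory as the base page
parent = urljoin(base, '../list')                  # one directory up
rooted = urljoin(base, '/about')                   # relative to the site root
absolute = urljoin(base, 'http://www.csdn.net/')   # an absolute URL replaces the base

for result in (sibling, parent, rooted, absolute):
    print(result)
```

This is the usual way to turn the href values found in a downloaded page into URLs that can be fetched.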
2 The urllib.request module
Functions and other names defined in the module
>>> from urllib import request
>>> dir(request)
['AbstractBasicAuthHandler', 'AbstractDigestAuthHandler', 'AbstractHTTPHandler', 'BaseHandler', 'CacheFTPHandler', 'ContentTooShortError', 'DataHandler', 'FTPHandler', 'FancyURLopener', 'FileHandler', 'HTTPBasicAuthHandler', 'HTTPCookieProcessor', 'HTTPDefaultErrorHandler', 'HTTPDigestAuthHandler', 'HTTPError', 'HTTPErrorProcessor', 'HTTPHandler', 'HTTPPasswordMgr', 'HTTPPasswordMgrWithDefaultRealm', 'HTTPPasswordMgrWithPriorAuth', 'HTTPRedirectHandler', 'HTTPSHandler', 'MAXFTPCACHE', 'OpenerDirector', 'ProxyBasicAuthHandler', 'ProxyDigestAuthHandler', 'ProxyHandler', 'Request', 'URLError', 'URLopener', 'UnknownHandler', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '__version__', '_cut_port_re', '_ftperrors', '_have_ssl', '_localhost', '_noheaders', '_opener', '_parse_proxy', '_proxy_bypass_macosx_sysconf', '_randombytes', '_safe_gethostbyname', '_thishost', '_url_tempfiles', 'addclosehook', 'addinfourl', 'base64', 'bisect', 'build_opener', 'collections', 'contextlib', 'email', 'ftpcache', 'ftperrors', 'ftpwrapper', 'getproxies', 'getproxies_environment', 'getproxies_registry', 'hashlib', 'http', 'install_opener', 'io', 'localhost', 'noheaders', 'os', 'parse_http_list', 'parse_keqv_list', 'pathname2url', 'posixpath', 'proxy_bypass', 'proxy_bypass_environment', 'proxy_bypass_registry', 'quote', 're', 'request_host', 'socket', 'splitattr', 'splithost', 'splitpasswd', 'splitport', 'splitquery', 'splittag', 'splittype', 'splituser', 'splitvalue', 'ssl', 'sys', 'tempfile', 'thishost', 'time', 'to_bytes', 'unquote', 'unquote_to_bytes', 'unwrap', 'url2pathname', 'urlcleanup', 'urljoin', 'urlopen', 'urlparse', 'urlretrieve', 'urlsplit', 'urlunparse', 'warnings']
The urllib.request.urlopen function:
Help on function urlopen in module urllib.request:
urlopen(url, data=None, timeout=<object object at 0x00168D70>, *, cafile=None, capath=None, cadefault=False, context=None)
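HTTP GET requests:
The request string sent to the web server (URL-encoded key-value pairs or quoted text) is part of the URL string (urlstr).
HTTP POST requests:
The request string (URL-encoded key-value pairs or quoted text) is passed separately in the postQueryData variable. In the signature above this is the data parameter, and in Python 3 a str must be encoded to bytes before it is passed.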
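The difference between the two request styles can be sketched without touching the network by building Request objects and inspecting them. The search URL and the parameter names below are hypothetical. The sketch assumes that the data parameter of urlopen and Request plays the role the text calls postQueryData.

```python
from urllib import parse, request

# Hypothetical form fields; urlencode turns them into a URL-encoded string.
params = {'q': 'python urllib', 'page': 2}
query = parse.urlencode(params)

# GET: the encoded string is appended to the URL itself.
get_req = request.Request('http://example.com/search?' + query)

# POST: the encoded string travels in the request body and must be bytes.
post_req = request.Request('http://example.com/search',
                           data=query.encode('ascii'))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

Either object can then be handed to request.urlopen() in place of a plain URL string.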
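If the HTTP connection succeeds, urlopen returns a file-like object (a handle).
The handle supports read(), readline(), readlines(), close(), and fileno(). info() returns the response headers (the MIME headers of the requested resource), and geturl() returns the final URL after any redirects.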
>>> url='http://www.csdn.net/'
>>> f=request.urlopen(url)
>>> f.geturl()
'http://www.csdn.net/'
>>> f.info()
<http.client.HTTPMessage object at 0x0301F2D0>
>>> f.readline()
b'<!DOCTYPE HTML>\n'
>>> f.readline()
b'<html>\n'
>>> f.readline()
b'<head>\n'
>>> f.close()
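The same handle methods can be tried without a network connection. The sketch below assumes a data: URL, which urlopen serves through the DataHandler class that appears in the dir(request) listing. The URL content is made up for illustration. A with block is used so that the handle is closed automatically.

```python
from urllib import request

# Hypothetical data: URL holding two percent-encoded lines of text.
url = 'data:text/plain;charset=utf-8,first%20line%0Asecond%20line%0A'

with request.urlopen(url) as f:
    final_url = f.geturl()                        # URL actually retrieved
    content_type = f.info().get_content_type()    # taken from the headers
    first = f.readline()                          # one line, as bytes
    rest = f.read()                               # everything that remains

print(final_url)
print(content_type)
print(first, rest)
```

For an http:// URL the calls are the same. Only the headers and the body differ.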
2.1 urllib.request.urlretrieve
Downloads the HTML document at a URL to a temporary file on the local disk.
urlretrieve(url, filename=None, reporthook=None, data=None)
Retrieve a URL into a temporary location on disk.
Requires a URL argument. If a filename is passed, it is used as
the temporary file location. The reporthook argument should be
a callable that accepts a block number, a read size, and the
total file size of the URL target. The data argument should be
valid URL encoded data.
If a filename is passed and the URL points to a local resource,
the result is a copy from local file to new file.
Returns a tuple containing the path to the newly created
data file as well as the resulting HTTPMessage object.
>>> filename1=r'D:\temp.html'
>>> request1=request.urlretrieve(url,filename=filename1)
>>> request1[0]
'D:\\temp.html'
>>> request1[1]
<http.client.HTTPMessage object at 0x0301FF90>
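In testing, the downloaded D:\temp.html still displays images and other elements when opened, so the page looks fairly complete. Note that urlretrieve saves only the HTML document itself; the images are presumably loaded from the URLs that the page references.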
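The reporthook argument described in the help text can be sketched as follows. To keep the example offline it relies on the local-resource case mentioned in the docstring and copies a temporary file through a file:// URL. The file names and the report function are hypothetical.

```python
import tempfile
from pathlib import Path
from urllib import request

progress = []  # (block number, block size, total size) for each callback

def report(block_num, block_size, total_size):
    """Record every progress callback made by urlretrieve."""
    progress.append((block_num, block_size, total_size))

with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / 'source.html'
    target = Path(tmp) / 'copy.html'
    source.write_bytes(b'<html><body>hello</body></html>\n')

    # A file:// URL stands in for a remote page; an http:// URL is passed the same way.
    path, headers = request.urlretrieve(source.as_uri(),
                                        filename=str(target),
                                        reporthook=report)
    copied = target.read_bytes()
    same_path = (path == str(target))

print(same_path, headers['Content-Length'], progress)
```

A real hook would typically turn block_num * block_size / total_size into a percentage for display.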
Using the http.client.HTTPMessage object:
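A minimal sketch of what can be done with such an object follows. In practice the object comes from f.info() or from the second element of the urlretrieve result. Here it is created directly and filled with hypothetical header values so that the example runs without a network connection.

```python
from http.client import HTTPMessage

# Hypothetical headers set by hand; a real response fills these in itself.
headers = HTTPMessage()
headers['Content-Type'] = 'text/html; charset=utf-8'
headers['Content-Length'] = '1024'

print(headers['content-type'])              # header names are case-insensitive
print(headers.get_content_type())           # media type without its parameters
print(headers.get_content_charset())        # charset parameter, if present
print(headers.get('Last-Modified', 'n/a'))  # default value for a missing header
print(headers.items())                      # all (name, value) pairs
```

HTTPMessage inherits these methods from email.message.Message, so the documentation of that class applies here as well.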
2.2 Secure Sockets Layer (SSL)
(1)urllib.request.ssl
(2)poplib.ssl
>>> help(poplib.ssl)
Help on module ssl:
NAME
ssl - This module provides some more Pythonic support for SSL.
Object types:
SSLSocket -- subtype of socket.socket which does SSL over the socket
(3)smtplib.ssl
>>> from smtplib import ssl
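The three names listed above appear to be the same standard-library ssl module, re-exported because each of those modules imports it. The sketch below checks that assumption. It also creates a default SSL context, which could be passed to urlopen through the context parameter shown in its signature.

```python
import poplib
import smtplib
import ssl
from urllib import request

# Each attribute is expected to be the one ssl module object.
same_module = (request.ssl is ssl, poplib.ssl is ssl, smtplib.ssl is ssl)
print(same_module)

# A default context verifies server certificates and host names.
context = ssl.create_default_context()
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```

Importing ssl directly is therefore equivalent to `from smtplib import ssl`, and it is the clearer way to write it.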
2.3 The basic authentication handler request.HTTPBasicAuthHandler
(1)builtins.object
class object
| The most base type
(2)urllib.request.BaseHandler
class BaseHandler(builtins.object)
| Methods defined here:
| __lt__(self, other)
| Return self<value.
| add_parent(self, parent)
| close(self)
(3)urllib.request.AbstractBasicAuthHandler
class AbstractBasicAuthHandler(builtins.object)
| Methods defined here:
| __init__(self, password_mgr=None)
| Initialize self. See help(type(self)) for accurate signature.
| http_error_auth_reqed(self, authreq, host, req, headers)
| http_request(self, req)
| http_response(self, req, response)
| https_response = http_response(self, req, response)
| retry_http_basic_auth(self, host, req, realm)
(4)request.HTTPBasicAuthHandler
class HTTPBasicAuthHandler(AbstractBasicAuthHandler, BaseHandler)
| Method resolution order:
| HTTPBasicAuthHandler
| AbstractBasicAuthHandler
| BaseHandler
| builtins.object
| Methods defined here:
| http_error_401(self, req, fp, code, msg, headers)
| ----------------------------------------------------------------------
| Data and other attributes defined here:
| auth_header = 'Authorization'
| ----------------------------------------------------------------------
| Methods inherited from AbstractBasicAuthHandler:
| __init__(self, password_mgr=None)
| Initialize self. See help(type(self)) for accurate signature.
| http_error_auth_reqed(self, authreq, host, req, headers)
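The class hierarchy above can be put to use as in the following sketch, which wires an HTTPBasicAuthHandler into an opener. The URL, user name, and password are hypothetical, and no request is actually sent. The password manager class used here, HTTPPasswordMgrWithDefaultRealm, appears in the dir(request) listing.

```python
from urllib import request

# Hypothetical protected area and credentials.
top_url = 'http://example.com/protected/'

# A realm of None makes the credentials apply to any realm under top_url.
password_mgr = request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, top_url, 'alice', 's3cret')

# The handler consults the manager when a server answers 401 (http_error_401).
auth_handler = request.HTTPBasicAuthHandler(password_mgr)
opener = request.build_opener(auth_handler)

# Look up the credentials that would be sent for a page below top_url.
credentials = password_mgr.find_user_password('any realm', top_url + 'page.html')
print(auth_handler.auth_header, credentials)
```

Calling opener.open(url) would then perform the request, and request.install_opener(opener) would make plain urlopen calls use the same handler.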