Python Networking Basics: The urllib.request and urllib.parse Modules

*************************************************

** Please credit the original when reposting, and respect original work

** Original source: blog.csdn.net/clark_xu, Xu Changliang's column

*************************************************

1 The urllib.parse module

The urllib.parse module is part of the urllib package.

Import it with:

>>> from urllib import parse

Names defined in the urllib.parse module:

>>> dir(parse)

['DefragResult', 'DefragResultBytes', 'MAX_CACHE_SIZE', 'ParseResult', 'ParseResultBytes', 'Quoter', 'ResultBase', 'SplitResult', 'SplitResultBytes', '_ALWAYS_SAFE', '_ALWAYS_SAFE_BYTES', '_DefragResultBase', '_NetlocResultMixinBase', '_NetlocResultMixinBytes', '_NetlocResultMixinStr', '_ParseResultBase', '_ResultMixinBytes', '_ResultMixinStr', '_SplitResultBase', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '_asciire', '_coerce_args', '_decode_args', '_encode_result', '_hexdig', '_hextobyte', '_hostprog', '_implicit_encoding', '_implicit_errors', '_noop', '_parse_cache', '_portprog', '_safe_quoters', '_splitnetloc', '_splitparams', '_typeprog', 'clear_cache', 'collections', 'namedtuple', 'non_hierarchical', 'parse_qs', 'parse_qsl', 'quote', 'quote_from_bytes', 'quote_plus', 're', 'scheme_chars', 'splitattr', 'splithost', 'splitnport', 'splitpasswd', 'splitport', 'splitquery', 'splittag', 'splittype', 'splituser', 'splitvalue', 'sys', 'to_bytes', 'unquote', 'unquote_plus', 'unquote_to_bytes', 'unwrap', 'urldefrag', 'urlencode', 'urljoin', 'urlparse', 'urlsplit', 'urlunparse', 'urlunsplit', 'uses_fragment', 'uses_netloc', 'uses_params', 'uses_query', 'uses_relative']

The urlparse function of urllib.parse splits a URL into six components:

>>> url = 'http://blog.csdn.net/xu_clark'

>>> parse.urlparse(url)

ParseResult(scheme='http', netloc='blog.csdn.net', path='/xu_clark', params='', query='', fragment='')
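
The URL above leaves four of the six components empty. As a minimal sketch, the following uses a made-up URL (the host example.com, the path, and the query are assumptions for illustration, not from the original post) so that every component is populated:

```python
from urllib import parse

# Hypothetical URL chosen so that all six components are non-empty.
sample = 'http://example.com:8080/docs/page.html;v=1?lang=en&q=python#intro'

parts = parse.urlparse(sample)
print(parts.scheme)    # http
print(parts.netloc)    # example.com:8080
print(parts.path)      # /docs/page.html
print(parts.params)    # v=1
print(parts.query)     # lang=en&q=python
print(parts.fragment)  # intro

# hostname and port are convenience attributes derived from netloc.
print(parts.hostname, parts.port)

# parse_qs turns the query string into a dict mapping each key to a list of values.
query = parse.parse_qs(parts.query)
print(query)

# urlunparse reassembles the six components into a URL string.
rebuilt = parse.urlunparse(parts)
print(rebuilt == sample)
```

Because ParseResult is a named tuple, the components can also be unpacked positionally or accessed by index.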

The urljoin function of urllib.parse:

urljoin(base, url, allow_fragments=True)

Join a base URL and a possibly relative URL to form an absolute

interpretation of the latter.
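
A short sketch of how urljoin resolves different kinds of references against a base URL. The base URL and the relative references here are hypothetical, chosen only to show the common cases:

```python
from urllib.parse import urljoin

# Hypothetical base URL used for illustration.
base = 'http://example.com/a/b/page.html'

# A bare file name replaces the last path segment of the base.
sibling = urljoin(base, 'other.html')

# '..' climbs one directory before resolving.
parent = urljoin(base, '../up.html')

# A leading '/' resolves against the site root.
rooted = urljoin(base, '/root.html')

# A scheme-relative reference keeps only the scheme of the base.
scheme_relative = urljoin(base, '//cdn.example.org/lib.js')

# An absolute URL is returned unchanged.
absolute = urljoin(base, 'http://other.example.org/x')

for result in (sibling, parent, rooted, scheme_relative, absolute):
    print(result)
```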

2 The urllib.request module

Names defined in the module:

>>> from urllib import request

>>> dir(request)

['AbstractBasicAuthHandler', 'AbstractDigestAuthHandler', 'AbstractHTTPHandler', 'BaseHandler', 'CacheFTPHandler', 'ContentTooShortError', 'DataHandler', 'FTPHandler', 'FancyURLopener', 'FileHandler', 'HTTPBasicAuthHandler', 'HTTPCookieProcessor', 'HTTPDefaultErrorHandler', 'HTTPDigestAuthHandler', 'HTTPError', 'HTTPErrorProcessor', 'HTTPHandler', 'HTTPPasswordMgr', 'HTTPPasswordMgrWithDefaultRealm', 'HTTPPasswordMgrWithPriorAuth', 'HTTPRedirectHandler', 'HTTPSHandler', 'MAXFTPCACHE', 'OpenerDirector', 'ProxyBasicAuthHandler', 'ProxyDigestAuthHandler', 'ProxyHandler', 'Request', 'URLError', 'URLopener', 'UnknownHandler', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '__version__', '_cut_port_re', '_ftperrors', '_have_ssl', '_localhost', '_noheaders', '_opener', '_parse_proxy', '_proxy_bypass_macosx_sysconf', '_randombytes', '_safe_gethostbyname', '_thishost', '_url_tempfiles', 'addclosehook', 'addinfourl', 'base64', 'bisect', 'build_opener', 'collections', 'contextlib', 'email', 'ftpcache', 'ftperrors', 'ftpwrapper', 'getproxies', 'getproxies_environment', 'getproxies_registry', 'hashlib', 'http', 'install_opener', 'io', 'localhost', 'noheaders', 'os', 'parse_http_list', 'parse_keqv_list', 'pathname2url', 'posixpath', 'proxy_bypass', 'proxy_bypass_environment', 'proxy_bypass_registry', 'quote', 're', 'request_host', 'socket', 'splitattr', 'splithost', 'splitpasswd', 'splitport', 'splitquery', 'splittag', 'splittype', 'splituser', 'splitvalue', 'ssl', 'sys', 'tempfile', 'thishost', 'time', 'to_bytes', 'unquote', 'unquote_to_bytes', 'unwrap', 'url2pathname', 'urlcleanup', 'urljoin', 'urlopen', 'urlparse', 'urlretrieve', 'urlsplit', 'urlunparse', 'warnings']

The urllib.request.urlopen function:

Help on function urlopen in module urllib.request:

urlopen(url, data=None, timeout=<object object at 0x00168D70>, *, cafile=None, capath=None, cadefault=False, context=None)

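HTTP GET requests:

The request string sent to the web server (URL-encoded key-value pairs, or quoted text) is part of the URL string itself (urlstr, passed as the url argument).

HTTP POST requests:

The request string (URL-encoded key-value pairs, or quoted text) goes in a separate variable (postQueryData, passed as the data argument); in Python 3 it must be a bytes object.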
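
The difference between the two request styles can be sketched without touching the network by building Request objects and inspecting them. The endpoint http://example.com/search and the parameter names are hypothetical:

```python
from urllib import parse, request

# Hypothetical query parameters; urlencode produces the encoded key-value string.
params = {'q': 'python urllib', 'page': 1}
encoded = parse.urlencode(params)
print(encoded)  # q=python+urllib&page=1

# GET: the encoded string is appended to the URL after '?'.
get_req = request.Request('http://example.com/search?' + encoded)

# POST: the encoded string is sent as the request body and must be bytes.
post_req = request.Request('http://example.com/search',
                           data=encoded.encode('ascii'))

# Request picks the HTTP method from whether data is present.
print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

Either Request object can then be passed to request.urlopen in place of a plain URL string.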

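If the HTTP request succeeds, urlopen returns a file-like object (a handle).

The handle supports the read(), readline(), readlines(), close(), and fileno() methods. info() returns the headers of the response, and geturl() returns the final URL after any redirects have been followed.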

>>> url='http://www.csdn.net/'

>>> f=request.urlopen(url)

>>> f.geturl()

'http://www.csdn.net/'

>>> f.info()

<http.client.HTTPMessage object at 0x0301F2D0>

>>> f.readline()

b'<!DOCTYPE HTML>\n'

>>> f.readline()

b'<html>\n'

>>> f.readline()

b'<head>\n'

>>> f.close()

>>>
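
The session above needs network access. As a sketch that runs offline, the same handle methods can be tried against a data: URL, which carries its own payload (the URL and its contents are made up for illustration). Using a with block closes the handle automatically instead of calling close() by hand:

```python
from urllib import request

# A data: URL embeds the content directly, so no network connection is needed.
# %20 is an encoded space and %0A an encoded newline.
url = 'data:text/plain;charset=utf-8,first%20line%0Asecond%20line%0A'

with request.urlopen(url) as f:
    final_url = f.geturl()                      # URL the data came from
    content_type = f.info().get_content_type()  # media type from the headers
    first = f.readline()                        # one line, as bytes
    rest = f.readlines()                        # remaining lines, as a list

print(final_url)
print(content_type)
print(first, rest)
```

With an http:// URL the calls are the same; only the source of the bytes differs.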

2.1 urllib.request.urlretrieve

Downloads the HTML to a temporary file on the local disk.

urlretrieve(url, filename=None, reporthook=None, data=None)

Retrieve a URL into a temporary location on disk.

Requires a URL argument. If a filename is passed, it is used as

the temporary file location. The reporthook argument should be

a callable that accepts a block number, a read size, and the

total file size of the URL target. The data argument should be

valid URL encoded data.

If a filename is passed and the URL points to a local resource,

the result is a copy from local file to new file.

Returns a tuple containing the path to the newly created

data file as well as the resulting HTTPMessage object.

>>> filename1=r'D:\temp.html'

>>> request1=request.urlretrieve(url,filename=filename1)

>>> request1[0]

'D:\\temp.html'

>>> request1[1]

<http.client.HTTPMessage object at 0x0301FF90>

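In testing, the downloaded D:\temp.html showed its images and other elements when opened, so the page looked fairly complete. (urlretrieve itself saves only the single resource at the given URL; it does not download the images a page links to.)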
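
The reporthook argument described in the help text is not exercised above. The following is a minimal sketch of it that runs offline by retrieving a file:// URL built from a temporary file (the file names and contents are assumptions for illustration); as the help text notes, this case is a copy from a local file to a new file:

```python
import tempfile
from pathlib import Path
from urllib import request

calls = []

def report(block_num, block_size, total_size):
    # urlretrieve calls this with the block number, the read size,
    # and the total size of the target.
    calls.append((block_num, block_size, total_size))

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical local "page" standing in for a remote resource.
    source = Path(tmp) / 'source.html'
    source.write_bytes(b'<html>' + b'x' * 20000 + b'</html>')
    target = Path(tmp) / 'copy.html'

    # Returns the path of the saved file and the headers object.
    saved_path, headers = request.urlretrieve(
        source.as_uri(), filename=str(target), reporthook=report)

    copied = target.read_bytes() == source.read_bytes()

print(saved_path)
print(copied)
print(len(calls), 'progress callbacks')
```

A hook like this is the usual place to print a progress percentage for large downloads.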

Using the http.client.HTTPMessage object:
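
As a sketch of what can be done with such an object, the example below builds an HTTPMessage from a hand-written header block using http.client.parse_headers, so that it runs without a network connection. The header values are invented for illustration; the object returned by info() or by urlretrieve for an HTTP URL is read the same way:

```python
import io
import http.client

# Hypothetical response headers, terminated by a blank line.
raw = (b'Content-Type: text/html; charset=utf-8\r\n'
       b'Content-Length: 1234\r\n'
       b'Server: ExampleServer\r\n'
       b'\r\n')

msg = http.client.parse_headers(io.BytesIO(raw))

# Header lookup is case-insensitive and works like a mapping.
print(msg['content-type'])

# Helpers inherited from email.message.Message split the Content-Type value.
print(msg.get_content_type())
print(msg.get_content_charset())

# get() returns a default for headers that are absent.
print(msg.get('X-Missing', 'n/a'))

# items() yields all (name, value) pairs in order.
for name, value in msg.items():
    print(name, '=', value)
```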

2.2 Secure Sockets Layer (SSL)

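urllib.request, poplib, and smtplib each import the standard ssl module, so it can be reached as an attribute of any of them.

(1) urllib.request.ssl

(2) poplib.ssl

>>> import poplib

>>> help(poplib.ssl)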

Help on module ssl:

NAME

ssl - This module provides some more Pythonic support for SSL.

Object types:

SSLSocket -- subtype of socket.socket which does SSL over the socket

(3) smtplib.ssl

>>> from smtplib import ssl
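
A minimal sketch tying these together: it checks whether the three attributes refer to the one standard ssl module, then builds a default SSL context of the kind that can be passed to urlopen through its context parameter. The helper name fetch_https is hypothetical and is defined but not called here, so the example runs without network access:

```python
import poplib
import smtplib
import ssl
import urllib.request

# Each module is expected to expose the same ssl module object on a
# standard CPython build with SSL support; getattr keeps this safe otherwise.
same_module = all(getattr(mod, 'ssl', None) is ssl
                  for mod in (urllib.request, poplib, smtplib))
print(same_module)

# A default context verifies server certificates and host names.
context = ssl.create_default_context()
print(context.verify_mode == ssl.CERT_REQUIRED)
print(context.check_hostname)

def fetch_https(url, timeout=10):
    # Hypothetical helper: opens an https:// URL using the context above.
    with urllib.request.urlopen(url, timeout=timeout, context=context) as f:
        return f.read()
```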

2.3 The basic authentication handler: request.HTTPBasicAuthHandler

(1) builtins.object

class object

| The most base type

(2) urllib.request.BaseHandler

class BaseHandler(builtins.object)

| Methods defined here:

| __lt__(self, other)

| Return self<value.

| add_parent(self, parent)

| close(self)

(3) urllib.request.AbstractBasicAuthHandler

class AbstractBasicAuthHandler(builtins.object)

| Methods defined here:

| __init__(self, password_mgr=None)

| Initialize self. See help(type(self)) for accurate signature.

| http_error_auth_reqed(self, authreq, host, req, headers)

| http_request(self, req)

| http_response(self, req, response)

| https_response = http_response(self, req, response)

| retry_http_basic_auth(self, host, req, realm)

(4) request.HTTPBasicAuthHandler

class HTTPBasicAuthHandler(AbstractBasicAuthHandler, BaseHandler)

| Method resolution order:

| HTTPBasicAuthHandler

| AbstractBasicAuthHandler

| BaseHandler

| builtins.object

| Methods defined here:

| http_error_401(self, req, fp, code, msg, headers)

| ----------------------------------------------------------------------

| Data and other attributes defined here:

| auth_header = 'Authorization'

| ----------------------------------------------------------------------

| Methods inherited from AbstractBasicAuthHandler:

| __init__(self, password_mgr=None)

| Initialize self. See help(type(self)) for accurate signature.

| http_error_auth_reqed(self, authreq, host, req, headers)
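
The help output shows the class hierarchy but not how the handler is used. As a sketch under stated assumptions (the site http://example.com/private/ and the credentials alice/secret are made up), a typical setup registers credentials with a password manager, wraps it in the handler, and builds an opener from it. Nothing here contacts the network:

```python
from urllib import request

# The password manager stores credentials per realm and URL prefix.
# Passing None as the realm makes the entry apply to any realm.
mgr = request.HTTPPasswordMgrWithDefaultRealm()
mgr.add_password(None, 'http://example.com/private/', 'alice', 'secret')

# password_mgr is the constructor argument shown in the help output above.
handler = request.HTTPBasicAuthHandler(mgr)
opener = request.build_opener(handler)

# The method resolution order matches the listing above.
mro_names = [cls.__name__ for cls in request.HTTPBasicAuthHandler.__mro__]
print(mro_names)

# The manager returns the stored credentials for URLs under the prefix.
creds = mgr.find_user_password('Any Realm', 'http://example.com/private/page')
print(creds)
```

Calling opener.open(url) on a protected URL would then let the handler answer a 401 response with these credentials; request.install_opener(opener) makes plain urlopen calls use it as well.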

Time: 2024-10-26 10:29:53
