How Python accesses the Internet
URL + lib --> urllib
The general format of a URL is
protocol://hostname[:port]/path/[;parameters][?query]#fragment
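A URL consists of three parts
The first part is the protocol: http, https, ftp, file, ed2k, ...
The second part is the domain name or IP address of the server that hosts the resource (sometimes it must include a port number; every transfer protocol has a default port, e.g. the default port for http is 80)
The third part is the specific location of the resource, such as a directory or file name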
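The parts described above can be inspected with urllib.parse.urlparse. This is a minimal sketch; the URL is a made-up example (www.example.com, port 8080 and the path are assumptions, not taken from these notes).

```python
from urllib.parse import urlparse

# Hypothetical URL that exercises every component of the general format.
parts = urlparse("http://www.example.com:8080/docs/index.html;type=a?lang=en#top")

print(parts.scheme)    # protocol
print(parts.hostname)  # host name without the port
print(parts.port)      # port as an int, or None when the URL omits it
print(parts.path)      # location of the resource on the server
print(parts.params)    # ;parameters of the last path segment
print(parts.query)     # ?query
print(parts.fragment)  # #fragment
```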
urllib contains four modules
urllib.request for opening and reading URLs
urllib.error containing the exceptions raised by urllib.request
urllib.parse for parsing URLs
urllib.robotparser for parsing robots.txt files
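As a small illustration of the last module, the sketch below feeds urllib.robotparser a robots.txt held in memory, so nothing is downloaded. The rules and the example.com URLs are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied as a list of lines.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)

# can_fetch(useragent, url) reports whether the rules permit the request.
allowed = parser.can_fetch("*", "http://example.com/index.html")
blocked = parser.can_fetch("*", "http://example.com/private/data.html")
print(allowed, blocked)
```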
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False)
Open the URL url, which can be either a string or a Request object.
    >>> import urllib.request
    >>> response = urllib.request.urlopen('http://www.weparts.net')
    >>> html = response.read()
    >>> print(html.decode('utf-8'))
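In practice a request can fail or hang, so it may help to wrap urlopen with a timeout and catch urllib.error.URLError. The helper below is only a sketch: the name fetch_text and its defaults (10 second timeout, UTF-8 decoding) are assumptions, not part of urllib.

```python
import urllib.error
import urllib.request


def fetch_text(url, timeout=10, encoding="utf-8"):
    """Return the decoded body of url, or None if the request fails.

    fetch_text is a hypothetical helper written for these notes.
    """
    try:
        # The with-block closes the response even if read() raises.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode(encoding)
    except urllib.error.URLError as exc:
        # HTTPError is a subclass of URLError, so HTTP errors land here too.
        print("request failed:", exc.reason)
        return None
```

Calling fetch_text('http://www.weparts.net') would then return the page text, or None when the server cannot be reached.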
Hands-on practice
    import urllib.request

    # Download the image and save the raw bytes to a file.
    response = urllib.request.urlopen('http://placekitten.com/g/500/600')
    cat_img = response.read()
    with open('cat_500_600.jpg', 'wb') as f:
        f.write(cat_img)

The same download, passing a Request object instead of a string:

    import urllib.request

    req = urllib.request.Request('http://placekitten.com/g/500/600')
    response = urllib.request.urlopen(req)
    cat_img = response.read()
    with open('cat_500_600.jpg', 'wb') as f:
        f.write(cat_img)
    >>> response.geturl()
    'http://placekitten.com/g/500/600'
    >>> response.info()
    <http.client.HTTPMessage object at 0x...>
    >>> print(response.info())
    Date: Thu, 14 Sep 2017 08:10:46 GMT
    Content-Type: image/jpeg
    Content-Length: 26590
    Connection: close
    Set-Cookie: __cfduid=dc52691cf479658e05d15824990dabeb11505376646; expires=Fri, 14-Sep-18 08:10:46 GMT; path=/; domain=.placekitten.com; HttpOnly
    Accept-Ranges: bytes
    X-Powered-By: PleskLin
    Access-Control-Allow-Origin: *
    Cache-Control: public
    Expires: Thu, 31 Dec 2020 20:00:00 GMT
    Server: cloudflare-nginx
    CF-RAY: 39e1df2a94ee77a2-LAX
    >>> response.getcode()
    200
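data
The urllib.parse.urlencode() function takes a mapping or a sequence of 2-tuples and returns a string in application/x-www-form-urlencoded format. The string must be encoded to bytes before it is passed to urlopen() as data.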
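A quick look at what urlencode produces for both kinds of input may make the format clearer. The field values here are made up for illustration.

```python
from urllib.parse import urlencode

# A mapping: each key/value pair becomes key=value, joined with '&'.
form = {"i": "I love Python", "doctype": "json"}
query = urlencode(form)
print(query)  # i=I+love+Python&doctype=json

# A sequence of 2-tuples allows the same key to appear more than once.
pairs = urlencode([("smartresult", "dict"), ("smartresult", "rule")])
print(pairs)  # smartresult=dict&smartresult=rule

# urlopen() expects bytes for data, so encode the string first.
body = query.encode("utf-8")
```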
    import urllib.parse
    import urllib.request

    url = 'http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule&sessionFrom='

    # Form fields sent with the POST request.
    data = {}
    data['i'] = 'I love Junjie'
    data['from'] = 'AUTO'
    data['to'] = 'AUTO'
    data['smartresult'] = 'dict'
    data['client'] = 'fanyideskweb'
    data['salt'] = '1505376958945'
    data['sign'] = '86bb3d2294c81c8d6718e800f939bf45'
    data['doctype'] = 'json'
    data['version'] = '2.1'
    data['keyfrom'] = 'fanyi.web'
    data['action'] = 'FY_BY_CLICKBUTTION'
    data['typoResult'] = 'true'

    # urlencode() returns a str; urlopen() needs bytes.
    data = urllib.parse.urlencode(data).encode('utf-8')

    response = urllib.request.urlopen(url, data)
    html = response.read().decode('utf-8')
    print(html)
    import json
    json.loads(html)  # the result is a dictionary
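To show what json.loads gives back without contacting the service, the sketch below parses a hand-written JSON string. The payload and its keys are hypothetical; the real response layout is not shown in these notes.

```python
import json

# Hypothetical JSON text standing in for the server's reply.
sample = '{"errorCode": 0, "translateResult": [[{"src": "hello", "tgt": "bonjour"}]]}'

result = json.loads(sample)

# JSON objects become dicts and JSON arrays become lists.
print(type(result))
print(result["translateResult"][0][0]["tgt"])
```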
Hiding the crawler
urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
headers should be a dictionary
Headers can also be added to an existing Request object with add_header(key, val)
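Both ways of setting headers can be tried without sending anything, because a Request object only describes the request. In this sketch the URL, the User-Agent string and the Referer value are placeholders chosen for illustration.

```python
import urllib.request

url = "http://www.example.com/"  # placeholder URL

# Option 1: pass a dictionary through the headers argument.
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}
req = urllib.request.Request(url, headers=headers)

# Option 2: add a header to the existing Request object.
req.add_header("Referer", "http://www.example.com/start")

# Request stores header names capitalized, e.g. 'User-agent'.
print(req.header_items())
print(req.get_header("User-agent"))
```

The request would then be sent with urllib.request.urlopen(req), exactly as in the earlier download example.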