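I. Scraping a Web Page Anytime, Anywhere
How do you scrape a web page? Anyone familiar with web development knows the flow: the browser requests a URL from the server, and the server responds with a pile of HTML, including HTML tags, CSS styles, JavaScript, and so on. Previously we did this with urllib from the Python standard library;
now we will write a script with Requests, Python's HTTP library, to start scraping pages. Requests has a bold slogan: "HTTP for Humans".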
II. Basic Usage of the Python Requests Library
1. GET and POST Requests
GET request
    import requests

    payload = {"t": "b", "w": "Python urllib"}
    response = requests.get('http://zzk.cnblogs.com/s', params=payload)
    # print(response.url)  # prints http://zzk.cnblogs.com/s?w=Python+urllib&t=b&AspxAutoDetectCookieSupport=1
    print(response.text)
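With a Requests GET request there is no need to call urlencode() on the dict and manually append the result to the URL before sending. The get() method does both for whatever is passed as params.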
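To make the point about params concrete, here is a small sketch that compares a manually built URL with the one Requests builds. It only prepares the request and never sends it, so it runs without network access. The variable names (base_url, manual_url, prepared) are illustrative and not part of the original article.

```python
from urllib.parse import urlencode

import requests

payload = {"t": "b", "w": "Python urllib"}
base_url = "http://zzk.cnblogs.com/s"

# Manual approach: encode the dict and append it to the URL yourself.
manual_url = base_url + "?" + urlencode(payload)

# Requests approach: build (but do not send) the request and inspect its URL.
prepared = requests.Request("GET", base_url, params=payload).prepare()

print(manual_url)
print(prepared.url)
```

Both lines print the same URL, which is the work get() saves you.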
POST request
    import requests

    payload = {"t": "b", "w": "Python urllib"}
    response = requests.post('http://zzk.cnblogs.com/s', data=payload)
    print(response.text)  # u'......'
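With a Requests POST request there is no need to call urlencode() on the dict and then encode() the resulting string into bytes before passing it as the request data. On the response side, the content attribute returns bytes, while the text attribute returns an already decoded Unicode string, so you do not have to decode() the returned bytes yourself. (The raw attribute exposes the underlying response stream and is meant to be used together with stream=True.)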
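As a rough illustration of what Requests does with a dict passed as data, the sketch below prepares a POST request without sending it and inspects the body and the Content-Type header that were generated. The variable name prepared is illustrative.

```python
import requests

payload = {"t": "b", "w": "Python urllib"}

# Build the request without sending it, so no network access is needed.
prepared = requests.Request("POST", "http://zzk.cnblogs.com/s", data=payload).prepare()

print(prepared.body)                     # the form-encoded request body
print(prepared.headers["Content-Type"])  # set automatically for dict data
```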
Compared with Python's urllib, Requests is simpler and easier to use.
2. Setting Request Headers
    import requests

    payload = {"t": "b", "w": "Python urllib"}
    # The header name is 'User-Agent' (with a hyphen); 'user_agent' would be sent as a different, unrelated header.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
    response = requests.get('http://zzk.cnblogs.com/s', params=payload, headers=headers)
    print(response.request.headers)
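Request headers for get() are set by passing a dict as the headers argument. response.headers holds the headers of the server's response, and response.request.headers holds the headers the client actually sent.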
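One detail worth seeing in code: header lookup in Requests is case-insensitive, but a hyphen and an underscore are still different names. The sketch below prepares a request without sending it. The User-Agent value "my-crawler/0.1" is a made-up placeholder.

```python
import requests

headers = {"User-Agent": "my-crawler/0.1"}  # hypothetical value, for illustration only
prepared = requests.Request("GET", "http://zzk.cnblogs.com/s", headers=headers).prepare()

# Lookup ignores case...
print(prepared.headers["user-agent"])
# ...but 'user_agent' (underscore) is a different header name altogether.
print("user_agent" in prepared.headers)
```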
3. Setting Session Cookies
    import requests

    cookies = {'cookies_are': 'working'}
    response = requests.get('http://zzk.cnblogs.com/', cookies=cookies)
    print(response.text)
Besides a plain dict, the cookies parameter of requests.get() also accepts a more elaborate RequestsCookieJar object, which lets you specify the domain and path of each cookie.
    import requests
    import requests.cookies

    cookieJar = requests.cookies.RequestsCookieJar()
    # The domain and path must match the requested URL, otherwise the cookie is not sent.
    cookieJar.set('cookies_are', 'working', domain='zzk.cnblogs.com', path='/')
    response = requests.get('http://zzk.cnblogs.com/', cookies=cookieJar)
    print(response.text)
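The domain and path attributes matter because one jar can hold cookies with the same name for different sites. A minimal offline sketch follows. The cookie name "token", its values, and the example.com domain are all made up for illustration.

```python
from requests.cookies import RequestsCookieJar

jar = RequestsCookieJar()
# Two cookies with the same name, scoped to different domains and paths.
jar.set("token", "for-blog", domain="zzk.cnblogs.com", path="/")
jar.set("token", "for-other", domain="example.com", path="/cookies")

# Narrow the lookup by domain and path to pick the cookie you want.
blog_token = jar.get("token", domain="zzk.cnblogs.com", path="/")
other_cookies = jar.get_dict(domain="example.com", path="/cookies")

print(blog_token)
print(other_cookies)
```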
4. Setting a Timeout
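    import requests

    # A 0.001 s timeout is almost certain to expire, so expect requests.exceptions.Timeout here.
    try:
        response = requests.get('http://zzk.cnblogs.com/', timeout=0.001)
        print(response.text)
    except requests.exceptions.Timeout as err:
        print('Request timed out:', err)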
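To watch a timeout fire without depending on a remote site, the sketch below starts a deliberately slow local server and requests it twice, using the (connect, read) tuple form of timeout. SlowHandler and fetch are hypothetical helpers written for this demo, not part of Requests, and the one-second delay is arbitrary.

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class SlowHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that waits one second before answering."""

    def do_GET(self):
        time.sleep(1.0)
        try:
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", "4")
            self.end_headers()
            self.wfile.write(b"late")
        except OSError:
            pass  # the client has already given up and closed the socket

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch(url, timeout):
    """Return the body text, or None if the request timed out."""
    with requests.Session() as session:
        session.trust_env = False  # ignore proxy settings so the local server is reached directly
        try:
            return session.get(url, timeout=timeout).text
        except requests.exceptions.Timeout:
            return None


server = HTTPServer(("127.0.0.1", 0), SlowHandler)  # port 0: let the OS pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

too_short = fetch(url, timeout=(3.05, 0.2))  # read timeout shorter than the server delay
long_enough = fetch(url, timeout=(3.05, 5))  # read timeout longer than the server delay

server.shutdown()
server.server_close()
print(too_short, long_enough)
```

The first value in the tuple limits how long connecting may take, the second how long the client waits for the server to send data.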
III. Advanced Usage of the Python Requests Library
1. Session Objects
    from requests import Session

    s = Session()

    s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
    r = s.get('http://httpbin.org/cookies')

    print(r.text)
    # '{"cookies": {"sessioncookie": "123456789"}}'
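With a Session, cookies are carried across multiple requests, but only to the domain they belong to; requests to other domains do not carry them. For pages that require a login, the login state can be saved at login time and then attached when visiting other pages.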
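The same-domain rule can be checked offline by preparing requests through a session and looking at the Cookie header that would be sent. This is only a sketch: the cookie name "sessionid", its value, and the example.com / example.org hosts are placeholders standing in for a real login.

```python
from requests import Request, Session

session = Session()
# Pretend a login response stored this cookie (hypothetical name, value and domain).
session.cookies.set("sessionid", "abc123", domain="example.com", path="/")

# prepare_request() merges the session's cookies into the request without sending it.
same_domain = session.prepare_request(Request("GET", "http://example.com/profile"))
other_domain = session.prepare_request(Request("GET", "http://example.org/profile"))

print(same_domain.headers.get("Cookie"))   # cookie attached
print(other_domain.headers.get("Cookie"))  # no cookie for a different domain
```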
2. Prepared Requests
    from requests import Request, Session

    url = 'http://zzk.cnblogs.com/s'
    payload = {"t": "b", "w": "Python urllib"}
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
        'Content-Type': 'application/x-www-form-urlencoded'
    }
    s = Session()
    request = Request('GET', url, headers=headers, data=payload)
    prepped = request.prepare()

    # do something with prepped.headers
    del prepped.headers['Content-Type']
    response = s.send(prepped, timeout=3)
    print(response.request.headers)
The object returned by a Request's prepare() method lets you do extra work before the request is sent, such as updating the request body or the request headers.
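As a sketch of updating the body before sending, the example below appends a form field to a prepared POST request and keeps Content-Length in sync. It uses Session.prepare_request(), which also applies session-level state such as default headers and cookies. The X-Request-Source header is a made-up name, and the send call is left commented out so the snippet runs offline.

```python
from requests import Request, Session

session = Session()
request = Request("POST", "http://zzk.cnblogs.com/s", data={"w": "Python urllib"})
prepped = session.prepare_request(request)

# Add a header (hypothetical name) and extend the form-encoded body.
prepped.headers["X-Request-Source"] = "tutorial"
prepped.body = prepped.body + "&t=b"
# The body changed, so Content-Length has to be updated by hand.
prepped.headers["Content-Length"] = str(len(prepped.body))

print(prepped.body)
# response = session.send(prepped, timeout=3)
```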
IV. Practical Applications of the Python Requests Library
1. Wrapping GET Requests
    import requests

    # Method of a crawler class (the class definition is not shown here).
    def do_get_request(self, url, headers=None, timeout=3, is_return_text=True, num_retries=2):
        if url is None:
            return None
        print('Downloading:', url)
        if headers is None:  # default request headers
            headers = {
                'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
        response = None
        try:
            response = requests.get(url, headers=headers, timeout=timeout)

            # Raises requests.exceptions.HTTPError on a 4XX client error or 5XX server error.
            response.raise_for_status()
            if response.status_code == requests.codes.ok:
                if is_return_text:
                    html = response.text
                else:
                    html = response.json()
            else:
                html = None
        except requests.Timeout as err:
            print('Downloading Timeout:', err.args)
            html = None
        except requests.HTTPError as err:
            print('Downloading HTTP Error, msg: {0}'.format(err.args))
            html = None
            # A server error caused the failure: retry (2 times by default),
            # passing along the caller's timeout and return-type settings.
            if num_retries > 0 and 500 <= response.status_code < 600:
                return self.do_get_request(url, headers=headers, timeout=timeout,
                                           is_return_text=is_return_text,
                                           num_retries=num_retries - 1)
        except requests.ConnectionError as err:
            print('Downloading Connection Error:', err.args)
            html = None

        return html
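The wrapper above retries by calling itself. A different technique, shown here only as an alternative sketch and not as what the article does, is to let the transport layer retry by mounting an HTTPAdapter configured with urllib3's Retry. The helper name build_retrying_session and the chosen status codes and backoff value are assumptions for illustration.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_retrying_session(total=2, backoff_factor=0.5):
    """Return a Session that retries on common 5XX responses (hypothetical helper)."""
    retry = Retry(
        total=total,                            # at most `total` retries per request
        status_forcelist=[500, 502, 503, 504],  # retry when the server answers with these codes
        backoff_factor=backoff_factor,          # wait progressively longer between attempts
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


session = build_retrying_session()
# html = session.get("http://zzk.cnblogs.com/", timeout=3).text
```

Note that with urllib3's default settings, status-based retries apply to idempotent methods such as GET but not to POST.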
2. Wrapping POST Requests
    import requests

    # Method of a crawler class (the class definition is not shown here).
    def do_post_request(self, url, data=None, headers=None, timeout=3, is_return_text=True, num_retries=2):
        if url is None:
            return None
        print('Downloading:', url)
        # If there is no request data, return immediately.
        if data is None:
            return None
        if headers is None:
            headers = {
                'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
        response = None
        try:
            response = requests.post(url, data=data, headers=headers, timeout=timeout)

            # Raises requests.exceptions.HTTPError on a 4XX client error or 5XX server error.
            response.raise_for_status()
            if response.status_code == requests.codes.ok:
                if is_return_text:
                    html = response.text
                else:
                    html = response.json()
            else:
                html = None
        except requests.Timeout as err:
            print('Downloading Timeout:', err.args)
            html = None
        except requests.HTTPError as err:
            print('Downloading HTTP Error, msg: {0}'.format(err.args))
            html = None  # the original had a trailing comma here, which produced the tuple (None,)
            # A server error caused the failure: retry (2 times by default),
            # passing along the caller's timeout and return-type settings.
            if num_retries > 0 and 500 <= response.status_code < 600:
                return self.do_post_request(url, data=data, headers=headers, timeout=timeout,
                                            is_return_text=is_return_text,
                                            num_retries=num_retries - 1)
        except requests.ConnectionError as err:
            print('Downloading Connection Error:', err.args)
            html = None

        return html
3. Login-State Cookies
    import pickle

    import requests

    def save_cookies(requests_cookiejar, filename):
        with open(filename, 'wb') as f:
            pickle.dump(requests_cookiejar, f)

    def load_cookies(filename):
        with open(filename, 'rb') as f:
            return pickle.load(f)

    url = 'http://zzk.cnblogs.com/'  # example URL; the original leaves url undefined
    filename = 'cookies.pkl'         # hypothetical file name; the original leaves filename undefined

    # save request cookies
    r = requests.get(url)
    save_cookies(r.cookies, filename)

    # load cookies and do a request
    requests.get(url, cookies=load_cookies(filename))
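Unpickling a file can execute arbitrary code, so pickle is only safe for cookie files you wrote yourself. As an alternative sketch (not the article's method), cookies can be stored as JSON using the dict helpers in requests.utils. This keeps only names and values and drops domain, path, and expiry. The function names save_cookies_json and load_cookies_json, and the cookie used in the demo, are hypothetical.

```python
import json
import os
import tempfile

import requests
from requests.cookies import RequestsCookieJar


def save_cookies_json(cookiejar, filename):
    """Write cookie names and values to a JSON file (domain, path, expiry are lost)."""
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(requests.utils.dict_from_cookiejar(cookiejar), f)


def load_cookies_json(filename):
    """Read the JSON file back into a cookie jar usable as cookies=..."""
    with open(filename, "r", encoding="utf-8") as f:
        return requests.utils.cookiejar_from_dict(json.load(f))


# Offline round trip with a made-up cookie, written to a temporary directory.
jar = RequestsCookieJar()
jar.set("sessionid", "abc123")
path = os.path.join(tempfile.mkdtemp(), "cookies.json")

save_cookies_json(jar, path)
restored = load_cookies_json(path)
print(restored.get("sessionid"))
```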