Python 3 Web Crawler Development in Practice (《Python3网络爬虫开发实战》) -- Using the Basic Libraries

1. urllib:

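request: the most basic HTTP request module, used to simulate sending requests. Just as you would type a URL into the browser and press Enter, you pass a URL and any extra parameters to the library's functions to reproduce that process.
error: the exception-handling module. It defines the exceptions raised by urllib.request, such as URLError and HTTPError.
parse: a utility module that provides many URL-processing functions, such as splitting, parsing, and joining.
robotparser: mainly used to read a site's robots.txt file and decide which pages may be crawled and which may not. It is used relatively rarely.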
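
As a rough illustration of how the first three modules fit together, here is a minimal sketch. The httpbin.org endpoint, the form data, and the helper name `fetch` are assumptions chosen for the example. Nothing is sent over the network unless `fetch` is called.

```python
from urllib import error, parse, request

# Hypothetical target; httpbin.org is only an illustrative endpoint.
url = 'http://httpbin.org/post'

# parse: split the URL into its components.
parts = parse.urlparse(url)
print(parts.scheme, parts.netloc, parts.path)

# request: describe a POST request without sending it yet.
form = parse.urlencode({'name': 'germey'}).encode('utf-8')
req = request.Request(url, data=form,
                      headers={'User-Agent': 'Mozilla/5.0'},
                      method='POST')
print(req.get_method(), req.full_url)


def fetch(req, timeout=5):
    """Send the request; return the decoded body, or None if urllib reports an error."""
    try:
        with request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode('utf-8')
    except error.HTTPError as e:   # the server answered with an error status
        print('HTTP error:', e.code, e.reason)
    except error.URLError as e:    # connection or DNS failure, and similar
        print('URL error:', e.reason)
    return None
```

HTTPError is a subclass of URLError, so it has to be caught first.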

2. Handler classes:

When you need more advanced functionality, such as cookie handling, use a Handler.

import http.cookiejar
import urllib.request

filename = 'cookies.txt'
# Alternatives: an in-memory jar, or the Mozilla/Netscape file format
# cookie = http.cookiejar.CookieJar()
# cookie = http.cookiejar.MozillaCookieJar(filename)
cookie = http.cookiejar.LWPCookieJar(filename)
# Load cookies saved by an earlier run; cookies.txt must already exist in LWP format
cookie.load(filename, ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
# for item in cookie:
#     print(item.name + "=" + item.value)

# To create cookies.txt on a first run, skip load() above and call:
# cookie.save(ignore_discard=True, ignore_expires=True)
print(response.read().decode('utf-8'))
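
Handlers can also be combined. The sketch below chains a ProxyHandler with a cookie handler in one opener. The proxy address 127.0.0.1:9743 is a hypothetical placeholder, and no request is sent until `open()` is called.

```python
import http.cookiejar
import urllib.request

# Assumption: a local proxy listens on 127.0.0.1:9743 (hypothetical address).
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'http://127.0.0.1:9743',
})
cookie_jar = http.cookiejar.CookieJar()   # in-memory jar, nothing is saved to disk
cookie_handler = urllib.request.HTTPCookieProcessor(cookie_jar)

# build_opener accepts any number of handlers and chains them together.
opener = urllib.request.build_opener(proxy_handler, cookie_handler)

# Nothing is sent until open() is called, for example:
# response = opener.open('http://httpbin.org/get')
print(type(opener).__name__)
```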

3. urljoin

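Pass a base_url (the base link) as the first argument and the new link as the second. The function analyzes the scheme, netloc, and path of base_url, fills in whatever the new link is missing, and returns the result.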
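
The three typical cases can be sketched as follows. The example.com and other.example URLs are placeholders, not from the original notes.

```python
from urllib.parse import urljoin

base_url = 'http://example.com/docs/index.html'   # hypothetical base link

# Relative path: scheme and netloc come from base_url, the last path segment is replaced.
relative = urljoin(base_url, 'faq.html')
# Absolute path: only scheme and netloc come from base_url.
absolute_path = urljoin(base_url, '/about.html')
# Complete URL: nothing is missing, so base_url is ignored.
complete = urljoin(base_url, 'https://other.example/faq.html')

print(relative)        # http://example.com/docs/faq.html
print(absolute_path)   # http://example.com/about.html
print(complete)        # https://other.example/faq.html
```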

4. urlencode()

from urllib.parse import urlencode

# Serialize a dict into GET query parameters
params = {
    'name': 'germey',
    'age': '23'
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)  # http://www.baidu.com?name=germey&age=23

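5. parse_qs

Deserialization: converts the query parameters of a GET request back into a dictionary.

from urllib.parse import parse_qs

query = 'name=germey&age=22'
print(parse_qs(query))  # {'name': ['germey'], 'age': ['22']}

parse_qsl: converts the parameters into a list of tuples.

from urllib.parse import parse_qsl

query = 'name=germey&age=22'
print(parse_qsl(query))  # [('name', 'germey'), ('age', '22')]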

6. quote

Converts content into URL-encoded (percent-encoded) form.
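
A small sketch of quote and its inverse, unquote, using a Chinese search keyword as the non-ASCII input. The Baidu search URL is only an illustrative target.

```python
from urllib.parse import quote, unquote

keyword = '壁纸'   # non-ASCII text that cannot appear in a URL as-is
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
decoded = unquote(url)

print(url)       # https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8
print(decoded)   # https://www.baidu.com/s?wd=壁纸
```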

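7. Analyzing the Robots protocol

1. The Robots protocol

The Robots protocol, also called the crawler protocol or robot protocol, is formally named the Robots Exclusion Protocol. It tells crawlers and search engines which pages may be fetched and which may not. It usually takes the form of a text file named robots.txt, generally placed in the root directory of the website.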
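
To make the file format concrete, the sketch below parses a small, made-up robots.txt with RobotFileParser from urllib.robotparser and checks two URLs against it. The rules and the example.com URLs are assumptions for illustration, and nothing is downloaded.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every crawler may read /public/ but not /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /public/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())   # parse() takes the file content as a list of lines

allowed = rp.can_fetch('*', 'http://example.com/public/index.html')
blocked = rp.can_fetch('*', 'http://example.com/private/data.html')
print(allowed, blocked)   # True False
```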

2. robotparser

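set_url: sets the link to the robots.txt file. If the link was passed in when the RobotFileParser object was created, there is no need to call this method.

read: fetches the robots.txt file and analyzes it. This method performs the actual read and analysis; if it is not called, every subsequent check returns False, so be sure to call it. It returns nothing.

parse: parses robots.txt content. The argument is a list of lines from robots.txt, which it analyzes according to the robots.txt syntax rules.

can_fetch: takes two arguments, the User-agent and the URL to fetch. It returns True or False, indicating whether that crawler may fetch the URL.

mtime: returns the time robots.txt was last fetched and analyzed. This matters for long-running crawlers, which may need to check periodically and fetch the latest robots.txt.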

modified: also useful for long-running crawlers. It sets the time robots.txt was last fetched and analyzed to the current time.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://www.jianshu.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
print(rp.can_fetch('*', 'http://www.jianshu.com/search?q=python&page=1&type=collections'))

8. requests

1. get:

import re

import requests

# Browser identification (User-Agent); without it the site refuses the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
}
r = requests.get("http://www.zhihu.com/explore", headers=headers)
# Extract the question titles from the returned HTML
pattern = re.compile('explore-feed.*?question.*?>(.*?)</a>', re.S)
titles = re.findall(pattern, r.text)
print(titles)

# Download binary content and write it to a file
r = requests.get("http://github.com/favicon.ico")
with open('favicon.ico', 'wb') as f:
    f.write(r.content)

2. post:

import requests

data = {
    'name': 'name',
    'age': '22'
}
r = requests.post("http://httpbin.org/post", data=data)
print(r.text)

# Inspect the attributes of a response
r = requests.get('http://www.zhihu.com')
print(type(r.status_code), r.status_code)  # status code
print(type(r.headers), r.headers)          # response headers
print(type(r.cookies), r.cookies)          # cookies
print(type(r.url), r.url)                  # URL
print(type(r.history), r.history)          # request history

9. Advanced usage of requests:

1. File upload:

import requests

# Upload a file as multipart/form-data; favicon.ico must exist in the working directory
with open('favicon.ico', 'rb') as f:
    r = requests.post("http://httpbin.org/post", files={'file': f})
print(r.text)

2. cookies:

import requests

# Read the cookies set by the server
r = requests.get("http://www.baidu.com")
print(r.cookies)
for key, value in r.cookies.items():
    print(key + '=' + value)

# Send cookies copied from a logged-in browser session; the request header is named Cookie
headers = {
    'Cookie': 'tst=r; __utma=51854390.2112264675.1539419567.1539419567.1539433913.2; __utmb=51854390.0.10.1539433913; __utmc=51854390; __utmv=51854390.100--|2=registration_date=20160218=1^3=entry_date=20160218=1; __utmz=51854390.1539433913.2.2.utmcsr=zhihu.com|utmccn=(referral)|utmcmd=referral|utmcct=/; tgw_l7_route=e0a07617c1a38385364125951b19eef8; q_c1=d3c7341e344d460ead79171d4fd56f6f|1539419563000|1516290905000; _xsrf=713s0UsLfr6m5Weplwb4offGhSqnugCy; z_c0="2|1:0|10:1533128251|4:z_c0|92:Mi4xS2VDaEFnQUFBQUFBZ09DVGo1ZUtEU1lBQUFCZ0FsVk5PX3hPWEFEVXNtMXhSbmhjbG5NSjlHQU9naEpLbkwxYlpB|e71c25127cfb23241089a277f5d7c909165085f901f9d58cf93c5d7ec7420217"; d_c0="AIDgk4-Xig2PTlryga7LwT30h_-3DUHnGbc=|1525419053"; __DAYU_PP=zYA2JUmBnVe2bBjq7qav2ac8d8025bbd; _zap=d299f20c-20cc-4202-a007-5dd6863ccce9',
    'Host': 'www.zhihu.com',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
}
r = requests.get("http://www.zhihu.com", headers=headers)
print(r.text)

3. Session persistence:

import requests

# Two independent requests.get calls do not share cookies
requests.get("http://httpbin.org/cookies/set/umber/123456789")
r = requests.get("http://httpbin.org/cookies")
print(r.text)

# A Session keeps cookies across requests
s = requests.Session()
s.get("http://httpbin.org/cookies/set/umber/123456789")
r = s.get('http://httpbin.org/cookies')
print(r.text)

Output:

{
  "cookies": {}
}

{
  "cookies": {
    "umber": "123456789"
  }
}

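4. SSL certificate verification

requests also provides certificate verification. When sending an HTTPS request, it checks the SSL certificate, and the verify parameter controls whether that check is performed. If verify is omitted it defaults to True, so the certificate is verified automatically.

import logging

import requests
# from requests.packages import urllib3

# Route the InsecureRequestWarning into logging instead of printing it
logging.captureWarnings(True)
# Alternative: suppress the warning entirely
# urllib3.disable_warnings()
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)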

5. Proxy settings
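
The notes give no code for this item. A minimal sketch, assuming a proxy at the hypothetical address 127.0.0.1:9743, passes a proxies dictionary to requests.get. The helper name `fetch_via_proxy` is made up for illustration.

```python
import requests

# Assumption: the proxy address below is a hypothetical placeholder.
proxies = {
    'http': 'http://127.0.0.1:9743',
    'https': 'http://127.0.0.1:9743',
}


def fetch_via_proxy(url, timeout=5):
    """Send a GET request through the proxies defined above."""
    return requests.get(url, proxies=proxies, timeout=timeout)


# Example call (needs a proxy actually listening on that address):
# print(fetch_via_proxy('http://httpbin.org/get').status_code)
```

The dictionary keys are the URL schemes of the requests to be proxied, so http and https traffic can be routed to different proxies.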

6. Timeout settings

r = requests.get('http://www.taobao.com', timeout=1)
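
When no response arrives within the timeout, requests raises an exception. A sketch of catching it follows. The helper name `fetch_with_timeout` and the default values are assumptions for illustration.

```python
import requests


def fetch_with_timeout(url, timeout=(3, 10)):
    """Return the response, or None if a timeout expires.

    timeout may be a single number or a (connect, read) tuple, in seconds.
    """
    try:
        return requests.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Covers both ConnectTimeout and ReadTimeout.
        print('Request timed out:', url)
        return None


# Example call:
# r = fetch_with_timeout('http://www.taobao.com', timeout=1)
```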

7. Authentication

import requests
from requests.auth import HTTPBasicAuth

# HTTP Basic authentication; auth=('username', 'password') is an equivalent shorthand
r = requests.get('http://localhost:5000', auth=HTTPBasicAuth('username', 'password'))
print(r.status_code)

8. Prepared Request: representing a request as a data structure

from requests import Request, Session

url = 'http://httpbin.org/post'
data = {
    'name': 'germey'
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
}
s = Session()
# Build the request as an object, prepare it, then send it through the session
req = Request('POST', url, data=data, headers=headers)
prepped = s.prepare_request(req)
r = s.send(prepped)
print(r.text)

10. Regular expressions:

https://www.cnblogs.com/chengchengaqin/p/9708044.html

Original article: https://www.cnblogs.com/chengchengaqin/p/9784229.html

Time: 2024-11-08 04:37:23
