Scraping data with a Python crawler

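Ways to implement the URL manager:
1. In memory
Python process memory
URLs waiting to be crawled: set()
URLs already crawled: set()

2. Relational database
MySQL
urls(url, is_crawled)

3. Cache database (high performance; the storage choice of large companies)
redis
URLs waiting to be crawled: set
URLs already crawled: set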
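
The three storage options above share one contract: keep a to-crawl set and a crawled set, and never hand out the same URL twice. As a minimal sketch of that contract, the Python 3 code below implements it once in memory and once against a urls(url, is_crawled) table. The class and function names (MemoryUrlManager, SqlUrlManager, add_new_url, has_new_url, get_new_url, drain) are hypothetical, and sqlite3 stands in for MySQL only so that the example runs without a database server.

```python
import sqlite3


class MemoryUrlManager:
    """In-memory URL manager backed by two Python sets."""

    def __init__(self):
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs already crawled

    def add_new_url(self, url):
        # Skip empty URLs and URLs that are already known.
        if url and url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new_url(self):
        return len(self.new_urls) > 0

    def get_new_url(self):
        # set.pop() raises KeyError when nothing is left to crawl.
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url


class SqlUrlManager:
    """Same interface on top of a urls(url, is_crawled) table.

    Assumption: sqlite3 replaces MySQL here; the SQL would need small
    changes for MySQL (for example INSERT IGNORE and %s placeholders).
    """

    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS urls ("
            "url TEXT PRIMARY KEY, is_crawled INTEGER NOT NULL DEFAULT 0)"
        )

    def add_new_url(self, url):
        if url:
            # The primary key rejects duplicates; OR IGNORE keeps that silent.
            self.conn.execute(
                "INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,)
            )

    def has_new_url(self):
        row = self.conn.execute(
            "SELECT 1 FROM urls WHERE is_crawled = 0 LIMIT 1"
        ).fetchone()
        return row is not None

    def get_new_url(self):
        row = self.conn.execute(
            "SELECT url FROM urls WHERE is_crawled = 0 LIMIT 1"
        ).fetchone()
        if row is None:
            raise KeyError("no uncrawled URL left")
        self.conn.execute(
            "UPDATE urls SET is_crawled = 1 WHERE url = ?", (row[0],)
        )
        return row[0]


def drain(manager, urls):
    """Add the URLs, then pull them out until none are left."""
    for url in urls:
        manager.add_new_url(url)
    crawled = []
    while manager.has_new_url():
        crawled.append(manager.get_new_url())
    return crawled


seeds = ["http://example.com/a", "http://example.com/b", "http://example.com/a"]
print(sorted(drain(MemoryUrlManager(), seeds)))
print(sorted(drain(SqlUrlManager(sqlite3.connect(":memory:")), seeds)))
```

With redis, the same two sets would map onto the SADD and SISMEMBER commands; that variant is left out here because it needs a running server.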

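Web page downloader
urllib2: the official basic module in Python 2 (in Python 3 its functionality lives in urllib.request and urllib.error)
requests: a third-party package, more powerful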

import urllib2

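Downloading a page with urllib2, method 1:
###########################
# Request the URL directly
response = urllib2.urlopen('http://www.baidu.com')

# Get the status code; 200 means the request succeeded
print(response.getcode())

# Read the response body
cont = response.read()

############################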

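Downloading a page with urllib2, method 2:
add data and an HTTP header

############################
import urllib
import urllib2

url = 'http://www.baidu.com'

# Create the Request object
request = urllib2.Request(url)

# Add data: add_data takes one URL-encoded string and turns the request into a POST
request.add_data(urllib.urlencode({'a': '1'}))
# Add an HTTP header
request.add_header('User-Agent', 'Mozilla/5.0')

# Send the request and get the result
response = urllib2.urlopen(request)
############################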
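
urllib2 exists only in Python 2. As a hedged sketch of the same idea on Python 3, the code below builds the request with urllib.request and urllib.parse from the standard library. The helper names build_request and fetch are hypothetical, and fetch is defined but not called, so running the example does not touch the network.

```python
from urllib import parse, request


def build_request(url, params, user_agent="Mozilla/5.0"):
    """Build a request that carries form data and a User-Agent header."""
    # The request body must be bytes, so URL-encode the dict and encode it.
    data = parse.urlencode(params).encode("utf-8")
    # Because data is present, urllib sends this request as a POST.
    req = request.Request(url, data=data)
    req.add_header("User-Agent", user_agent)
    return req


def fetch(req, timeout=10):
    """Send the request; return the body bytes on HTTP 200, else None."""
    with request.urlopen(req, timeout=timeout) as response:
        if response.status != 200:
            return None
        return response.read()


req = build_request("http://www.baidu.com", {"a": "1"})
# Request stores header names capitalized, hence 'User-agent' in the lookup.
print(req.get_method(), req.data, req.get_header("User-agent"))
```

Calling fetch(req) would then send the request and return the page content.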

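Downloading a page with urllib2, method 3:
add handlers for special scenarios
HTTPCookieProcessor: pages that need cookies (for example, after login)
ProxyHandler: pages that must be accessed through a proxy
HTTPSHandler: pages served over HTTPS
HTTPRedirectHandler: pages that redirect to another URL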
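
No code accompanies method 3, so here is a minimal sketch of how such handlers are combined into an opener. It uses the Python 3 names (urllib.request and http.cookiejar) rather than urllib2 and cookielib; the function name build_cookie_opener and the proxy address are hypothetical, and nothing is sent over the network.

```python
from http import cookiejar
from urllib import request


def build_cookie_opener(proxy=None):
    """Return an opener that keeps cookies and optionally uses a proxy."""
    jar = cookiejar.CookieJar()  # stores cookies between requests
    handlers = [request.HTTPCookieProcessor(jar)]
    if proxy:
        # Route both http and https traffic through the given proxy.
        handlers.append(request.ProxyHandler({"http": proxy, "https": proxy}))
    # build_opener also adds default handlers, including HTTPRedirectHandler
    # and (when SSL support is available) HTTPSHandler.
    opener = request.build_opener(*handlers)
    return opener, jar


opener, jar = build_cookie_opener()
proxied_opener, _ = build_cookie_opener("http://127.0.0.1:8080")

# To download: response = opener.open(url)
# To make plain urlopen() use it: request.install_opener(opener)
```

After a page has been fetched through the opener, the cookies set by the server are available in jar and are sent back automatically on later requests.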

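In summary, urllib2 offers three ways to download a page: call urlopen directly, pass it a Request carrying data and headers, or build an opener from handlers.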

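Web page parser
A tool for extracting valuable data from web pages
1. Regular expressions (complex; fuzzy string matching)
2. html.parser (built into Python)
3. Beautiful Soup (third-party library, powerful)
4. lxml (third-party library)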
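
To make the difference between fuzzy regex matching and structured parsing concrete, the sketch below extracts link targets from the same snippet in both ways, using only the standard library (Python 3 names). The sample HTML and the class name LinkCollector are made up for illustration.

```python
import re
from html.parser import HTMLParser

html_doc = """
<p>
  <a href="/view/123.htm">Python</a>
  <a class="x" href='/view/456.htm'>Crawler</a>
</p>
"""

# Regular expression: short, but it depends on how the tag happens to be
# written (quoting, spacing) and breaks easily on unusual markup.
regex_links = re.findall(r'<a\s[^>]*?href=["\']([^"\']+)["\']', html_doc)


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with quotes already removed.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


collector = LinkCollector()
collector.feed(html_doc)

print(regex_links)
print(collector.links)
```

Both approaches find the same two links here; the parser version keeps working when attribute order or quoting changes, which is the main reason to prefer a structured parser.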

Beautiful Soup
A third-party Python library for extracting data from HTML or XML
Official site: https://www.crummy.com/software/BeautifulSoup/

Installing Beautiful Soup
pip install beautifulsoup4

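Beautiful Soup syntax
1. Create a BeautifulSoup object from the HTML page
2. Search for nodes with find_all or find (by node name, attribute value, or node text)
3. Access the name, attributes, and text of the nodes found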

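# Create the BeautifulSoup object
from bs4 import BeautifulSoup

# Create a BeautifulSoup object from the HTML document string
soup = BeautifulSoup(
    html_doc,              # HTML document string
    'html.parser',         # HTML parser
    from_encoding='utf8'   # encoding of the HTML document (only used when html_doc is a byte string)
)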

# Search for nodes (find_all, find)
# Signature: find_all(name, attrs, string)

# Find all nodes with tag a
soup.find_all('a')

# Find all nodes with tag a whose href is exactly /view/123.htm
soup.find_all('a', href='/view/123.htm')

# Example node: <a href='123.htm' class='abc'>Python</a>

# Find all nodes with tag div, class abc, and text Python
soup.find_all('div', class_='abc', string='Python')

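Accessing node information:
# Given the node: <a href='1.html'>Python</a>

# Get the tag name of the node found
node.name

# Get the href attribute of the a node found
node['href']

# Get the link text of the a node found
node.get_text()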
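
The three steps (create the object, search, access) can be tried end to end with the sketch below. It assumes the beautifulsoup4 package is installed; the sample HTML is made up, and the regular-expression filter on href is an addition to show how to match the /view/<number>.htm form rather than one exact link.

```python
import re

from bs4 import BeautifulSoup

html_doc = """
<html><body>
<div class="abc">Python</div>
<a href="/view/123.htm" class="abc">Python</a>
<a href="/view/456.htm">Crawler</a>
<a href="/about.htm">About</a>
</body></html>
"""

# Step 1: create the BeautifulSoup object with the built-in parser.
soup = BeautifulSoup(html_doc, "html.parser")

# Step 2: search for nodes.
all_links = soup.find_all("a")                            # every <a> tag
exact_links = soup.find_all("a", href="/view/123.htm")    # exact href match
view_links = soup.find_all("a", href=re.compile(r"^/view/\d+\.htm$"))
divs = soup.find_all("div", class_="abc", string="Python")

# Step 3: access the name, attributes, and text of a node.
node = soup.find("a", href="/view/123.htm")
print(node.name, node["href"], node.get_text())
```

Note that class_ has a trailing underscore because class is a reserved word in Python.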

Time: 2024-08-27 13:15:12
