(Repost) A slick Python scraping package: Beautiful Soup examples

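Official documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

Compared with other ways of parsing HTML, Beautiful Soup has one very important advantage: the HTML is broken down into objects. The whole document becomes a tree of tags whose attributes read like a dictionary and whose children read like a list.

Compared with regex-based scraping, it spares you the high cost of learning regular expressions.

Compared with XPath-based scraping, it likewise saves learning time, even though XPath is already fairly simple. (The Scrapy crawler framework uses XPath.)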
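To make the "HTML becomes objects" point concrete, here is a minimal sketch. The one-line document and the variable names are made up for this illustration; only `BeautifulSoup`, `.attrs`, and `.contents` come from the library.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document, used only for this illustration.
html = '<p id="intro" class="lead big">Hello, <b>soup</b>!</p>'
soup = BeautifulSoup(html, "html.parser")

p = soup.p
attrs = p.attrs        # dict-like: attribute name -> value
children = p.contents  # list-like: direct child nodes in document order

print(attrs["id"])       # intro
print(attrs["class"])    # ['lead', 'big']  (class is multi-valued, so it is a list)
print(len(children))     # 3: the text 'Hello, ', the <b> tag, and the text '!'
print(children[1].name)  # b
```

Strictly speaking the document is a tree of `Tag` and string objects rather than plain dictionaries and lists, but day-to-day access feels the same.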

Installation

On Linux (Debian/Ubuntu) you can run the following (for Python 3 the package is named python3-bs4):

apt-get install python-bs4

You can also install it with Python's package tools (either command is enough; pip is the usual choice today):

easy_install beautifulsoup4
pip install beautifulsoup4

Usage overview

Here is how to use BeautifulSoup.

Parsing HTML is about extracting data, which mainly comes down to a few points.

1: Getting the content of a given tag.

hello, watsy hello, beautiful soup.

2: Getting the attributes of a given tag.

<a href="http://blog.csdn.net/watsy">watsy's blog</a>

3: To get either of these, you first have to locate the tag, which is what the search methods are for.
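The three points can be sketched on the link snippet shown in point 2. This is only an illustration; the variable names are arbitrary.

```python
from bs4 import BeautifulSoup

snippet = '<a href="http://blog.csdn.net/watsy">watsy\'s blog</a>'
soup = BeautifulSoup(snippet, "html.parser")

link = soup.find("a")  # point 3: locate the tag
text = link.string     # point 1: the tag's content
href = link["href"]    # point 2: one of the tag's attributes

print(text)  # watsy's blog
print(href)  # http://blog.csdn.net/watsy
```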

The examples below use the sample document from the official documentation.

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Pretty-printed output:

from bs4 import BeautifulSoup
# Name the parser explicitly; "html.parser" ships with Python.
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

Getting the content of a given tag

soup.title
# <title>The Dormouse's story</title>
soup.title.name
# 'title'
soup.title.string
# "The Dormouse's story"
soup.title.parent.name
# 'head'
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

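The example above shows four things.

1: Getting a tag

soup.title

2: Getting the tag's name

soup.title.name

3: Getting the content of the title tag

soup.title.string

4: Getting the name of the title tag's parent

soup.title.parent.name

As you can see, the API is very object-oriented.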
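One detail worth knowing about this style of access: `soup.<tagname>` returns only the first matching tag, and `None` when no such tag exists. A minimal sketch, using a trimmed-down document invented for this example:

```python
from bs4 import BeautifulSoup

# Trimmed-down document, assumed only for this illustration.
html = ("<html><head><title>The Dormouse's story</title></head>"
        "<body><p>first</p><p>second</p></body></html>")
soup = BeautifulSoup(html, "html.parser")

title = soup.title
first_p = soup.p   # only the first <p>, even though there are two
missing = soup.h1  # no <h1> in the document

print(title.name)         # title
print(title.string)       # The Dormouse's story
print(title.parent.name)  # head
print(first_p.string)     # first
print(missing)            # None
```

Because a missing tag yields `None`, chaining such as `soup.h1.string` raises `AttributeError` on documents that lack the tag, so a check for `None` is a sensible habit.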

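Extracting tag attributes

Next, how to extract attributes such as href.

soup.p['class']
# ['title']  (class can hold several values, so it comes back as a list)

To get an attribute, use

soup.tag['attribute name']

<a href="http://blog.csdn.net/watsy">watsy's blog</a>

The most common case is extracting the link from a tag like the one above.

The code is

soup.a['href']
# 'http://example.com/elsie'  (for the sample document above)

Pretty easy.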
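Square-bracket access raises `KeyError` when the attribute is absent. The sketch below contrasts it with `Tag.get`, which returns a default instead; the `title` attribute and the fallback string are chosen purely for illustration.

```python
from bs4 import BeautifulSoup

html = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
link = BeautifulSoup(html, "html.parser").a

href = link["href"]                    # raises KeyError if the attribute is missing
title = link.get("title")              # None, because the tag has no title attribute
fallback = link.get("title", "untitled")
classes = link["class"]                # multi-valued attribute -> list
all_attrs = dict(link.attrs)           # every attribute as a plain dict

print(href)      # http://example.com/elsie
print(title)     # None
print(fallback)  # untitled
print(classes)   # ['sister']
```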

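Searching and filtering

Now for the important part: searching the whole document and extracting what you need.

soup provides find and find_all for searching. Internally, find is implemented by calling find_all with limit=1, so only find_all is covered here.

def find_all(self, name=None, attrs={}, recursive=True, text=None,
             limit=None, **kwargs):

Look at the parameters.

The first is the tag name and the second is a dictionary of attributes. The third chooses whether the search recurses into all descendants or only looks at direct children. text matches against string content (newer releases name this argument string and keep text as an older alias). limit caps the number of results. **kwargs lets you pass attribute filters as keyword arguments.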
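The parameters can be tried one at a time. The following is a small sketch on a shortened version of the sample document; the variable names are arbitrary, and `class_` is the keyword Beautiful Soup uses because `class` is reserved in Python.

```python
from bs4 import BeautifulSoup

html = """
<html><body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("a")                           # name
by_attrs = soup.find_all("a", attrs={"id": "link2"})   # attrs dictionary
by_kwargs = soup.find_all("a", id="link2")             # same filter via **kwargs
by_class = soup.find_all("a", class_="sister")         # class needs the class_ spelling
first_two = soup.find_all("a", limit=2)                # stop after two matches

# recursive=False looks only at direct children of <body>.
direct_links = soup.body.find_all("a", recursive=False)
direct_paragraphs = soup.body.find_all("p", recursive=False)

print(len(by_name))                 # 3
print([a["id"] for a in by_attrs])  # ['link2']
print(len(first_two))               # 2
print(direct_links)                 # []
print(len(direct_paragraphs))       # 2
```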

Examples:

# By tag name
soup.find_all('b')
# [<b>The Dormouse's story</b>]

# By regular expression (matched against tag names)
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b
for tag in soup.find_all(re.compile("t")):
    print(tag.name)
# html
# title

# By list: any tag whose name appears in the list
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# By function: called on every tag, keeps those for which it returns True
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]

# By tag name plus CSS class
soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]

# By tag name only
soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# By attribute value
soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

# By text, using a regular expression (older releases name this argument text)
soup.find(string=re.compile("sisters"))
# 'Once upon a time there were three little sisters; and their names were\n'
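The difference between `find` and `find_all` shows up mostly in what they return when nothing matches. A minimal sketch, with a two-link document made up for this example:

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a> <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                 # a single Tag
same = soup.find_all("a", limit=1)[0]  # the same Tag, fetched the long way
no_tag = soup.find("table")            # None when nothing matches
no_tags = soup.find_all("table")       # empty list when nothing matches

print(first["id"])    # link1
print(first is same)  # True
print(no_tag)         # None
print(no_tags)        # []
```

So code that uses `find` should check for `None` before touching the result, while code that uses `find_all` can loop over the result directly.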

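Getting contents and strings

Getting a tag's string

title_tag = soup.title
title_tag.string
# "The Dormouse's story"

Note that .string returns a NavigableString rather than a plain string. In practice, convert it with str(title_tag.string) (unicode(title_tag.string) on Python 2) to get a plain string object.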
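Here is a short sketch of that conversion, assuming Python 3. A `NavigableString` still knows where it sits in the parse tree, while the converted copy is an ordinary `str` with no link back to the document.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<title>The Dormouse's story</title>", "html.parser")

raw = soup.title.string  # NavigableString, still attached to the tree
plain = str(raw)         # plain str, detached from the tree

print(isinstance(raw, NavigableString))  # True
print(raw.parent.name)                   # title
print(type(plain) is str)                # True
print(plain == raw)                      # True: same characters
```

Converting is mainly useful when the strings outlive the soup object, since a `NavigableString` keeps a reference to the whole parse tree.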

The .strings attribute returns a generator that iterates over all the text content under a tag (here, the whole soup). The exact whitespace-only strings depend on the parser in use.

for string in soup.strings:
    print(repr(string))
# "The Dormouse's story"
# '\n\n'
# "The Dormouse's story"
# '\n\n'
# 'Once upon a time there were three little sisters; and their names were\n'
# 'Elsie'
# ',\n'
# 'Lacie'
# ' and\n'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '\n\n'
# '...'
# '\n'
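The whitespace in that output is often unwanted. As a sketch, `.stripped_strings` and `get_text` are two library-provided ways to get cleaner text; the short document below is invented for the example.

```python
from bs4 import BeautifulSoup

html = "<p>Once upon a time\n<a>Elsie</a>,\n<a>Lacie</a> and\n<a>Tillie</a>;\n</p>"
soup = BeautifulSoup(html, "html.parser")

all_strings = list(soup.strings)        # every string, whitespace included
clean = list(soup.stripped_strings)     # each string stripped, empty ones skipped
text = soup.get_text(" ", strip=True)   # stripped strings joined with a space

print(all_strings[0] == "Once upon a time\n")  # True
print(clean)
# ['Once upon a time', 'Elsie', ',', 'Lacie', 'and', 'Tillie', ';']
print(text)
# Once upon a time Elsie , Lacie and Tillie ;
```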

Getting contents

.contents returns the nodes directly under a tag as a list.

head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>
head_tag.contents
# [<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# ["The Dormouse's story"]
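`.contents` only goes one level deep. The sketch below compares it with `.children` (the same nodes as an iterator) and `.descendants` (every nested node), using only the head of the sample document.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head = soup.head

contents = head.contents              # list of direct children
children = list(head.children)        # same nodes, produced lazily
descendants = list(head.descendants)  # the <title> tag, then the string inside it

print(len(contents))        # 1
print(contents[0].name)     # title
print(len(descendants))     # 2
print(repr(descendants[1])) # "The Dormouse's story"
```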

That should cover the essentials. Everything else can be learned from the documentation.

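Summary

In practice, usage mostly comes down to:

soup = BeautifulSoup(data, "html.parser")
soup.title                                          # first <title> tag
soup.p['title']                                     # title attribute of the first <p>
divs = soup.find_all('div', content='tpc_content')  # <div> tags whose content attribute is 'tpc_content'
divs[0].contents[0].string                          # string of the first child of the first match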
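Putting the pieces together, here is a minimal end-to-end sketch that collects every link on a page. The `extract_links` helper and the sample `page` are hypothetical, written for this illustration rather than taken from the original post.

```python
from bs4 import BeautifulSoup


def extract_links(html):
    """Return (text, href) pairs for every <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only tags that actually carry an href attribute.
    return [(a.get_text(strip=True), a["href"])
            for a in soup.find_all("a", href=True)]


# Hypothetical page used only to exercise the helper.
page = """
<div class="tpc_content">
<a href="http://example.com/elsie">Elsie</a>
<a name="anchor-without-href">skip me</a>
<a href="http://example.com/lacie"> Lacie </a>
</div>
"""

links = extract_links(page)
print(links)
# [('Elsie', 'http://example.com/elsie'), ('Lacie', 'http://example.com/lacie')]
```

Filtering with `href=True` avoids the `KeyError` that `a["href"]` would raise on anchors without a link target.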

Reposted from http://blog.csdn.net/watsy/article/details/14161201


Time: 2024-10-11 12:39:36
