Python Development [Modules]: BeautifulSoup

BeautifulSoup

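BeautifulSoup is a module that takes an HTML or XML string and parses it into a tree of objects. You can then use the methods it provides to quickly locate specific elements, which makes searching HTML or XML documents simple.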

1. Installation:

pip3 install beautifulsoup4
pip3 install lxml   # optional: only needed if you pass features="lxml"

2. Basic usage:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
asdf
    <div class="title">
        <b>The Dormouse's story总共</b>
        <h1>f</h1>
    </div>
<div class="story">Once upon a time there were three little sisters; and their names were
    <a  class="sister0" id="link1">Els<span>f</span>ie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</div>
ad<br/>sf
<p class="story">...</p>
</body>
</html>
"""

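# soup = BeautifulSoup(html_doc, features="lxml")      # alternative parser; requires lxml to be installed
soup = BeautifulSoup(html_doc, features="html.parser")  # Python's built-in parser
# Find the first <a> tag
tag1 = soup.find(name='a')
# Find all <a> tags
tag2 = soup.find_all(name='a')
# Find tags with id="link2" (select() takes a CSS selector and returns a list)
tag3 = soup.select('#link2')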

print(tag1)
print(tag2)
print(tag3)
# <a class="sister0" id="link1">Els<span>f</span>ie</a>
# [<a class="sister0" id="link1">Els<span>f</span>ie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
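The three lookups above differ mainly in what they return. The following is a minimal sketch of that difference, assuming a shortened, hypothetical piece of markup (sample_html) rather than the html_doc above; select_one() is not used in the original text and is shown only as the single-result counterpart of select().

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup used only for this illustration.
sample_html = """
<div class="story">
    <a class="sister" id="link1">Elsie</a>
    <a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>
</div>
"""

sample_soup = BeautifulSoup(sample_html, features="html.parser")

first_link = sample_soup.find(name="a")       # a single Tag, or None if nothing matches
all_links = sample_soup.find_all(name="a")    # a list of Tags (empty if nothing matches)
by_id = sample_soup.select("#link2")          # a list of Tags, even for an id selector
one_by_id = sample_soup.select_one("#link2")  # a single Tag, or None
missing = sample_soup.find(name="table")      # None: the sample has no <table>

print(first_link["id"])       # link1
print(len(all_links))         # 2
print(by_id[0].get_text())    # Lacie
print(one_by_id.get("href"))  # http://example.com/lacie
print(missing)                # None
```

Because find() can return None, it is usually worth checking the result before accessing attributes on it.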

3. Tag attributes and methods:

① name: the tag name

# name
tag = soup.find('a')
# <a class="sister0" id="link1">Els<span>f</span>ie</a>
name = tag.name  # get
print(name)  # a
tag.name = 'span'  # set: renames the first <a> tag to <span>
print(soup)

② attrs: the tag's attributes

# attrs
tag = soup.find('a')
# <a class="sister0" id="link1">Els<span>f</span>ie</a>
attrs = tag.attrs          # get
print(attrs)
# {'class': ['sister0'], 'id': 'link1'}
tag.attrs = {'ik': 123}    # set: replaces all existing attributes
tag.attrs['id'] = 'iiiii'  # add or change a single attribute (here, id)
print(soup)
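Besides working on the attrs dictionary directly, a tag can be indexed like a dictionary. The sketch below is an illustration under assumptions: the one-line markup and the variable names are hypothetical, and it uses get() and has_attr(), which the text above does not mention, to read attributes that may be absent.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document used only for this illustration.
sample_soup = BeautifulSoup(
    '<a class="sister big" id="link1">Elsie</a>', features="html.parser"
)
link = sample_soup.find("a")

print(link["id"])             # link1
print(link["class"])          # class is multi-valued, so this is a list: ['sister', 'big']
print(link.get("href"))       # None: get() avoids a KeyError for a missing attribute
print(link.has_attr("href"))  # False

link["href"] = "http://example.com/elsie"  # add a new attribute
del link["id"]                              # remove an existing attribute
print(sorted(link.attrs))     # ['class', 'href']
```

Indexing with link["href"] before the attribute exists would raise KeyError, which is why get() is the safer choice when scraping pages with inconsistent markup.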

③ children: all direct child nodes

# Direct children only (both tags and text nodes), not deeper descendants
tag = soup.find('a')
# <a class="sister0" id="link1">Els<span>f</span>ie</a>
v = tag.children
print(v)
# <list_iterator object at 0x02E71230>
for i in v:
    print(i)
# Els
# <span>f</span>
# ie
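As the output shows, children yields text nodes as well as tags. A minimal sketch of telling the two apart, assuming the same <a> markup as a standalone hypothetical document and using the Tag and NavigableString classes from bs4.element:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical one-line document mirroring the first <a> tag above.
sample_soup = BeautifulSoup(
    '<a id="link1">Els<span>f</span>ie</a>', features="html.parser"
)
link = sample_soup.find("a")

child_tags = []   # names of direct children that are tags
child_texts = []  # direct children that are plain text
for child in link.children:
    if isinstance(child, Tag):
        child_tags.append(child.name)
    elif isinstance(child, NavigableString):
        child_texts.append(str(child))

print(child_tags)          # ['span']
print(child_texts)         # ['Els', 'ie']
print(len(link.contents))  # 3: contents holds the same nodes as a list
```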

④ descendants: all nested nodes, at every depth

# Get every descendant (children, grandchildren, and so on)
body = soup.find('body')
v = body.descendants
print(v)
for i in v:
    print(i)
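To make the difference from children concrete, here is a small sketch that counts both on the same element. The markup and variable names are hypothetical and chosen only to keep the numbers easy to verify by hand.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: a <div> with two <p> children, one of which nests a <b>.
sample_soup = BeautifulSoup(
    "<div><p>one <b>two</b></p><p>three</p></div>", features="html.parser"
)
div = sample_soup.find("div")

children = list(div.children)        # only the two <p> tags
descendants = list(div.descendants)  # every nested node, in document order
descendant_tags = [node.name for node in descendants if isinstance(node, Tag)]

print(len(children))     # 2
print(len(descendants))  # 6: <p>, 'one ', <b>, 'two', <p>, 'three'
print(descendant_tags)   # ['p', 'b', 'p']
```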

⑤ clear: removes all of a tag's children (the tag itself is kept)

# Empty the tag's contents
tag = soup.find('body')
tag.clear()
print(soup)

# <html><head><title>The Dormouse's story</title></head>
# <body></body>
# </html>

⑥ decompose: recursively removes the tag and everything inside it

# Recursively remove the tag and all of its contents
body = soup.find('body')
body.decompose()
print(soup)

# <html><head><title>The Dormouse's story</title></head>
#
# </html>

⑦ extract: recursively removes the tag and returns the removed tag

# Recursively remove the tag and all of its contents, and keep the removed tag
body = soup.find('body')
v = body.extract()
print(soup)
# <html><head><title>The Dormouse's story</title></head>
#
# </html>
print(v,type(v))
# <body>
# asdf
#     <div class="title">
#         <b>The Dormouse's story总共</b>
#         <h1>f</h1>
#     </div>
# <div class="story">Once upon a time there were three little sisters; and their names were
#     <a class="sister0" id="link1">Els<span>f</span>ie</a>,
#     <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
#     <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</div>
# ad<br/>sf
# <p class="story">...</p>
# </body> <class 'bs4.element.Tag'>
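The last three methods are easy to confuse, so the sketch below runs clear(), decompose(), and extract() side by side on fresh copies of the same document. SAMPLE and fresh_body() are hypothetical names introduced only for this comparison.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document; each method gets its own fresh copy.
SAMPLE = "<html><head><title>t</title></head><body><p>hello</p></body></html>"


def fresh_body():
    """Parse SAMPLE again and return (soup, body tag)."""
    soup = BeautifulSoup(SAMPLE, features="html.parser")
    return soup, soup.find("body")


# clear(): the children are removed, the <body> tag itself stays.
soup_a, body_a = fresh_body()
body_a.clear()
after_clear = str(soup_a)

# decompose(): <body> and its contents are removed and destroyed.
soup_b, body_b = fresh_body()
body_b.decompose()
after_decompose = str(soup_b)

# extract(): <body> is removed from the tree but returned for further use.
soup_c, body_c = fresh_body()
removed = body_c.extract()
after_extract = str(soup_c)

print(after_clear)      # <html><head><title>t</title></head><body></body></html>
print(after_decompose)  # <html><head><title>t</title></head></html>
print(after_extract)    # <html><head><title>t</title></head></html>
print(removed)          # <body><p>hello</p></body>
print(removed.parent)   # None: the extracted tag is now the root of its own tree
```

A rough rule of thumb: use extract() when the removed markup is still needed (for example, to move it elsewhere), and decompose() when it is simply unwanted.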

Time: 2024-11-03 22:10:30
