Using XPath and BeautifulSoup in Python

XPath basics

XPath syntax: path expressions select a node or a set of nodes in an XML or HTML document.

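Path expressions

nodename: selects the child nodes of the context node that are named nodename

/ : selects from the root node

// : selects matching nodes anywhere in the document, regardless of depth

. : selects the current node

.. : selects the parent of the current node

@ : selects attributes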
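The path expressions above can be tried out with lxml. This is a minimal sketch: the XML document is a hypothetical sample that reuses the classroom, student, and teacher element names from the examples in this article, not the author's actual file.

```python
from lxml import etree

# Hypothetical sample document (an assumption, not the author's xpathText.xml)
xml = b"""<classroom>
  <student><id>1001</id><name lang="en">marry</name><age>20</age></student>
  <student><id>1002</id><name lang="en">jack</name><age>25</age></student>
  <teacher><classid>1</classid><name lang="en">tom</name></teacher>
</classroom>"""
root = etree.fromstring(xml)

from_root = root.xpath('/classroom/student')   # "/"  : absolute path from the root
anywhere = root.xpath('//name')                # "//" : name elements at any depth
first_student = from_root[0]
current = first_student.xpath('.')             # "."  : the context node itself
parent = first_student.xpath('..')             # ".." : the parent of the context node
langs = root.xpath('//name/@lang')             # "@"  : attribute values

print(len(from_root), len(anywhere), current[0].tag, parent[0].tag, langs)
```

Relative expressions such as . and .. are evaluated against the element that xpath() is called on, which is why they are run on first_student rather than on root.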

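Predicate examples

Effect: path expression

Select the first student element that is a child of classroom: /classroom/student[1]

Select the last student element that is a child of classroom: /classroom/student[last()]

Select the second-to-last student element that is a child of classroom: /classroom/student[last()-1]

Select the first two student elements that are children of classroom: /classroom/student[position()<3]

Select all name elements that have an attribute named lang: //name[@lang]

Select all name elements whose lang attribute has the value en: //name[@lang='en']

Select all student elements of classroom whose age element is greater than 20: /classroom/student[age>20]

Select all name elements of the student elements of classroom whose age element is greater than 20: /classroom/student[age>20]/name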
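A short sketch of the predicates above, again on a hypothetical three-student document (the names and ages are made up for illustration):

```python
from lxml import etree

# Hypothetical sample data; only the element names come from the article
xml = b"""<classroom>
  <student><name lang="en">marry</name><age>19</age></student>
  <student><name lang="en">jack</name><age>22</age></student>
  <student><name>lucy</name><age>25</age></student>
</classroom>"""
root = etree.fromstring(xml)

first = root.xpath('/classroom/student[1]/name/text()')               # positions start at 1
last = root.xpath('/classroom/student[last()]/name/text()')
first_two = root.xpath('/classroom/student[position()<3]/name/text()')
with_lang = root.xpath('//name[@lang]/text()')                        # name elements that have a lang attribute
older = root.xpath('/classroom/student[age>20]/name/text()')          # filter on a child element's value

print(first, last, first_two, with_lang, older)
```

Note that XPath positions are 1-based, unlike Python list indexes.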

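The "*" wildcard and the "|" operator

Effect: path expression

Select all child elements of classroom: /classroom/*

Select all elements in the document: //*

Select all name elements that have at least one attribute: //name[@*]

Select all name and age elements of student elements: //student/name | //student/age

Select all name elements of the student elements of classroom, plus every age element in the document: /classroom/student/name | //age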
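The wildcard and union expressions can be checked the same way. The document below is a hypothetical sample, and the sketch only looks at the tags of the returned elements:

```python
from lxml import etree

# Hypothetical sample data
xml = b"""<classroom>
  <student><name lang="en">marry</name><age>19</age></student>
  <student><name>jack</name><age>22</age></student>
  <teacher><name lang="en">tom</name></teacher>
</classroom>"""
root = etree.fromstring(xml)

children = [e.tag for e in root.xpath('/classroom/*')]          # every child element of classroom
all_elements = root.xpath('//*')                                # every element in the document
with_any_attr = root.xpath('//name[@*]/text()')                 # name elements with at least one attribute
union = [e.tag for e in root.xpath('//student/name | //student/age')]

print(children, len(all_elements), with_any_attr, union)
```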

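An XPath axis step has the form axisname::nodetest[predicate]

Axis name: meaning

child: selects all children of the current node

parent: selects the parent of the current node

ancestor: selects all ancestors (parent, grandparent, and so on) of the current node

ancestor-or-self: selects all ancestors of the current node and the current node itself

descendant: selects all descendants of the current node

descendant-or-self: selects all descendants of the current node and the current node itself

preceding: selects all nodes that come before the start tag of the current node in the document (its ancestors are not included)

following: selects all nodes that come after the end tag of the current node in the document

preceding-sibling: selects all siblings before the current node

following-sibling: selects all siblings after the current node

self: selects the current node

attribute: selects all attributes of the current node

namespace: selects all namespace nodes of the current node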

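XPath axis examples

Effect: path expression

Select the teacher nodes among the children of classroom: /classroom/child::teacher

Select the parent of every id node: //id/parent::*

Select all ancestors of every classid node: //classid/ancestor::*

Select all descendants of the classroom node: /classroom/descendant::*

Select all id elements that are descendants of a student element: //student/descendant::id

Select every classid element together with its ancestors: //classid/ancestor-or-self::*

Select /classroom/student itself and all of its descendant elements: /classroom/student/descendant-or-self::*

Select all siblings before /classroom/teacher, which in the sample document means all student nodes: /classroom/teacher/preceding-sibling::*

Select all siblings after the second student in /classroom: /classroom/student[2]/following-sibling::*

Select all nodes before /classroom/teacher (excluding its ancestors), which covers not only the student nodes but also their children: /classroom/teacher/preceding::*

Select all nodes after the second student in /classroom, which in the sample document means the teacher node and its children: /classroom/student[2]/following::*

Select the student nodes themselves; not very useful on its own: //student/self::*

Select all attributes of the /classroom/teacher/name node: /classroom/teacher/name/attribute::*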
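A few of the axis expressions above, run against a hypothetical document with two students followed by one teacher. The sketch sorts or counts the results of the reverse axes instead of relying on the order in which they are returned:

```python
from lxml import etree

# Hypothetical sample data: two students, then one teacher
xml = b"""<classroom>
  <student><id>1</id><name>marry</name></student>
  <student><id>2</id><name>jack</name></student>
  <teacher><classid>1</classid><name lang="en">tom</name></teacher>
</classroom>"""
root = etree.fromstring(xml)

teachers = root.xpath('/classroom/child::teacher')
ancestors = sorted(e.tag for e in root.xpath('//classid/ancestor::*'))
before_teacher = [e.tag for e in root.xpath('/classroom/teacher/preceding-sibling::*')]
preceding_all = root.xpath('/classroom/teacher/preceding::*')       # students and their children
after_second = [e.tag for e in root.xpath('/classroom/student[2]/following::*')]
attrs = root.xpath('/classroom/teacher/name/attribute::*')          # attribute values as strings

print(len(teachers), ancestors, before_teacher, len(preceding_all), after_second, attrs)
```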

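XPath operator examples

Meaning: example

Select all student elements of classroom whose age element equals 20. Each of the following expressions is equivalent:

/classroom/student[age=19+1]
/classroom/student[age=5*4]
/classroom/student[age=21-1]
/classroom/student[age=40 div 2]

Comparisons such as greater than, less than, and not equal work the same way

or example: /classroom/student[age<20 or age>25] (age below 20 or above 25)

and example: /classroom/student[age>20 and age<25] (age strictly between 20 and 25)

mod: computes the remainder of a division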
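The operators can be exercised on a small hypothetical document with four students aged 19, 20, 23, and 26. Note that div and mod are written with spaces around them:

```python
from lxml import etree

# Hypothetical sample data
xml = b"""<classroom>
  <student><name>a</name><age>19</age></student>
  <student><name>b</name><age>20</age></student>
  <student><name>c</name><age>23</age></student>
  <student><name>d</name><age>26</age></student>
</classroom>"""
root = etree.fromstring(xml)

def names(predicate):
    """Return the names of the students that satisfy the given predicate."""
    return list(root.xpath('/classroom/student[%s]/name/text()' % predicate))

equal_sum = names('age=19+1')
equal_div = names('age=40 div 2')
outside = names('age<20 or age>25')
between = names('age>20 and age<25')
odd = names('age mod 2 = 1')

print(equal_sum, equal_div, outside, between, odd)
```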

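Example code

from lxml import etree

# Read the file as bytes so that lxml can honor an XML encoding declaration, if there is one
with open(r'xpathText.xml', 'rb') as contentStream:
    raw = contentStream.read()
content = raw.decode('utf-8')
root = etree.XML(raw)
print(content)
print('-------')
# Everything after the end tag of the second student
em = root.xpath('/classroom/student[2]/following::*')
print(em[0].xpath('./name/text()'))  # text content of the name tag
print(em[0].xpath('./name/@lang'))   # value of the lang attribute of the name tag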

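BeautifulSoup basics

There are two ways to create a BeautifulSoup object.

1. From a string: soup = BeautifulSoup(html_str, 'lxml'), where 'lxml' names the parser to use

2. From a file: soup = BeautifulSoup(open('index.html'), 'lxml')

There are four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment

1) Tag

Each tag in the HTML, together with its content, is a Tag object. How do you extract a Tag?

soup.title extracts title and soup.a extracts a. Writing soup.<tag name> returns only the first matching tag in the document

A Tag has two especially important properties: name and attributes. Every Tag has a name, available through .name

Changing a Tag's name affects all HTML generated afterwards from the current Beautiful Soup object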

from bs4 import BeautifulSoup

html_str = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
    <a href="http://example.com/lacie" class="sister" id="link2">
        <!--Lacie -->
    </a>
    and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
</p>
<p class="story">……</p>
</body>
</html>
"""
soup = BeautifulSoup(html_str, 'lxml')
# soup = BeautifulSoup(open(r'index.html', 'rb'), 'lxml')
print(soup.prettify())  # print the document in a formatted layout
print(soup.name)
print(soup.title.name)  # print the name of the title tag
soup.title.name = 'mytitle'  # rename the title tag to mytitle
print(soup.title)    # the title tag no longer exists, so this prints None
print(soup.mytitle)  # print the mytitle Tag

Output

(the whole document)
[document]
title
None
<mytitle>The Dormouse's story</mytitle>

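Getting Tag attributes: in <p class="title"><b>The Dormouse's story</b></p>, the p Tag has a class attribute with the value title. It can be read as follows:

# Read a Tag attribute, much like a dictionary lookup
print(soup.p['class'])
print(soup.p.get('class'))

Output

['title']
['title']

Changing an attribute value works like the tag rename above: soup.p['class'] = 'myclass' changes the value from title to myclass

To get all attributes of a Tag, use print(soup.p.attrs), which returns a dictionary that maps attribute names to values

The output looks like this

{'class': ['title']}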
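Reading, changing, adding, and deleting attributes all use dictionary-style access. The following is a small sketch on a hypothetical one-line fragment; it uses the built-in 'html.parser' instead of 'lxml' only so that the example has no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; 'html.parser' is chosen here only to keep the sketch dependency-free
soup = BeautifulSoup('<p class="title" id="intro"><b>text</b></p>', 'html.parser')
p = soup.p

before = p['class']          # class is multi-valued, so it is read as a list
p['class'] = 'myclass'       # change an existing attribute
p['data-note'] = 'added'     # add a new attribute
del p['id']                  # delete an attribute
missing = p.get('id')        # .get() returns None where p['id'] would raise KeyError

print(before, p.attrs, missing)
```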

2) NavigableString: once you have a tag, how do you get the text inside it? Use .string.

print(soup.b.string)  # the content of Tag b
print(type(soup.b.string))  # the type of that content, which is NavigableString

Output

The Dormouse's story
<class 'bs4.element.NavigableString'>

3) BeautifulSoup

The BeautifulSoup object represents the whole content of a document. Most of the time it can be treated as a Tag object; it is a special kind of Tag. Example:

print(type(soup.name))
print(soup.name)
print(soup.attrs)

Output

<class 'str'>
[document]
{}

4) Comment: a comment in the document. Example:

print(soup.a.string)
print(type(soup.a.string))

Output

Elsie
<class 'bs4.element.Comment'>
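Because .string returns a comment's text just like ordinary text, code that wants only visible text usually needs a type check. A minimal sketch follows; visible_text is a hypothetical helper and the two-link fragment is made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical fragment: one link holds only a comment, the other holds plain text
soup = BeautifulSoup('<a id="link1"><!-- Elsie --></a><a id="link3">Tillie</a>', 'html.parser')

def visible_text(tag):
    """Hypothetical helper: return tag.string unless it is missing or a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)

first = soup.find('a', id='link1')
third = soup.find('a', id='link3')

# Comment is a subclass of NavigableString, so both checks are true for first.string
print(isinstance(first.string, Comment), isinstance(first.string, NavigableString))
print(visible_text(first), visible_text(third))
```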

Traversing the document

1) Child nodes

A Tag's .contents and .children are both important, and both give its direct children. .contents returns the children as a list:

print(soup.html.contents)
print(soup.html.contents[1])  # soup.html.contents[1].string would print the text inside directly; see the explanation of .string below

Output

['\n', <head><mytitle>The Dormouse's story</mytitle></head>, '\n', <body>
<p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<a class="sister" href="http://example.com/lacie" id="link2">
<!--Lacie -->
</a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
</p><p class="story">……</p>
</body>, '\n']
<head><mytitle>The Dormouse's story</mytitle></head>

A Tag's .children returns an iterator rather than a list, so you can loop over the direct children

for child in soup.html.children:  # loops over the direct children only, not recursively
    print(child)

Output (each newline string shows up as an extra blank line, because print adds its own newline)

<head><mytitle>The Dormouse's story</mytitle></head>

<body>
<p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<a class="sister" href="http://example.com/lacie" id="link2">
<!--Lacie -->
</a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
</p><p class="story">……</p>
</body>

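.descendants iterates recursively over all descendants of a tag. head has only one direct child, title, but title itself contains a child: the string 'The Dormouse's story'.

In that case the string is also a descendant of the <head> tag.

for child in soup.head.descendants:  # recursive loop over all descendants
    print(child)

Output

<mytitle>The Dormouse's story</mytitle>
The Dormouse's story

How do you get the text of a tag? That is what the .string, .strings, and .stripped_strings properties are for

.string has three cases. If a tag contains no other tag, .string returns its text. If a tag contains exactly one child tag, .string returns the innermost text of that child. If a tag contains several children, it cannot tell which child's text to use, and .string is None

print(soup.head.string)
print(soup.mytitle.string)
print(soup.html.string)

Output

The Dormouse's story
The Dormouse's story
None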

.strings is mainly for tags that contain more than one string; you can loop over it

for stri in soup.strings:
    print(repr(stri))

Output

'\n'
"The Dormouse's story"
'\n'
'\n'
"The Dormouse's story"
'Once upon a time there were three little sisters; and their names were\n    '
'\n'
'\n'
'\n'
'\n    and\n    '
'Tillie'
';\n    and they lived at the bottom of a well.\n'
'……'
'\n'
'\n'

.stripped_strings skips whitespace-only strings and strips leading and trailing whitespace from the rest. Example:

for stri in soup.stripped_strings:
    print(repr(stri))

Output

"The Dormouse's story"
"The Dormouse's story"
'Once upon a time there were three little sisters; and their names were'
'and'
'Tillie'
';\n    and they lived at the bottom of a well.'
'……'

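2) Parent nodes

Every Tag and every string has a parent: the Tag that contains it. .parent returns the parent of an element

print(soup.mytitle.parent) prints <head><mytitle>........</mytitle></head>

.parents iterates over all ancestors of an element. The loop below uses it to walk from the <a> tag up to the root of the document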

print(soup.a)
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)

Output

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
p
body
html
[document]

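3) Sibling nodes: nodes on the same level as the current node. .next_sibling returns the next sibling, and .previous_sibling returns the previous one.

If there is no such node, the result is None

.next_siblings and .previous_siblings iterate over all following or all preceding siblings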
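The sibling properties can be sketched on a hypothetical paragraph with three links. Strings between tags count as siblings too, so the sketch filters for Tag objects when it only wants the links.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment: three links separated by plain text
html = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('a', id='link1')
last = soup.find('a', id='link3')

next_node = first.next_sibling            # the string ', ', not the next <a> tag
prev_node = first.previous_sibling        # None: first is the first child of <p>

# Keep only Tag siblings; .previous_siblings walks backwards from the starting node
after_first = [s['id'] for s in first.next_siblings if isinstance(s, Tag)]
before_last = [s['id'] for s in last.previous_siblings if isinstance(s, Tag)]

print(repr(next_node), prev_node, after_first, before_last)
```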

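4) Previous and next elements

.next_element and .previous_element cover every node regardless of level. For example, in <head><title>The Dormouse's story</title></head>

the next element after head is title

To walk through everything before or after a node, the .next_elements and .previous_elements iterators move forward or backward through the parsed document

for elem in soup.html.next_elements:  # similar to a depth-first traversal
    print(repr(elem))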

Output

'\n'
<head><mytitle>The Dormouse's story</mytitle></head>
<mytitle>The Dormouse's story</mytitle>
"The Dormouse's story"
'\n'
<body>
<p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<a class="sister" href="http://example.com/lacie" id="link2">
<!--Lacie -->
</a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
</p><p class="story">……</p>
</body>
'\n'
<p class="title"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
"The Dormouse's story"
<p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<a class="sister" href="http://example.com/lacie" id="link2">
<!--Lacie -->
</a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
</p>
'Once upon a time there were three little sisters; and their names were\n    '
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
' Elsie '
'\n'
<a class="sister" href="http://example.com/lacie" id="link2">
<!--Lacie -->
</a>
'\n'
'Lacie '
'\n'
'\n    and\n    '
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
'Tillie'
';\n    and they lived at the bottom of a well.\n'
<p class="story">……</p>
'……'
'\n'
'\n'

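Searching the document

Only find_all() is covered here; the other search methods work similarly

Signature

find_all(name, attrs, recursive, text, limit, **kwargs)

In Beautiful Soup 4.4.0 and later, the text argument is also available under the name string.

1) The name argument

name finds all tags with the given name; string nodes are ignored. It can be a string, a regular expression, a list, True, or a function

A string finds every tag with that name, for example all <b> tags, and the result is a list:

print(soup.find_all('b'))
# Output
[<b>The Dormouse's story</b>]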

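With a regular expression, Beautiful Soup filters tag names using the pattern's search() method. The following lists all tags whose names start with b, which here means <body> and <b>

import re

for tag in soup.find_all(re.compile('^b')):
    print(tag.name)
# Output
body
b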
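A quick way to see that the pattern is applied with search() semantics is an unanchored pattern, which also matches in the middle of a tag name. The following sketch uses a hypothetical minimal document:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical minimal document
html = '<html><head><title>T</title></head><body><b>x</b></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# Anchored pattern: only tag names that start with "b"
starts_with_b = [tag.name for tag in soup.find_all(re.compile('^b'))]

# Unanchored pattern: any tag name that contains a "t" anywhere
contains_t = [tag.name for tag in soup.find_all(re.compile('t'))]

print(starts_with_b, contains_t)
```

If only whole-name matches are wanted, the pattern has to be anchored explicitly, for example with ^ and $.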

Passing a list

print(soup.find_all(['a', 'b']))  # finds all <a> tags and all <b> tags

Passing True: True matches every tag name, so all tags are returned, but no string nodes

for tag in soup.find_all(True):
    print(tag.name)
# Output
html
head
mytitle
body
p
b
p
a
a
a
p

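If no other filter fits, you can define a function that takes a single Tag argument. It returns True if the element matches

and False otherwise. For example, to keep only elements that have both a class attribute and an id attribute

def hasClass_Id(tag):
    return tag.has_attr('class') and tag.has_attr('id')
print(soup.find_all(hasClass_Id))
# Output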
[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">
<!--Lacie -->
</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

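2) The kwargs arguments

kwargs are ordinary Python keyword arguments. A keyword argument whose name is not one of the built-in search parameters is treated as a filter on the tag attribute with that name.

The value can be a string, a regular expression, a list, or True

String: print(soup.find_all(id='link2')) searches the id attribute of every tag

Regular expression: print(soup.find_all(href=re.compile('elsie'))) finds tags whose href attribute contains 'elsie'

True: print(soup.find_all(id=True)) finds every Tag that has an id attribute, whatever its value

To filter by class, append an underscore to the argument name, because class is a Python keyword:

soup.find_all('a', class_='sister')

Some attribute names cannot be used as keyword arguments, such as the HTML5 data-* attributes. For those, pass a dictionary through the attrs parameter of find_all()

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'lxml')
print(data_soup.find_all(attrs={"data-foo": "value"}))
# data_soup.find_all(data-foo='value')  # SyntaxError: such attribute names cannot be passed as keywords
# Output
[<div data-foo="value">foo!</div>]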
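The attribute filters above can be combined into one runnable sketch. The fragment is a hypothetical, shortened version of the sample document, parsed with the built-in 'html.parser' to avoid an extra dependency.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment with two links and one div that carries a data-* attribute
html = '''<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<div data-foo="value">foo!</div>'''
soup = BeautifulSoup(html, 'html.parser')

by_id = soup.find_all(id='link2')                      # exact string match on id
by_href = soup.find_all(href=re.compile('elsie'))      # regular expression on href
with_id = soup.find_all(id=True)                       # any tag that has an id
by_class = soup.find_all('a', class_='sister')         # class_ because class is a keyword
by_data = soup.find_all(attrs={'data-foo': 'value'})   # attribute names that are not valid identifiers

print([t['id'] for t in by_id], [t['id'] for t in by_href],
      len(with_id), len(by_class), by_data[0].name)
```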

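3) The text argument

text searches the strings in the document. Like name, it accepts a string, a regular expression, a list, or True

print(soup.find_all(text=["Tillie", "Elsie", "Lacie"]))
print(soup.find_all(text=re.compile("Dormouse")))

Output

['Elsie', 'Lacie', 'Tillie']
["The Dormouse's story", "The Dormouse's story"]

The first result assumes a document in which all three names are plain link text. In html_str above, Elsie and Lacie sit inside comments with surrounding spaces, so an exact-string match would likely return only 'Tillie'.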

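4) The limit argument

find_all() returns every match, which can be slow on a large document tree. If you do not need all results, use limit to cap the number returned

soup.find_all('a', limit=2) returns only two results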
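A small sketch of limit on a hypothetical three-link fragment. It also shows find(), which belongs to the "other methods" mentioned earlier and returns the first match directly instead of a list.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with three links
html = '<p><a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, 'html.parser')

first_two = soup.find_all('a', limit=2)   # the search stops after two matches
first_only = soup.find('a')               # a single Tag (or None), not a list

print([a['id'] for a in first_two], first_only['id'])
```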

5) The recursive argument

By default, find_all() on a tag searches all of its descendants. To search only the direct children, pass

recursive=False

print(soup.find_all('mytitle'))
print(soup.find_all('mytitle', recursive=False))
# Output
[<mytitle>The Dormouse's story</mytitle>]
[]

Original article: https://www.cnblogs.com/mnzht/p/9009753.html

Time: 2024-08-01 14:55:28
