Python Web Scraping: BeautifulSoup (1)

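BeautifulSoup is a tool for parsing content that has been scraped from the web; its find and find_all methods are especially useful. Parsing turns the page into a tree structure in which tags can be looked up by name, much like key-value pairs in JSON, which makes the page content easier and more convenient to work with.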

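Installing the library needs little explanation: with Python's pip, just run pip install beautifulsoup4 in cmd. (The package name on PyPI is beautifulsoup4; it is imported as bs4.)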
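Once the install finishes, one quick way to confirm it worked is to import the package and parse a trivial snippet. This is only a minimal sketch; the variable name check and the sample markup are made up for illustration.

```python
# Sanity check that Beautiful Soup is importable after installation.
import bs4
from bs4 import BeautifulSoup

print(bs4.__version__)  # version string of the installed package

# Parse a tiny, made-up snippet with Python's built-in parser.
check = BeautifulSoup("<p>ok</p>", "html.parser")
print(check.p.string)  # ok
```

If the import fails with ModuleNotFoundError, the package was most likely installed into a different Python environment than the one running the script.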

First, following the official documentation, copy over its example code:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.find_all('a'))

html_doc stands in for the content we scraped; for convenience, the sample from the documentation is used here directly.

We parse html_doc directly, using the html.parser parser that ships with Python.

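After typing this into Sublime Text, press Ctrl+B to run it. (I recommend installing the SublimePythonIDE plugin package, which lets you run Python directly without going through cmd.)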

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[Finished in 0.2s]

The output is shown above: find_all('a') returned a list of every <a> tag in the document.
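Since find and find_all behave differently when it comes to return values, a small comparison may help. The sketch below uses a shortened copy of the sample document; the names sample, sample_soup, first and all_links are illustrative only.

```python
from bs4 import BeautifulSoup

# A reduced version of the sample document, kept short for illustration.
sample = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
sample_soup = BeautifulSoup(sample, "html.parser")

first = sample_soup.find("a")          # a single Tag, or None if nothing matches
all_links = sample_soup.find_all("a")  # a list-like result, possibly empty

print(first["id"])                    # link1
print(len(all_links))                 # 2
print(sample_soup.find("table"))      # None
print(sample_soup.find_all("table"))  # []
```

Because find returns None when there is no match, it is worth checking the result before indexing into it.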

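Next, following the documentation, we access different parts of soup and print the results. (This is pasted straight from the official site, so it is not explained again here.)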

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# 'title'

soup.title.string
# "The Dormouse's story"

soup.title.parent.name
# 'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# ['title']

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

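As the examples above show, the parsed soup exposes the document as a tree whose tags can be looked up by name, much like key-value data; expressions such as soup.title return the corresponding content. (Lines starting with # show the output.)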
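The key-value analogy applies most directly to a tag's attributes, which can be read with dictionary-style syntax. Here is a minimal sketch using one link from the sample document; the names snippet and tag are illustrative.

```python
from bs4 import BeautifulSoup

snippet = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
tag = BeautifulSoup(snippet, "html.parser").a

print(tag.name)          # a
print(tag["href"])       # http://example.com/elsie
print(tag["class"])      # ['sister'] (class can hold several values, so it is a list)
print(tag.attrs)         # all attributes as a plain dict
print(tag.get("title"))  # None, since the tag has no title attribute
```

Indexing with tag["title"] would raise KeyError for a missing attribute, so get is the safer choice when an attribute may be absent.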

There are some other methods as well.

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

With a for loop it is easy to operate on each child element of a complex parent container. (Lines starting with # show the output.)
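The same loop can also gather results into a data structure instead of printing them. The sketch below maps each link's text to its URL; the names sample, sample_soup and links are illustrative, and filtering on class_="sister" is just one possible way to narrow the match.

```python
from bs4 import BeautifulSoup

sample = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
</p>
"""
sample_soup = BeautifulSoup(sample, "html.parser")

# Map link text to href for every <a> tag with class "sister".
links = {a.get_text(): a.get("href")
         for a in sample_soup.find_all("a", class_="sister")}

for name, url in links.items():
    print(name, url)
# Elsie http://example.com/elsie
# Lacie http://example.com/lacie
# Tillie http://example.com/tillie
```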

The last item in the official quick start strips all the markup from the page and shows only its text. It is done as follows:

print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...

This conveniently prints the text content of the page directly.
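get_text also accepts a separator string and a strip flag, which can help when the raw text carries a lot of stray whitespace. A minimal sketch on a made-up fragment (fragment and fragment_soup are illustrative names):

```python
from bs4 import BeautifulSoup

fragment = "<div><h1> Title </h1><p>First <b>bold</b> line.</p></div>"
fragment_soup = BeautifulSoup(fragment, "html.parser")

# Default: text pieces are concatenated exactly as they appear.
raw = fragment_soup.get_text()

# strip=True trims each piece; the first argument is placed between pieces.
spaced = fragment_soup.get_text(" ", strip=True)
piped = fragment_soup.get_text("|", strip=True)

print(repr(raw))  # ' Title First bold line.'
print(spaced)     # Title First bold line.
print(piped)      # Title|First|bold|line.
```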

That covers the basic usage of BeautifulSoup.

Time: 2024-10-14 00:42:34
