Web Scraping: BeautifulSoup

What is BeautifulSoup?

Parser libraries supported by BeautifulSoup

Basic usage

from bs4 import BeautifulSoup

html ="""
    <html><head><title> The Dormouse‘s story</title></head>
    <body>
    <p class="title" name="dromouse"> <b> The Dormouse‘s story</b></p>
    <p class="story">Once upon a time there were three little sisters;and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"> <!--Elsie--></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">..</p>
"""

soup = BeautifulSoup(html, "lxml")
print(soup.prettify())    # .prettify() pretty-prints the parsed document
print(soup.title.string)  # .title.string is the text inside <title>

<html>
 <head>
  <title>
   The Dormouse‘s story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse‘s story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters;and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!--Elsie-->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
    and they lived at the bottom of a well.
  </p>
  <p class="story">
   ..
  </p>
 </body>
</html>
 The Dormouse‘s story

The lines above are the printed output.

Tag selectors

Selecting elements

from bs4 import BeautifulSoup

html ="""
    <html><head><title> The Dormouse‘s story</title></head>
    <body>
    <p class="title" name="dromouse"> <b> The Dormouse‘s story</b></p>
    <p class="story">Once upon a time there were three little sisters;and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"> <!--Elsie--></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">..</p>
"""

soup=BeautifulSoup(html,"lxml")
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)

<title> The Dormouse‘s story</title>
<class 'bs4.element.Tag'>
<head><title> The Dormouse‘s story</title></head>
<p class="title" name="dromouse"> <b> The Dormouse‘s story</b></p>

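The lines above are the printed output. When several tags share a name, soup.p returns only the first one.

Getting the tag name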

from bs4 import BeautifulSoup

html ="""
    <html><head><title> The Dormouse‘s story</title></head>
    <body>
    <p class="title" name="dromouse"> <b> The Dormouse‘s story</b></p>
    <p class="story">Once upon a time there were three little sisters;and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"> <!--Elsie--></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">..</p>
"""

soup=BeautifulSoup(html,"lxml")
print(soup.title.name)

title

The line above is the printed output.

Getting attributes

from bs4 import BeautifulSoup

html ="""
    <html><head><title> The Dormouse‘s story</title></head>
    <body>
    <p class="title" name="dromouse"> <b> The Dormouse‘s story</b></p>
    <p class="story">Once upon a time there were three little sisters;and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"> <!--Elsie--></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">..</p>
"""

soup=BeautifulSoup(html,"lxml")
print(soup.p.attrs["name"])
print(soup.p["name"])

dromouse
dromouse

The lines above are the printed output.

Getting content

from bs4 import BeautifulSoup

html ="""
    <html><head><title> The Dormouse‘s story</title></head>
    <body>
    <p class="title" name="dromouse"> <b> The Dormouse‘s story</b></p>
    <p class="story">Once upon a time there were three little sisters;and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"> <!--Elsie--></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">..</p>
"""

soup=BeautifulSoup(html,"lxml")
print(soup.title.string)

 The Dormouse‘s story

The line above is the printed output.
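
As a hedged aside, .string only returns text when a tag has exactly one child; otherwise it is None, and get_text() is the usual alternative. The sketch below uses a made-up snippet (not the article's HTML) and the built-in html.parser, an assumption chosen so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <p> holds a text node plus a <b> tag.
soup = BeautifulSoup("<p> Hello <b> world </b></p>", "html.parser")

single = soup.b.string      # <b> has one text child, so .string returns it
multiple = soup.p.string    # <p> has two children, so .string is None
joined = soup.p.get_text(" ", strip=True)  # strip each piece, join with a space

print(repr(single), multiple, repr(joined))
```

When the text may be spread over nested tags, get_text() is usually the safer choice.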

Nested selection

from bs4 import BeautifulSoup

html ="""
    <html><head><title> The Dormouse‘s story</title></head>
    <body>
    <p class="title" name="dromouse"> <b> The Dormouse‘s story</b></p>
    <p class="story">Once upon a time there were three little sisters;and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"> <!--Elsie--></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">..</p>
"""

soup=BeautifulSoup(html,"lxml")
print(soup.head.title.string)

 The Dormouse‘s story

The line above is the printed output.

Child and descendant nodes

from bs4 import BeautifulSoup

html ="""
    <html>
        <head>
            <title> The Dormouse‘s story</title>
        </head>
    <body>
    <p class="story">
        Once upon a time there were three little sisters;and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">
            <!--Elsie-->
        </a>,
        <a href="http://example.com/lacie" class="sister" id="link2">
            Lacie
        </a>
        and
        <a href="http://example.com/tillie" class="sister" id="link3">
            Tillie
        </a>;
        and
        they lived at the bottom of a well.
    </p>
    <p class="story">
        ..
    </p>
"""

soup=BeautifulSoup(html,"lxml")
print(soup.p.contents)

['\n        Once upon a time there were three little sisters;and their names were\n        ', <a class="sister" href="http://example.com/elsie" id="link1">
<!--Elsie-->
</a>, ',\n        ', <a class="sister" href="http://example.com/lacie" id="link2">
            Lacie
        </a>, '\n        and\n        ', <a class="sister" href="http://example.com/tillie" id="link3">
            Tillie
        </a>, ';\n        and \n        they lived at the bottom of a well.\n    ']

The lines above are the printed output: .contents returns the direct children as a list.

from bs4 import BeautifulSoup

html ="""
    <html>
        <head>
            <title> The Dormouse‘s story</title>
        </head>
    <body>
    <p class="story">
        Once upon a time there were three little sisters;and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">
            <!--Elsie-->
        </a>,
        <a href="http://example.com/lacie" class="sister" id="link2">
            Lacie
        </a>
        and
        <a href="http://example.com/tillie" class="sister" id="link3">
            Tillie
        </a>;
        and
        they lived at the bottom of a well.
    </p>
    <p class="story">
        ..
    </p>
"""

soup=BeautifulSoup(html,"lxml")
print(soup.p.children)  # .children is an iterator; loop over it to retrieve the child nodes
for i,child in enumerate(soup.p.children):
    print(i,child)

<list_iterator object at 0x00000000012C5E10>
0
        Once upon a time there were three little sisters;and their names were

1 <a class="sister" href="http://example.com/elsie" id="link1">
<!--Elsie-->
</a>
2 ,

3 <a class="sister" href="http://example.com/lacie" id="link2">
            Lacie
        </a>
4
        and

5 <a class="sister" href="http://example.com/tillie" id="link3">
            Tillie
        </a>
6 ;
        and
        they lived at the bottom of a well.

The lines above are the printed output.

from bs4 import BeautifulSoup

html ="""
    <html>
        <head>
            <title> The Dormouse‘s story</title>
        </head>
    <body>
    <p class="story">
        Once upon a time there were three little sisters;and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">
            <!--Elsie-->
        </a>,
        <a href="http://example.com/lacie" class="sister" id="link2">
            Lacie
        </a>
        and
        <a href="http://example.com/tillie" class="sister" id="link3">
            Tillie
        </a>;
        and
        they lived at the bottom of a well.
    </p>
    <p class="story">
        ..
    </p>
"""

soup=BeautifulSoup(html,"lxml")
print(soup.p.descendants)  # .descendants is a generator over all descendant nodes; loop over it to retrieve them
for i,child in enumerate(soup.p.descendants):
    print(i,child)

Once upon a time there were three little sisters;and their names were

1 <a class="sister" href="http://example.com/elsie" id="link1">
<!--Elsie-->
</a>
2
3 Elsie
4
5 ,
6 <a class="sister" href="http://example.com/lacie" id="link2">
            Lacie
        </a>
7
            Lacie
8
        and
9 <a class="sister" href="http://example.com/tillie" id="link3">
            Tillie
        </a>
10
            Tillie
11 ;
        and
        they lived at the bottom of a well.

The lines above are the printed output.
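
To make the difference between the two concrete, here is a minimal sketch on a made-up snippet (an assumption, not the article's HTML), parsed with the built-in html.parser so it runs without lxml. .children stops at the direct children, while .descendants also walks into nested tags.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <p> has three direct children; <a> wraps a nested <b>.
soup = BeautifulSoup("<p>one<a><b>two</b></a>three</p>", "html.parser")

children = list(soup.p.children)        # 'one', <a>, 'three'
descendants = list(soup.p.descendants)  # 'one', <a>, <b>, 'two', 'three'

print(len(children), len(descendants))
```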

Parent and ancestor nodes

from bs4 import BeautifulSoup

html ="""
    <html>
        <head>
            <title> The Dormouse‘s story</title>
        </head>
    <body>
    <p class="story">
        Once upon a time there were three little sisters;and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">
            <!--Elsie-->
        </a>,
        <a href="http://example.com/lacie" class="sister" id="link2">
            Lacie
        </a>
        and
        <a href="http://example.com/tillie" class="sister" id="link3">
            Tillie
        </a>;
        and
        they lived at the bottom of a well.
    </p>
    <p class="story">
        ..
    </p>
"""

soup=BeautifulSoup(html,"lxml")
print(soup.a.parent)  # .parent returns the direct parent node of the first <a>

<p class="story">
        Once upon a time there were three little sisters;and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">
<!--Elsie-->
</a>,
        <a class="sister" href="http://example.com/lacie" id="link2">
            Lacie
        </a>
        and
        <a class="sister" href="http://example.com/tillie" id="link3">
            Tillie
        </a>;
        and
        they lived at the bottom of a well.
    </p>

The lines above are the printed output.

from bs4 import BeautifulSoup

html ="""
    <html>
        <head>
            <title> The Dormouse‘s story</title>
        </head>
    <body>
    <p class="story">
        Once upon a time there were three little sisters;and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">
            <!--Elsie-->
        </a>,
        <a href="http://example.com/lacie" class="sister" id="link2">
            Lacie
        </a>
        and
        <a href="http://example.com/tillie" class="sister" id="link3">
            Tillie
        </a>;
        and
        they lived at the bottom of a well.
    </p>
    <p class="story">
        ..
    </p>
"""

soup=BeautifulSoup(html,"lxml")
print(list(enumerate(soup.a.parents)))  # .parents is a generator that yields every ancestor node

[(0, <p class="story">
        Once upon a time there were three little sisters;and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">
<!--Elsie-->
</a>,
        <a class="sister" href="http://example.com/lacie" id="link2">
            Lacie
        </a>
        and
        <a class="sister" href="http://example.com/tillie" id="link3">
            Tillie
        </a>;
        and
        they lived at the bottom of a well.
    </p>), (1, <body>
<p class="story">
        Once upon a time there were three little sisters;and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">
<!--Elsie-->
</a>,
        <a class="sister" href="http://example.com/lacie" id="link2">
            Lacie
        </a>
        and
        <a class="sister" href="http://example.com/tillie" id="link3">
            Tillie
        </a>;
        and
        they lived at the bottom of a well.
    </p>
<p class="story">
        ..
    </p>
</body>), (2, <html>
<head>
<title> The Dormouse‘s story</title>
</head>
<body>
<p class="story">
        Once upon a time there were three little sisters;and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">
<!--Elsie-->
</a>,
        <a class="sister" href="http://example.com/lacie" id="link2">
            Lacie
        </a>
        and
        <a class="sister" href="http://example.com/tillie" id="link3">
            Tillie
        </a>;
        and
        they lived at the bottom of a well.
    </p>
<p class="story">
        ..
    </p>
</body></html>), (3, <html>
<head>
<title> The Dormouse‘s story</title>
</head>
<body>
<p class="story">
        Once upon a time there were three little sisters;and their names were
        <a class="sister" href="http://example.com/elsie" id="link1">
<!--Elsie-->
</a>,
        <a class="sister" href="http://example.com/lacie" id="link2">
            Lacie
        </a>
        and
        <a class="sister" href="http://example.com/tillie" id="link3">
            Tillie
        </a>;
        and
        they lived at the bottom of a well.
    </p>
<p class="story">
        ..
    </p>
</body></html>)]

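The lines above are the printed output. The last entry (index 3) is the BeautifulSoup object itself, which prints the same markup as the <html> element.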
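
Printing whole ancestors is hard to read, so a shorter way to inspect the chain is to print only each ancestor's name. This is a small sketch on a made-up snippet, using the built-in html.parser as an assumption so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a single <a> nested three levels deep.
soup = BeautifulSoup("<html><body><p><a>link</a></p></body></html>", "html.parser")

# .parents walks upward from the tag to the BeautifulSoup object itself,
# whose .name is the special value "[document]".
ancestor_names = [node.name for node in soup.a.parents]
print(ancestor_names)
```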

Sibling nodes

from bs4 import BeautifulSoup

html ="""
    <html>
        <head>
            <title> The Dormouse‘s story</title>
        </head>
    <body>
    <p class="story">
        Once upon a time there were three little sisters;and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">
            <!--Elsie-->
        </a>,
        <a href="http://example.com/lacie" class="sister" id="link2">
            Lacie
        </a>
        and
        <a href="http://example.com/tillie" class="sister" id="link3">
            Tillie
        </a>;
        and
        they lived at the bottom of a well.
    </p>
    <p class="story">
        ..
    </p>
"""

soup=BeautifulSoup(html,"lxml")
print(list(enumerate(soup.a.next_siblings)))      # .next_siblings is a generator over the siblings after the node
print(list(enumerate(soup.a.previous_siblings)))  # .previous_siblings is a generator over the siblings before the node

[(0, ',\n        '), (1, <a class="sister" href="http://example.com/lacie" id="link2">
            Lacie
        </a>), (2, '\n        and\n        '), (3, <a class="sister" href="http://example.com/tillie" id="link3">
            Tillie
        </a>), (4, ';\n        and \n        they lived at the bottom of a well.\n    ')]
[(0, '\n        Once upon a time there were three little sisters;and their names were\n        ')]

The lines above are the printed output.

Standard selectors

find_all(name,attrs,recursive,text,**kwargs)

Searches the document by tag name, attributes, or text content.

from bs4 import BeautifulSoup

html="""
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1>
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</ll>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</ll>
                <ll class="element">Bar</l>
            </ul>
        </div>
    </div>
"""

soup=BeautifulSoup(html,"lxml")
print(soup.find_all("ul"))
print(type(soup.find_all("ul")[0]))

[<ul class="list" element="" id="list-1&gt;&lt;li class=">Foo
<li class="element">Bar</li>
<li class="element">Jay
</li></ul>,
<ul class="list list-small" id="list-2">
<li class="element">Foo
<ll class="element">Bar
</ll></li></ul>]
<class 'bs4.element.Tag'>

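The lines above are the printed output. Note that the sample HTML is malformed: id="list-1> is missing its closing quote, and several closing tags are misspelled as </ll> or </l>. That is why the first <ul> comes out with mangled attributes.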

from bs4 import BeautifulSoup

html="""
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1>
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</ll>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</ll>
                <ll class="element">Bar</l>
            </ul>
        </div>
    </div>
"""

soup=BeautifulSoup(html,"lxml")
for UL in soup.find_all("ul"):
    print(UL.find_all("li"))

[<li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo<ll class="element">Bar</ll></li>]

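The lines above are the printed output. The first list is missing Foo because the unclosed id attribute swallowed the first <li> tag.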

attrs

from bs4 import BeautifulSoup

html="""
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1 name="elements">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</ll>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</ll>
                <ll class="element">Bar</l>
            </ul>
        </div>
    </div>
"""
soup=BeautifulSoup(html,"lxml")
print(soup.find_all(attrs={"id":"list-1"}))
print(soup.find_all(attrs={"name":"elements"}))

[]
[]

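The lines above are the printed output. Both results are empty because the id attribute in the sample HTML is missing its closing quote (id="list-1 name="elements">), so the parser reads the id value as "list-1 name=" and neither lookup matches.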
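
For comparison, here is a hedged sketch of the same two lookups against a shortened, well-formed version of the markup. The corrected HTML is an assumption about what the sample was meant to be, and the built-in html.parser is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Assumed correction of the sample: the id attribute now has its closing quote.
html = """
<ul class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
</ul>
<ul class="list list-small" id="list-2">
    <li class="element">Bar</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find_all(attrs={"id": "list-1"})
by_name = soup.find_all(attrs={"name": "elements"})

# Both lookups now find the same single <ul>.
print(len(by_id), len(by_name), by_id[0] is by_name[0])
```

Passing name through attrs matters here: find_all(name="elements") would be read as a tag-name filter, not as a filter on the name attribute.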

from bs4 import BeautifulSoup

html="""
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1 name="elements">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</ll>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</ll>
                <ll class="element">Bar</l>
            </ul>
        </div>
    </div>
"""
soup=BeautifulSoup(html,"lxml")
print(soup.find_all(id="list-1"))
print(soup.find_all(class_="element"))

[<ul class="list list-small" id="list-2"><li class="element">Foo<ll class="element">Bar</ll></li></ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo<ll class="element">Bar</ll></li>, <ll class="element">Bar</ll>]

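The lines above are the printed output as given in the original post. Note that with the sample HTML exactly as written, the id of the first <ul> is parsed as "list-1 name=" (its closing quote is missing), so id="list-1" cannot match it; the element shown on the first line is the one with id="list-2".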

text

from bs4 import BeautifulSoup

html="""
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1 name="elements">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</ll>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</ll>
                <ll class="element">Bar</l>
            </ul>
        </div>
    </div>
"""
soup=BeautifulSoup(html,"lxml")
print(soup.find_all(text="Foo"))

['Foo']

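The line above is the printed output. Searching by text returns the matching strings, not the tags that contain them.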
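
Since Beautiful Soup 4.4.0 the same argument is also available under the name string. The sketch below, on a made-up list (an assumption, not the article's HTML), shows an exact match, a regular-expression match, and the combination with a tag name, which returns tags rather than strings. It uses the built-in html.parser so it runs without lxml.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup with three list items.
soup = BeautifulSoup("<ul><li>Foo</li><li>Bar</li><li>Food</li></ul>", "html.parser")

exact = soup.find_all(string="Foo")                 # strings equal to "Foo"
pattern = soup.find_all(string=re.compile("^Foo"))  # strings starting with "Foo"
tags = soup.find_all("li", string="Foo")            # <li> tags whose .string is "Foo"

print(exact, pattern, tags)
```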

find(name,attrs,recursive,text,**kwargs)

find() returns a single element (the first match); find_all() returns all matching elements.

from bs4 import BeautifulSoup

html="""
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1 name="elements">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</ll>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</ll>
                <ll class="element">Bar</l>
            </ul>
        </div>
    </div>
"""
soup=BeautifulSoup(html,"lxml")
print(soup.find("ul"))
print(type(soup.find("ul")))
print(soup.find("h4"))

<ul class="list" elements="" id="list-1 name=">
     <li class="element">Foo</li>
     <li class="element">Bar</li>
     <li class="element">Jay</li>
</ul>
<class 'bs4.element.Tag'>
<h4>Hello</h4>

The lines above are the printed output.
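
One practical difference is worth a short sketch: when nothing matches, find() returns None while find_all() returns an empty list, so the result of find() should be checked before use. The markup below is made up for illustration, and the built-in html.parser is an assumption so that the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><h4>Hello</h4></div>", "html.parser")

missing_one = soup.find("table")      # no match: None
missing_all = soup.find_all("table")  # no match: empty list

# Guard against None before touching attributes or text.
heading = soup.find("h4")
heading_text = heading.get_text() if heading is not None else ""

print(missing_one, list(missing_all), heading_text)
```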

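find_parents(), find_parent()
find_parents() returns all ancestor nodes; find_parent() returns the direct parent node.
find_next_siblings(), find_next_sibling()
find_next_siblings() returns all following sibling nodes; find_next_sibling() returns the first following sibling node.
find_previous_siblings(), find_previous_sibling()
find_previous_siblings() returns all preceding sibling nodes; find_previous_sibling() returns the first preceding sibling node.
find_all_next(), find_next()
find_all_next() returns all matching nodes after the current node; find_next() returns the first matching node after it.
find_all_previous(), find_previous()
find_all_previous() returns all matching nodes before the current node; find_previous() returns the first matching node before it.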
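
The following minimal sketch exercises several of these methods from one starting node. The markup and the ids a, b and c are made up for illustration, and the built-in html.parser is an assumption so that the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three list items followed by a paragraph.
html = '<ul><li id="a">A</li><li id="b">B</li><li id="c">C</li></ul><p>end</p>'
soup = BeautifulSoup(html, "html.parser")
b = soup.find(id="b")  # start from the middle <li>

parent_name = b.find_parent("ul").name                          # nearest <ul> ancestor
next_ids = [li["id"] for li in b.find_next_siblings("li")]      # siblings after b
prev_id = b.find_previous_sibling("li")["id"]                   # closest sibling before b
next_p_text = b.find_next("p").get_text()                       # first <p> later in the document
earlier_ids = [li["id"] for li in b.find_all_previous("li")]    # every <li> earlier in the document

print(parent_name, next_ids, prev_id, next_p_text, earlier_ids)
```

The sibling methods stay on the same level of the tree, whereas find_next() and find_all_previous() follow document order and can cross into other parts of the tree.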

CSS selectors

Pass a CSS selector string directly to select() to make the selection.

from bs4 import BeautifulSoup

html="""
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1 name="elements">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</ll>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</ll>
                <ll class="element">Bar</l>
            </ul>
        </div>
    </div>
"""
soup=BeautifulSoup(html,"lxml")
print(soup.select(".panel .panel-heading"))
print(soup.select("ul li"))
print(soup.select("#list-2 .element"))
print(type(soup.select("ul")[0]))

[<div class="panel-heading"><h4>Hello</h4></div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo<ll class="element">Bar</ll></li>]
[<li class="element">Foo<ll class="element">Bar</ll></li>, <ll class="element">Bar</ll>]
<class 'bs4.element.Tag'>

The lines above are the printed output.

from bs4 import BeautifulSoup

html="""
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1 name="elements">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</ll>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</ll>
                <ll class="element">Bar</l>
            </ul>
        </div>
    </div>
"""
soup=BeautifulSoup(html,"lxml")
for ul in soup.select("ul"):
    print(ul.select("li"))

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo<ll class="element">Bar</ll></li>]

The lines above are the printed output.

Getting attributes
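
A few more selector forms, sketched on a shortened and well-formed variant of the markup (an assumption, with a made-up third item Baz): select_one() returns the first match or None, and the > combinator restricts a match to direct children. The built-in html.parser is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
</ul>
<ul class="list list-small" id="list-2">
    <li class="element">Baz</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one("#list-1 .element")  # first match only
small = soup.select("ul.list-small > li")    # direct <li> children of ul.list-small

# Nested selection: run select() again on each <ul>.
per_list = {ul["id"]: [li.get_text() for li in ul.select("li")]
            for ul in soup.select("ul")}

print(first.get_text(), [li.get_text() for li in small], per_list)
```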

from bs4 import BeautifulSoup

html="""
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1" name="elements">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</ll>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
"""
soup=BeautifulSoup(html,"lxml")
for ul in soup.select("ul"):
    print(ul["id"])        # index the tag directly
    print(ul.attrs["id"])  # or go through the .attrs dictionary

list-1
list-1
list-2
list-2

The lines above are the printed output.

Getting content

from bs4 import BeautifulSoup

html="""
    <div class="panel">
        <div class="panel-heading">
            <h4>Hello</h4>
        </div>
        <div class="panel-body">
            <ul class="list" id="list-1" name="elements">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
                <li class="element">Jay</ll>
            </ul>
            <ul class="list list-small" id="list-2">
                <li class="element">Foo</li>
                <li class="element">Bar</li>
            </ul>
        </div>
    </div>
"""
soup=BeautifulSoup(html,"lxml")
for li in soup.select("li"):
    print(li.get_text())

Foo
Bar
Jay

Foo
Bar

The lines above are the printed output.

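Summary:
       ● The lxml parser is recommended; fall back to html.parser when necessary
       ● Tag selection (soup.title, soup.p and so on) is fast but offers only weak filtering
       ● Use find() to match a single result and find_all() to match multiple results
       ● If you are comfortable with CSS selectors, use select()
       ● Remember the common ways to get attribute and text values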
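
The points above can be pulled together in one short sketch. The helper name make_soup and the markup are hypothetical, introduced only for illustration; the fallback relies on bs4 raising FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical markup for the demonstration.
html = '<ul id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>'


def make_soup(markup):
    """Prefer lxml; fall back to the built-in html.parser if lxml is missing."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        return BeautifulSoup(markup, "html.parser")


soup = make_soup(html)

ul = soup.find("ul")                  # single result
list_id = ul["id"]                    # attribute access; raises KeyError if absent
list_name = ul.get("name")            # .get() returns None if the attribute is absent
items = [li.get_text() for li in soup.select("#list-1 .element")]  # CSS selector plus text

print(list_id, list_name, items)
```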

Original article: https://www.cnblogs.com/zhuifeng-mayi/p/9685044.html

Time: 2024-10-23 13:12:23
