python3: Web Scraping with urllib and BeautifulSoup

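I have been learning web scraping in the evenings lately, starting with the basics.

Python 3 merged Python 2's urllib and urllib2 into a single urllib package. urllib can request and download a given web page, and BeautifulSoup can pick the parts we need out of messy HTML.

Note: BeautifulSoup is a Python library for extracting data from HTML or XML files.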
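Before touching the network, the extraction step can be tried on a string. The following is a minimal sketch, assuming the bs4 package is installed; the HTML snippet, the example.com links, and the helper name extract_links are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page.
html = """
<html><body>
  <h3><a href="https://example.com/post/1">First post</a></h3>
  <h3><a href="https://example.com/post/2">Second post</a></h3>
  <p>Unrelated text</p>
</body></html>
"""

def extract_links(markup):
    """Return (href, title) pairs for every <h3> heading that wraps a link."""
    soup = BeautifulSoup(markup, "html.parser")
    pairs = []
    for h3 in soup.find_all("h3"):
        if h3.a is not None:  # skip headings that contain no <a> tag
            pairs.append((h3.a["href"], str(h3.a.string)))
    return pairs

links = extract_links(html)
for href, title in links:
    print(href, title)
```

Only the h3 links are returned; the unrelated paragraph is ignored, which is the "pick the parts we need" step in miniature.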

Example 1:

from urllib import request
from bs4 import BeautifulSoup as bs
import re

# Send a browser-like User-Agent so the request looks like a normal browser visit.
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36'
}

def download():
    """
    Request the cnblogs front page while imitating a browser, then save
    every post linked from an <h3> heading as a local HTML file.
    """
    for pageIdx in range(1, 3):
        # Note: "#p1" is a URL fragment. Fragments are not sent to the
        # server, so each iteration requests the same front page.
        url = "https://www.cnblogs.com/#p%s" % pageIdx
        print(url)
        req = request.Request(url, headers=header)
        data = request.urlopen(req).read().decode('utf-8')
        print(data)
        # Name the parser explicitly; otherwise bs4 picks one and warns.
        content = bs(data, 'html.parser')
        for link in content.find_all('h3'):
            a = link.a
            if a is None or not a.get('href') or not a.string:
                continue  # skip headings without a usable link
            print(a['href'], a.string)
            page = request.urlopen(request.Request(a['href'], headers=header)).read()
            # Replace characters that are not allowed in file names.
            name = re.sub(r'[\\/:*?"<>|]', '_', a.string.strip())
            with open('%s.html' % name, 'w', encoding='utf-8') as f:
                f.write(page.decode('utf-8'))

if __name__ == "__main__":
    download()
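One detail of Example 1 is worth checking: the page URLs differ only after the "#". That part is a fragment, which HTTP clients do not send to the server. The sketch below uses only the standard library to show what is actually requested; the helper name requested_url is made up for illustration, and how the site really paginates is not stated here.

```python
from urllib.parse import urldefrag

def requested_url(url):
    """Return the URL without its fragment, i.e. the part a client sends."""
    return urldefrag(url).url

# The same URLs that the loop in Example 1 builds.
page_urls = ["https://www.cnblogs.com/#p%s" % i for i in range(1, 3)]

# Collapse them to the distinct addresses that would really be requested.
unique_targets = {requested_url(u) for u in page_urls}
print(unique_targets)
```

Both loop iterations collapse to a single address, so fetching a different page would need a URL that differs in its path or query string.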

Example 2:

# -*- coding: utf-8 -*-
import unittest
import lxml  # not used directly; imported so a missing lxml install fails early
import requests
from bs4 import BeautifulSoup as bs

def school():
    for index in range(2, 34):
        try:
            url = "http://gaokao.chsi.com.cn/gkxx/zszcgd/dnzszc/201706/20170615/1611254988-%s.html" % index
            r = requests.get(url=url)
            soup = bs(r.content, 'lxml')
            # The first cell spanning 7 columns supplies the output file name.
            city = soup.find_all(name="td", attrs={"colspan": "7"})[0].string
            with open("%s.txt" % city, "w", encoding="utf-8") as fp:
                # Each data row is a <tr height="29">; keep the text of its second cell.
                for row in soup.find_all(name="tr", attrs={"height": "29"}):
                    try:
                        text = row.find_all(name="td")[1].string
                    except IndexError:
                        continue  # row has fewer than two cells
                    if text is None:
                        continue  # cell contains nested tags, so .string is None
                    fp.write(text + "\n")
                    print(text)
        except IndexError:
            pass  # page has no 7-column cell; skip it

class MyTestCase(unittest.TestCase):
    def test_something(self):
        school()

if __name__ == '__main__':
    unittest.main()
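The row-and-cell lookup in Example 2 can be tried offline on a small table. This is a sketch under assumptions: the table contents and the helper name second_cells are invented, and "html.parser" is used in place of "lxml" so that no extra package is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical table shaped like the one Example 2 expects.
html = """
<table>
  <tr><td colspan="7">Sample Region</td></tr>
  <tr height="29"><td>1</td><td>Alpha University</td><td>x</td></tr>
  <tr height="29"><td>2</td><td>Beta College</td><td>y</td></tr>
  <tr height="29"><td>only one cell</td></tr>
</table>
"""

def second_cells(markup):
    """Return the 7-column title and the second-cell text of each height-29 row."""
    soup = BeautifulSoup(markup, "html.parser")
    title = soup.find("td", attrs={"colspan": "7"}).string
    values = []
    for row in soup.find_all("tr", attrs={"height": "29"}):
        cells = row.find_all("td")
        # Skip rows that are too short or whose cell has no single string.
        if len(cells) > 1 and cells[1].string is not None:
            values.append(str(cells[1].string))
    return str(title), values

title, values = second_cells(html)
print(title, values)
```

Checking the cell count up front plays the same role as catching IndexError in Example 2, but makes the skipped case explicit.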

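BeautifulSoup supports many HTML parsers (the main ones are listed below):

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	(1) Built into Python (2) Decent speed (3) Lenient with malformed documents	Not very lenient in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	(1) Very fast (2) Lenient with malformed documents	Requires an external C library
lxml XML parser	BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")	(1) Very fast (2) The only XML parser supported	Requires an external C library
html5lib	BeautifulSoup(markup, "html5lib")	(1) Most lenient (2) Parses pages the same way a browser does (3) Produces valid HTML5	(1) Very slow (2) Requires an external Python package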
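Since lxml and html5lib are optional installs, a script may want to fall back to the built-in parser when they are missing. The following is one possible sketch, not a recommended order; the helper name make_soup and the preference tuple are assumptions. It relies on bs4 raising FeatureNotFound when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Returns a (soup, parser_name) pair so the caller can see which one was used.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError("none of the requested parsers is available")

soup, used = make_soup("<p>hi</p>")
print(used, soup.p.string)
```

Because "html.parser" ships with Python, keeping it last in the tuple means the loop always finds a usable parser.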

Original article: https://www.cnblogs.com/yinwei-space/p/9320640.html

Time: 2024-11-05 22:00:38
