[Python practice] Image crawler with BeautifulSoup4

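Python 3 couldn't run Scrapy when I tried!

[Important things bear repeating. Word at the time was that the experts were still porting Scrapy to Python 3, and I burned half an hour on pip install scrapy = - = (Scrapy has officially supported Python 3 since version 1.1.)]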

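Matching the right <img> tags with regular expressions, as I did before, was a real pain: get the pattern slightly wrong and the whole thing falls apart. bs4 feels much more convenient.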

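bs4 parses the whole HTML document into a tree of tag objects that can be read much like dictionaries (for attributes) and lists (for search results), so it is fairly simple to work with.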
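Here is a minimal sketch of what that dictionary-like and list-like access looks like. The HTML snippet is invented for illustration, and the built-in "html.parser" backend is used so the example runs without lxml installed:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, made up for this illustration
sample_html = """
<div>
  <a class="a" href="/p/1"><img src="http://example.com/1.jpg"></a>
  <a class="a" href="/p/2"><img src="http://example.com/2.jpg"></a>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

links = soup.find_all("a")      # search results behave like a list
first_link = links[0]
href = first_link["href"]       # attributes are read like dictionary keys
attrs = first_link.attrs        # the underlying attribute dictionary
classes = first_link["class"]   # class is multi-valued, so it comes back as a list

print(len(links), href, classes)
```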

Take this page as an example (Duitang is my favorite, after all): http://www.duitang.com/search/?kw=%E6%96%87%E8%B1%AA%E9%87%8E%E7%8A%AC&type=feed#!s-p1

To download the images I want, the end goal is each image's URL.

First, look at the page source:

1. Read the page source:

# m is the start page and x the loop offset, so m + x is the current page number
html_doc = urllib.request.urlopen(url + "#!s-p" + str(m + x)).read().decode('utf-8')
soup = BeautifulSoup(html_doc, "lxml")
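One caveat worth checking before relying on the page suffix: everything after # in a URL is a fragment, which stays on the client side, and urllib.request drops it before sending the request. The sketch below uses only the standard library to show what the requested URL looks like for each page. Whether Duitang exposes paging through some other query parameter is not covered in this post, so that part is left open.

```python
from urllib.parse import urldefrag

base = "http://www.duitang.com/search/?kw=%E6%96%87%E8%B1%AA%E9%87%8E%E7%8A%AC&type=feed"

# Build the per-page URLs the same way the snippet above does
page_urls = [base + "#!s-p" + str(page) for page in (1, 2, 3)]

# urldefrag splits a URL into (url_without_fragment, fragment)
requested = [urldefrag(u).url for u in page_urls]
fragments = [urldefrag(u).fragment for u in page_urls]

print(requested[0] == requested[1] == requested[2])
print(fragments)
```

If the requested URLs all come out identical, every loop iteration downloads the same HTML, which is worth keeping in mind when checking the downloaded files for duplicates.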

2. As the page source above shows, every image I want sits inside an <a> tag with class="a". bs4 finds those <a> tags with this line:

soup.find_all('a', class_='a')
# soup.find_all('tag name', attribute filters)
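To see what the class_ filter does and does not match, here is a small sketch on made-up HTML (the tags and URLs are hypothetical, and "html.parser" is used so lxml is not required):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, made up for this illustration
sample_html = """
<a class="a" href="/p/1"><img src="http://example.com/1.jpg"></a>
<a class="b" href="/p/2"><img src="http://example.com/2.jpg"></a>
<a class="a big" href="/p/3"><img src="http://example.com/3.jpg"></a>
<a href="/p/4">no class</a>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# class_ matches any tag whose class list contains "a",
# so the tag with class="a big" is included as well
matched = soup.find_all("a", class_="a")
matched_hrefs = [tag["href"] for tag in matched]

# The same filter written with an explicit attrs dictionary
same = soup.find_all("a", attrs={"class": "a"})
same_hrefs = [tag["href"] for tag in same]

print(matched_hrefs)
```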

3. Find the <img> tag inside each of them and store its link address in img_src:

for myimg in soup.find_all('a', class_='a'):
    img_src = myimg.find('img').get('src')
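Note that find('img') returns None when a link contains no image, and calling .get on None raises AttributeError. A slightly more defensive version of the loop might look like the sketch below. The sample HTML and the page_url value are assumptions for illustration, and urljoin is used to turn protocol-relative or relative src values into full URLs:

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Assumed base URL of the page being parsed (illustrative only)
page_url = "http://www.duitang.com/search/"

# Hypothetical HTML, made up for this illustration
sample_html = """
<a class="a" href="/p/1"><img src="http://example.com/1.jpg"></a>
<a class="a" href="/p/2">text only, no image</a>
<a class="a" href="/p/3"><img src="//example.com/3.jpg"></a>
"""

soup = BeautifulSoup(sample_html, "html.parser")

img_srcs = []
for myimg in soup.find_all("a", class_="a"):
    img = myimg.find("img")
    if img is None:
        continue                      # this link has no <img> child
    src = img.get("src")
    if src:
        # Resolve relative and protocol-relative addresses against the page URL
        img_srcs.append(urljoin(page_url, src))

print(img_srcs)
```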

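Judging from step 2, this really does save time and effort compared with using regular expressions alone.

The full code is below; really only the small regex part changed: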

from bs4 import BeautifulSoup
import urllib.request
import os

def downloadimg(url, m, n):
    # Save everything into a 'photos' subfolder, creating it if needed
    os.makedirs('photos', exist_ok=True)
    os.chdir('photos')
    t = 1  # running count of images

    for page in range(m, n + 1):  # pages m through n, inclusive
        html_doc = urllib.request.urlopen(url + "#!s-p" + str(page)).read().decode('utf-8')
        soup = BeautifulSoup(html_doc, "lxml")

        for myimg in soup.find_all('a', class_='a'):
            img = myimg.find('img')
            if img is None or not img.get('src'):
                continue  # skip links that contain no image
            pic_name = str(t) + '.jpg'
            img_src = img.get('src')
            urllib.request.urlretrieve(img_src, pic_name)
            print("Success!" + img_src)
            t += 1
        print("Next page!")

downloadimg("http://www.duitang.com/search/?kw=%E6%96%87%E8%B1%AA%E9%87%8E%E7%8A%AC&type=feed", 1, 3)
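The script names every file N.jpg regardless of the image's real format. One possible refinement, sketched under the assumption that the image URL's path ends in a usable extension (the helper name pic_name_for is hypothetical and not part of the script above):

```python
import os
from urllib.parse import urlsplit

def pic_name_for(index, img_src, default_ext=".jpg"):
    """Build a file name like '3.png' from a counter and an image URL."""
    # Use only the path component so query strings do not leak into the name
    path = urlsplit(img_src).path
    ext = os.path.splitext(path)[1].lower()
    # Fall back to the default when the URL carries no extension
    return str(index) + (ext if ext else default_ext)

name_png = pic_name_for(1, "http://example.com/a/b.PNG?x=1")
name_default = pic_name_for(2, "http://example.com/img")
print(name_png, name_default)
```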

As in the previous post, I added two parameters for the start page and the end page.
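The start/end page arithmetic can be sketched on its own. The helper name page_numbers is hypothetical; it only shows an inclusive range from the start page m to the end page n:

```python
def page_numbers(m, n):
    """Return the page numbers from m to n, inclusive."""
    if m > n:
        raise ValueError("start page must not exceed end page")
    # range() excludes its upper bound, hence n + 1
    return list(range(m, n + 1))

pages = page_numbers(1, 3)
print(pages)
```

Iterating over the page numbers directly, rather than over an offset, avoids off-by-one mistakes when converting the loop variable back into a page number.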

The folder after downloading:

P.S. Dazai-san is just too adorable (●'◡'●)♥ Enough said, I'm off to watch it again~

Time: 2024-12-25 14:53:38
