Fetching images with a Python crawler

import re
import os
import urllib.request

# Fetch the page at the given URL; the returned html is the page's source code
def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    return html.decode('UTF-8')

def getImg(html):
    reg = r'src="(.+?\.jpg)" pic_ext'
    imgre = re.compile(reg)
    imglist = imgre.findall(html)  # filter out the addresses of all images in the page and store them in imglist
    x = 0
    path = 'D:\\test'
    # Save the images to the D:\test folder, creating it if it does not exist
    if not os.path.isdir(path):
        os.makedirs(path)
    paths = path + '\\'  # files are saved under the test path

    for imgurl in imglist:
        # Download each image URL in imglist and save it locally; format builds the file name
        urllib.request.urlretrieve(imgurl, '{}{}.jpg'.format(paths, x))
        x = x + 1
    return imglist
html = getHtml("http://tieba.baidu.com/p/2460150866")  # fetch the page at this URL; html is the page's source code
print(getImg(html))  # parse the page source, then download and save the images
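
The extraction step in getImg can be tried without any network access. The following is a minimal sketch, not part of the original script: it reuses the same regular expression, while the helper name extract_image_urls, the sample markup, and the example.com URLs are made up for illustration. It also resolves relative paths with urljoin and drops repeated addresses, two things the original script does not do.

```python
import re
from urllib.parse import urljoin

# Same pattern as in getImg: a src="....jpg" attribute followed by pic_ext.
IMG_PATTERN = re.compile(r'src="(.+?\.jpg)" pic_ext')


def extract_image_urls(html, base_url):
    """Return absolute .jpg URLs matched by IMG_PATTERN, in page order, without duplicates.

    Hypothetical helper for illustration; it is not part of the original script.
    """
    seen = set()
    urls = []
    for match in IMG_PATTERN.findall(html):
        absolute = urljoin(base_url, match)  # turn relative paths into full URLs
        if absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls


# Hypothetical sample markup, assumed to resemble the page structure the regex targets.
sample_html = (
    '<img class="BDE_Image" src="http://example.com/a.jpg" pic_ext="jpeg">'
    '<img src="/static/b.jpg" pic_ext="jpeg">'
    '<img src="http://example.com/a.jpg" pic_ext="jpeg">'
    '<img src="http://example.com/icon.png">'
)

print(extract_image_urls(sample_html, "http://example.com/p/1"))
```

Because the pattern depends on the pic_ext attribute following src, it only matches pages that use that markup; on other sites the expression would need to be adapted, or replaced with an HTML parser.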

Original article: https://www.cnblogs.com/Optimism/p/10183612.html

Time: 2024-08-01 06:36:43
