While surfing the web, one keeps running into little delights; here, the delights are beautiful photos. Downloading them as you browse is fine, but good flowers do not bloom forever and good scenery does not last; saving images one at a time with "Save As" soon becomes tedious. Can we download them in batches? As long as we can obtain the image URLs, it is not hard.
Goal
太平洋摄影网 (dp.pconline.com.cn) is a fine photography site. If you like natural scenery, it is well worth a long browse; and after feasting your eyes, you may want to pack the photos up and take them home. That is not hard; let's follow the trail step by step (no screenshots here; open the site and see for yourself).
First, open http://dp.pconline.com.cn/list/all_t145.html and plenty of gorgeous thumbnails appear at once. Click any of the links and you land on the first photo of a series: http://dp.pconline.com.cn/photo/3687487.html; click again and you reach the second photo: http://dp.pconline.com.cn/photo/3687487_2.html. Below the photo, clicking "查看原图" (view original image) jumps to http://dp.pconline.com.cn/public/photo/source_photo.jsp?id=19706865&photoId=3687487, which presents a beautiful high-resolution image. Right-click, Save As, and it is on your disk. By now your fingers are probably itching.
Where do we start? If you have done any web development, you know that the browser's console shows the page's HTML, and that the HTML either contains the images directly or links to another HTML page that does. For the case above, http://dp.pconline.com.cn/list/all_t145.html is the entry page of one big theme series (nature is t145, architecture is t292); call it EntryHtml. This entry page contains many links to child HTML pages, each of which is the photo series of a different photographer with a different style under that theme; call such a page SerialHtml. Each SerialHtml in turn links to the first HTML page of every photo in its sub-series, called picHtml. A picHtml page contains a "查看原图" (view original) link pointing to the high-resolution photo page, e.g. http://dp.pconline.com.cn/public/photo/source_photo.jsp?id=19706865&photoId=3687487, called picOriginLink. Finally, inside picOriginLink we find the img element holding the real address of the high-resolution photo, picOrigin. (⊙v⊙) Hmm, this is getting a bit dizzying, so let's summarize:
EntryHtml (theme entry page) -> SerialHtml (series entry page) -> picHtml (per-photo browsing page) -> picOriginLink (high-res photo page) -> picOrigin (real address of the high-res photo)
Now we need to work out how these five levels are connected.
Inspecting the HTML elements shows:
(1) the SerialHtml links are the a elements with class="picLink" in the EntryHtml page;
(2) a picHtml URL is the SerialHtml URL with an index appended: if SerialHtml is http://dp.pconline.com.cn/photo/3687487.html and the series has 8 photos, then picHtml = http://dp.pconline.com.cn/photo/3687487_[1-8].html. Note that http://dp.pconline.com.cn/photo/3687487.html and http://dp.pconline.com.cn/photo/3687487_1.html are equivalent, which is convenient for programming;
(3) "查看原图" (view original) links to the xxx.jsp page that holds the high-res photo: it is the a element with class="aView aViewHD" in the picHtml page;
(4) finally, in the xxx.jsp page, find the img element whose src has an image suffix (see the sketch below).
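To make the chain concrete, here is a minimal sketch that walks from EntryHtml down to the picOrigin of one photo. It assumes the class names and URL patterns above still match the live pages and that the hrefs are absolute; it uses the same requests / BeautifulSoup libraries as the full implementations below.

import re
import requests
from bs4 import BeautifulSoup

def soup_of(url):
    # fetch a page and parse it with lxml
    return BeautifulSoup(requests.get(url).text, "lxml")

entry = soup_of('http://dp.pconline.com.cn/list/all_t145.html')
# (1) EntryHtml -> SerialHtml: <a class="picLink" href="...">
serial_hrefs = [a.attrs['href'] for a in entry.find_all('a', class_='picLink')]
serial = soup_of(serial_hrefs[0])   # equivalent to its _1.html page, per (2)
# (3) picHtml -> picOriginLink: <a class="aView aViewHD" href="...jsp?...">
hd = serial.find('a', class_='aView aViewHD')
origin_page = soup_of(hd.attrs['href'])
# (4) picOriginLink -> picOrigin: the <img> whose src looks like a .jpg
img = origin_page.find('img', src=re.compile(r'\.jpg'))
print(img.attrs['src'])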
So the overall plan is:
STEP1: fetch the content of EntryHtml, entryContent;
STEP2: parse entryContent and collect the a elements with class="picLink" into the list SerialHtmlList;
STEP3: for each page SerialHtml_i in SerialHtmlList:
(1) fetch the page of its first photo and parse out the total number of photos, total;
(2) from total, generate the total photo-page links, picHtmlList;
a. for each page in picHtmlList, find the a element with class="aView aViewHD", hdLink;
b. fetch the page hdLink points to and find its img element, obtaining the real photo address picOrigin;
c. download picOrigin.
Note that a theme spans multiple pages: the first page is EntryHtml, http://dp.pconline.com.cn/list/all_t145.html, and the second page is http://dp.pconline.com.cn/list/all_t145_p2.html; the first page is equivalent to http://dp.pconline.com.cn/list/all_t145_p1.html, which again is convenient for programming. To download the series on all pages of one theme, simply add one more loop at the outermost level. That is the flow of the serial version.
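In code, this page pattern reduces to a simple string template. A minimal sketch, where the theme number and page count are chosen only for illustration:

serial_num = 145   # nature; 292 would be architecture
pages = 2          # assumed page count, for illustration only
entryUrls = ['http://dp.pconline.com.cn/list/all_t%d_p%d.html' % (serial_num, p)
             for p in range(1, pages + 1)]
for url in entryUrls:
    print(url)     # each url is then handed to the per-page download routine
# prints:
# http://dp.pconline.com.cn/list/all_t145_p1.html
# http://dp.pconline.com.cn/list/all_t145_p2.html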
The serial implementation:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import os
import re
import requests
from bs4 import BeautifulSoup

saveDir = os.environ['HOME'] + '/joy/pic/pconline/nature'

def catchExc(func):
    '''Decorator: catch any exception, log it, and return None instead.'''
    def _deco(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            print('error catch exception for %s (%s, %s).' % (func.__name__, str(args), str(kwargs)))
            print(e)
            return None
    return _deco

@catchExc
def getSoup(url):
    '''Fetch the html content of url and parse it into a soup object for later extraction.'''
    result = requests.get(url)
    if result.status_code != 200:
        return None
    return BeautifulSoup(result.text, "lxml")

@catchExc
def parseTotal(soup):
    '''Parse the total number of pics from the tag <span class="totPics"> (1/total)</span>.'''
    totalNode = soup.find('span', class_='totPics')
    return int(totalNode.text.split('/')[1].replace(')', ''))

@catchExc
def buildSubUrl(href, ind):
    '''
    If href is http://dp.pconline.com.cn/photo/3687736.html and total is 10,
    the suburls are http://dp.pconline.com.cn/photo/3687736_[1-10].html,
    each containing the origin href of one picture.
    '''
    return href.rsplit('.', 1)[0] + '_' + str(ind) + '.html'

@catchExc
def download(piclink):
    '''
    Download a pic from its href, e.g.
    http://img.pconline.com.cn/images/upload/upc/tx/photoblog/1610/21/c9/28691979_1477032141707.jpg
    '''
    picsrc = piclink.attrs['src']
    picname = picsrc.rsplit('/', 1)[1]
    saveFile = saveDir + '/' + picname
    picr = requests.get(picsrc, stream=True)
    with open(saveFile, 'wb') as f:
        for chunk in picr.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)

@catchExc
def downloadForASerial(serialHref):
    '''Download a whole serial of pics.'''
    href = serialHref
    subsoup = getSoup(href)
    total = parseTotal(subsoup)
    print('href: %s *** total: %s' % (href, total))
    for ind in range(1, total + 1):
        suburl = buildSubUrl(href, ind)
        print('suburl: ', suburl)
        subsoup = getSoup(suburl)
        hdlink = subsoup.find('a', class_='aView aViewHD')
        picurl = hdlink.attrs['href']
        picsoup = getSoup(picurl)
        piclink = picsoup.find('img', src=re.compile(r'\.jpg'))
        download(piclink)

@catchExc
def downloadAllForAPage(entryurl):
    '''Download all serials listed on one entry page.'''
    soup = getSoup(entryurl)
    if soup is None:
        return
    picLinks = soup.find_all('a', class_='picLink')
    if len(picLinks) == 0:
        return
    hrefs = [link.attrs['href'] for link in picLinks]
    print('serials in a page: ', len(hrefs))
    for serialHref in hrefs:
        downloadForASerial(serialHref)

def downloadEntryUrl(serial_num, index):
    entryUrl = 'http://dp.pconline.com.cn/list/all_t%d_p%d.html' % (serial_num, index)
    print('entryUrl: ', entryUrl)
    downloadAllForAPage(entryUrl)
    return 0

def downloadAll(serial_num):
    start = 1
    end = 2
    return [downloadEntryUrl(serial_num, index) for index in range(start, end + 1)]

if __name__ == '__main__':
    serial_num = 145
    downloadAll(serial_num)
Obviously the serial version is slow: the CPU spends long stretches waiting on network connections and IO. The usual measures for improving performance are:
(1) use multiple threads to isolate the IO-bound operations, so the CPU is not left waiting;
(2) turn one-at-a-time loops into batch operations, to better exploit concurrency;
(3) use multiple processes for CPU-bound operations, to draw on the full power of multiple cores.
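As a taste of measures (1) and (2): multiprocessing.dummy provides a thread-backed Pool with the same API as the process Pool, so a batch of page fetches can run concurrently. A minimal sketch, using the 8-photo example series from earlier:

from multiprocessing.dummy import Pool as ThreadPool  # threads, not processes
import requests

urls = ['http://dp.pconline.com.cn/photo/3687487_%d.html' % i
        for i in range(1, 9)]
pool = ThreadPool(8)                      # 8 workers for the IO-bound fetches
responses = pool.map(requests.get, urls)  # blocks until all fetches finish
pool.close()
pool.join()
print([r.status_code for r in responses])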
Batch concurrent version (it seems to create rather a lot of thread pools and is a little unstable; to be optimized later):
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import os
import re
from multiprocessing.dummy import Pool as ThreadPool

import requests
from bs4 import BeautifulSoup

saveDir = os.environ['HOME'] + '/joy/pic/pconline'
dwpicPool = ThreadPool(20)   # shared pool for downloading pics

def catchExc(func):
    '''Decorator: catch any exception, log it, and return None instead.'''
    def _deco(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            print('error catch exception for %s (%s, %s): %s' % (func.__name__, str(args), str(kwargs), e))
            return None
    return _deco

@catchExc
def batchGetSoups(urls):
    '''Fetch the html content of each url concurrently and parse them into soup objects.'''
    urlnum = len(urls)
    if urlnum == 0:
        return []
    # one worker per url; this pool is torn down after the batch
    getUrlPool = ThreadPool(urlnum)
    results = [getUrlPool.apply_async(requests.get, (url,)) for url in urls]
    getUrlPool.close()
    getUrlPool.join()

    soups = []
    for res in results:
        r = res.get(timeout=1)
        if r.status_code != 200:
            continue
        soups.append(BeautifulSoup(r.text, "lxml"))
    return soups

@catchExc
def parseTotal(soup):
    '''Parse the total number of pics from the tag <span class="totPics"> (1/total)</span>.'''
    totalNode = soup.find('span', class_='totPics')
    return int(totalNode.text.split('/')[1].replace(')', ''))

@catchExc
def buildSubUrl(href, ind):
    '''
    If href is http://dp.pconline.com.cn/photo/3687736.html and total is 10,
    the suburls are http://dp.pconline.com.cn/photo/3687736_[1-10].html,
    each containing the origin href of one picture.
    '''
    return href.rsplit('.', 1)[0] + '_' + str(ind) + '.html'

@catchExc
def downloadPic(piclink):
    '''
    Download a pic from its href, e.g.
    http://img.pconline.com.cn/images/upload/upc/tx/photoblog/1610/21/c9/28691979_1477032141707.jpg
    '''
    picsrc = piclink.attrs['src']
    picname = picsrc.rsplit('/', 1)[1]
    saveFile = saveDir + '/' + picname
    picr = requests.get(picsrc, stream=True)
    with open(saveFile, 'wb') as f:
        for chunk in picr.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)

@catchExc
def getOriginPicLink(subsoup):
    hdlink = subsoup.find('a', class_='aView aViewHD')
    return hdlink.attrs['href']

@catchExc
def downloadForASerial(serialHref):
    '''Download a whole serial of pics, fetching pages in batches.'''
    href = serialHref
    subsoups = batchGetSoups([href])
    total = parseTotal(subsoups[0])
    print('href: %s *** total: %s' % (href, total))

    suburls = [buildSubUrl(href, ind) for ind in range(1, total + 1)]
    subsoups = batchGetSoups(suburls)
    picUrls = [getOriginPicLink(subsoup) for subsoup in subsoups]
    picSoups = batchGetSoups(picUrls)
    piclinks = [picsoup.find('img', src=re.compile(r'\.jpg')) for picsoup in picSoups]
    dwpicPool.map_async(downloadPic, piclinks)

def downloadAllForAPage(entryurl):
    '''Download all serials listed on one entry page.'''
    soups = batchGetSoups([entryurl])
    if len(soups) == 0:
        return
    soup = soups[0]
    picLinks = soup.find_all('a', class_='picLink')
    if len(picLinks) == 0:
        return
    hrefs = [link.attrs['href'] for link in picLinks]
    for serialHref in hrefs:
        downloadForASerial(serialHref)

def downloadAll(serial_num, start, end):
    entryUrl = 'http://dp.pconline.com.cn/list/all_t%d_p%d.html'
    entryUrls = [(entryUrl % (serial_num, ind)) for ind in range(start, end + 1)]
    taskpool = ThreadPool(20)
    taskpool.map_async(downloadAllForAPage, entryUrls)
    taskpool.close()
    taskpool.join()
    # also wait for the shared download pool, so the process does not exit
    # while pics are still being written
    dwpicPool.close()
    dwpicPool.join()

if __name__ == '__main__':
    serial_num = 145
    downloadAll(serial_num, 1, 2)