首先说一下,我用的是python 2.7,刚好在学Python,今天想去爬点图片当壁纸,但是当我用 SGMLParser 做 <img> 标签解析的时候,发现我想要的那部分根本没获取到,我尝试用 lxml 修复网页,还是解析不出..但是当我把此部分字段单独提出来时,我却可以将此部分标签解析出来,实在无法解决这个问题...先将问题放在这里,用正则表达式去匹配好了..如果有遇到过此问题的前辈请务必告诉我..我的邮箱是 [email protected]
这是源网站:http://mcyacg.com/m60948/
<div class="quote"><blockquote><font size="5"><font color="Pink">P站引言:</font></font>正好可以放在手心里的红色果实——“苹果”。红色果实切开后看起来像心形的苹果,在白雪公主、亚当与夏娃等故事中都是作为诱惑的象征登场。艳丽的红色外皮包裹着香甜多汁的清脆果实,或许有着一种不可思议的魅力吧。<br/> 今天,就为大家送上描绘了“苹果”的插画作品特辑。敬请欣赏这些仿佛能听见咬下新鲜苹果时的清脆声音的插画作品。</blockquote></div><br/> <a href="http://pan.baidu.com/s/1mhJ4ti4" target="_blank"><font size="5">下载</font></a><br/> <img id="aimg_IPKD8" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i4.qlshw.org/08859a09d120090cfff30152010130c7.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/> <img id="aimg_Sv2k2" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i4.qlshw.org/db3a060b649a422a701dd47982f9cbe5.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/> <img id="aimg_Ob8zD" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i4.qlshw.org/8fd5b9b5f4706b17e71c00939c75f648.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/> <img id="aimg_MWOuz" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i2.qlshw.org/7b5f4a94fff33ea1a7cac45131f2ba41.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/> <img id="aimg_nG9jr" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i3.qlshw.org/4c0a17365342ef700c68c4e4caada0e0.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/> <img id="aimg_J790D" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i2.qlshw.org/a1f2a3486ce679f007abea46782a33b7.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/> <img id="aimg_v6mTz" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i4.qlshw.org/8d37e5d40f34c03180080135e8757bc8.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/> <img id="aimg_wzFQq" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i2.qlshw.org/7feed5d205b6811d5d1366dd495a0760.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/> <img id="aimg_xWlS5" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i2.qlshw.org/413f8d116e31174451032abfa72c1246.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/> <img id="aimg_O8T03" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i4.qlshw.org/f654b77544edfeee9cda4d069e704c90.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/> <img id="aimg_PfGhH" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i2.qlshw.org/e6fb8b6eafd5ef5dea2a13a284ba8309.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/> <img id="aimg_tZEBu" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i3.qlshw.org/46d344876774e7dcb059ef84e2fc70f7.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/> <img id="aimg_PnP6y" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i1.qlshw.org/9c6ab03cffb678a0945dcb0da127ea63.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/> <img id="aimg_H01fi" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i1.qlshw.org/d0bf9d03f427a730b40a29bfebc9697a.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/> <img id="aimg_j1pqX" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i3.qlshw.org/36074139da9039d1d4c0d1042f6b1b8c.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/> <img id="aimg_xaHP0" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i3.qlshw.org/a40f503868cdd657531cbea34adf55e6.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/> <img id="aimg_wE44O" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i1.qlshw.org/0a5d24a51f5c4ad0041d8feec5b5fe9a.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/> <img id="aimg_T50cd" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i3.qlshw.org/b9f666b571221894bd4d922d369fce5d.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/> <img id="aimg_C4o7y" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i4.qlshw.org/114b3eff9da458cfbfb52d08160fd30a.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/> <img id="aimg_m2c3s" onclick="zoom(this, this.src, 0, 0, 0)" class="zoom" file="http://i3.qlshw.org/ba3265f867650ef35891f6cb09a7a196.jpg" onmouseover="img_onmouseoverfunc(this)" lazyloadthumb="1" border="0" alt=""/><br/> <br/>
这部分是我想要提取的元素.
from sgmllib import SGMLParser class LableParse(SGMLParser): def reset(self): SGMLParser.reset(self) self.level = 0 self.flag = False self.picturesrc=[] def start_div(self,attrs): if self.flag == True: #遇到子层 level+1 self.level+=1 for k,v in attrs: if k==‘class‘ and v==‘pct‘: self.flag = True self.level+=1 #自己加一层 def end_div(self): if self.level == 0: self.flag = False if self.flag == True: #退出DIV子层的时候level-1 self.level-=1 def start_img(self,attrs): #if self.flag == True: for k,v in attrs: print ‘{%s : %s}‘%(k,v) if __name__ == ‘__main__‘: lp = LableParse() lp.feed(open(‘source.txt‘).read())
这部分是我继承自 SGMLParser 的一个类..
时间: 2024-10-13 12:43:54