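Python 2 ships three standard-library modules that can parse HTML text: HTMLParser, sgmllib, and htmllib. They are implemented differently but offer much the same functionality. The parser classes these modules provide are all base classes that do no real work themselves. When they encounter an element (a tag, a comment, a declaration, and so on) they call the corresponding handler method, and you must override those methods, because the base-class versions do nothing.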
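The callback model described above can be sketched as follows. This is a minimal illustration, not the article's own code: it assumes Python 3, where the HTMLParser module is available as html.parser, and the names EventLogger and log_events are made up for the example.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record every callback the parser fires (illustrative name)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called once for each opening tag; tag names arrive lowercased.
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        # Called once for each closing tag.
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Called for text between tags; whitespace-only chunks are skipped.
        text = data.strip()
        if text:
            self.events.append(("data", text))

    def handle_comment(self, data):
        # Called with the text inside <!-- ... -->.
        self.events.append(("comment", data.strip()))


def log_events(html):
    """Feed html to a fresh parser and return the recorded events."""
    parser = EventLogger()
    parser.feed(html)
    parser.close()
    return parser.events


if __name__ == "__main__":
    for event in log_events("<p>hello <!-- note --><b>world</b></p>"):
        print(event)
```

A handler that is not overridden simply does nothing, which is why a subclass only needs to define the callbacks it cares about.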
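Use Python's built-in HTMLParser module to parse the HTML file below.
Requirements: 1. Extract the name, CVE number, and risk level of every vulnerability.
2. Show each vulnerability as a separate record instead of lumping them all together.
3. Keep only the high-risk vulnerabilities.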
<html>
<head>
<title>search</title>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<LINK href="include/bbs.css" rel=stylesheet>
</head>
<body bgcolor="#ffffff" text="#000000" leftmargin="0" topmargin="0"><br>
<div id="Layer2" style="position:absolute; left:25%; top:99px; width:71%; height:265px; z-index:2; overflow: auto" class="bordernobackground">
<table width="100%" border="0" height="29" align="center" cellspacing="1" cellpadding="1" bordercolordark="#FFFFFF" bordercolorlight="#000000" class="a2">
  <tr class="a1" height="22">
    <td width="9%" class="a8">ID</td>
    <td class="a8">检测名称</td>
    <td width="14%" class="a8">CVE号</td>
    <td width="20%" class="a8">检测类别</td>
    <td width="15%" class="a8">风险级别</td>
  </tr>
  <tr class="a1" height="22">
    <td class="a9">1</td>
    <td class="a9"> <a href="javascript:openwindow(0);"> FTP缓冲区溢出</a> </td>
    <td class="a9"> <a href='http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-1999-0789' target='_blank'> CVE-1999-0789</a> </td>
    <td class="a9"> FTP测试 </td>
    <td class="a9"> <font color=#FF00FF>高风险</font> </td>
  </tr>
  <tr class="a1" height="22">
    <td class="a9">2</td>
    <td class="a9"> <a href="javascript:openwindow(2);"> AFS客户版本</a> </td>
    <td class="a9"> </td>
    <td class="a9"> 信息获取测试 </td>
    <td class="a9"> <font color=#00CC00>信息</font> </td>
  </tr>
  <tr class="a1" height="22">
    <td class="a9">1</td>
    <td class="a9"> <a href="javascript:openwindow(1);"> ACC 路由器无需认证显示配置信息</a> </td>
    <td class="a9"> <a href='http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-1999-0383' target='_blank'> CVE-1999-0383</a> </td>
    <td class="a9"> 网络设备测试 </td>
    <td class="a9"> <font color=#FFCC00>中风险</font> </td>
  </tr>
  <tr class="a1" height="22">
    <td class="a9">3</td>
    <td class="a9"> <a href="javascript:openwindow(17);"> Knox Arkeia 缓冲区溢出</a> </td>
    <td class="a9"> <a href='http://cve.mitre.org/cgi-bin/cvename.cgi?name=CAN-1999-1534' target='_blank'> CAN-1999-1534</a> </td>
    <td class="a9"> 杂项测试 </td>
    <td class="a9"> <font color=#FF00FF>高风险</font> </td>
  </tr>
</table>
</div>
</body>
</html>
Python program
html_get.py
# -*- coding: utf-8 -*-
import sys
import HTMLParser


class CustomParser(HTMLParser.HTMLParser):
    '''
    An HTMLParser subclass that overrides only the handlers it needs.
    '''
    selected = ('table', 'div', 'tr', 'td', 'a', 'font')  # tags to track
    selected_a = ['div/table/tr/td/a']        # paths whose <a> text is collected
    selected_font = ['div/table/tr/td/font']  # paths whose <font> text is collected

    def reset(self):
        # HTMLParser.__init__ calls reset(), so per-instance state is set up here.
        HTMLParser.HTMLParser.reset(self)
        self._level_stack = []
        self.cve_list = []    # all records: [[name, CVE, risk], ...]
        self.single_cve = []  # the record currently being collected

    def handle_starttag(self, tag, attrs):
        if tag in self.selected:
            self._level_stack.append(tag)

    def handle_endtag(self, tag):
        if self._level_stack and tag in self.selected and tag == self._level_stack[-1]:
            self._level_stack.pop()

    def handle_data(self, data):
        # Collect the wanted text into a list, with one small list per vulnerability,
        # e.g. [[name, CVE, risk], [name, CVE, risk]]. Every row in the HTML is
        # gathered here; filtering by risk level happens later.
        path = "/".join(self._level_stack)
        if path in self.selected_a:
            text = data.decode('gbk').encode('utf-8').strip()
            print self._level_stack, text
            self.single_cve.append(text)
        elif path in self.selected_font and self.single_cve:
            # The risk cell is the last field of a row, so it closes the record.
            text = data.decode('gbk').encode('utf-8').strip()
            print self._level_stack, text
            self.single_cve.append(text)
            self.cve_list.append(self.single_cve)
            self.single_cve = []


if __name__ == '__main__':
    # Read the file, then print only the records marked as high risk.
    try:
        fd = open('test.html', 'r')
    except Exception, error:
        print error
        sys.exit(1)
    html_string = fd.read()
    fd.close()
    ht = CustomParser()
    ht.feed(html_string)
    for item in ht.cve_list:
        if item[-1] == '高风险':
            print ', '.join(item)
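For readers on Python 3, the same tag-stack idea can be sketched roughly as below. This is an assumption-laden port rather than the author's program: it uses html.parser instead of the Python 2 HTMLParser module, works on an already-decoded string (so no gbk decoding step), and the names RowParser, high_risk_rows, and SAMPLE are invented for the example. SAMPLE is a cut-down version of the table above, and the paths start with div because the div encloses the table in that markup.

```python
from html.parser import HTMLParser

# A trimmed copy of the sample table: one high-risk row, one informational row.
SAMPLE = """
<div><table>
<tr><td>1</td><td><a href="javascript:openwindow(0);"> FTP缓冲区溢出</a></td>
<td><a href='#' target='_blank'> CVE-1999-0789</a></td>
<td><font color=#FF00FF>高风险</font></td></tr>
<tr><td>2</td><td><a href="javascript:openwindow(2);"> AFS客户版本</a></td>
<td> </td><td><font color=#00CC00>信息</font></td></tr>
</table></div>
"""


class RowParser(HTMLParser):
    """Collect the <a> and <font> text of each table row (illustrative name)."""

    TRACKED = ("div", "table", "tr", "td", "a", "font")
    LINK_PATH = "div/table/tr/td/a"      # name and CVE cells
    RISK_PATH = "div/table/tr/td/font"   # risk cell, last field of a row

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tracked tags, outermost first
        self.row = []     # fields of the row being collected
        self.rows = []    # finished rows

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # The stack only ever holds tracked tags, so equality is enough.
        if self.stack and tag == self.stack[-1]:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        path = "/".join(self.stack)
        if path == self.LINK_PATH:
            self.row.append(text)
        elif path == self.RISK_PATH and self.row:
            # The risk cell closes the record.
            self.row.append(text)
            self.rows.append(self.row)
            self.row = []


def high_risk_rows(html, label="高风险"):
    """Return only the rows whose last field equals label."""
    parser = RowParser()
    parser.feed(html)
    parser.close()
    return [row for row in parser.rows if row[-1] == label]


if __name__ == "__main__":
    for row in high_risk_rows(SAMPLE):
        print(", ".join(row))
```

Rows without a CVE link produce a two-element record (name and risk), so code that consumes the result should index the risk level as the last field rather than as the third one.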
Reference: http://crquan.blogbus.com/logs/8269701.html
Parsing HTML documents in Python with the HTMLParser module
Time: 2024-11-05 11:27:08