HTMLParser in Python

As the name suggests, HTMLParser is used to parse HTML files. In Python 2 there are
two HTMLParsers: one is the HTMLParser class defined in the htmllib module
(htmllib.HTMLParser), the other is the HTMLParser class defined in the HTMLParser
module. Let's look at them separately.

htmllib.HTMLParser

This class has been deprecated since Python 2.6, and the htmllib module was removed
in Python 3. Still, it is worth knowing about. This parser is not directly concerned
with I/O: it must be provided with input in string form via a method, and it makes
calls to the methods of a "formatter" object in order to produce output. So
instantiation looks like this:


>>> from cStringIO import StringIO
>>> from formatter import DumbWriter, AbstractFormatter
>>> from htmllib import HTMLParser
>>> parser = HTMLParser(AbstractFormatter(DumbWriter(StringIO())))
>>>

It is rather annoying: all you want to do is parse an HTML file, but you have to
know about a lot of other things, like formatters and I/O streams.
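
Once it is instantiated, though, using the parser is simple: you push HTML at it as
a string and close it when you are done. A minimal sketch (Python 2), continuing from
the parser object created above:

>>> parser.feed('<html><body><p>hello world</p></body></html>')   # parsed text goes to the StringIO via the formatter
>>> parser.close()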

HTMLParser.HTMLParser

In Python 3 this module is renamed to html.parser. It does the same job as
htmllib.HTMLParser, and the good thing is you do not need to import modules like
formatter and cStringIO. For more information see this URL:

https://docs.python.org/2.7/library/htmlparser.html?highlight=htmlparser#HTMLParser

Here is a brief introduction to this module.

Below is some sample code that uses it. You will notice that you do not need the
formatter class or an I/O string class.

>>> from HTMLParser import HTMLParser
>>> class MyHTMLParser(HTMLParser):
...     def handle_starttag(self, tag, attrs):
...             print "Encountered a start tag:", tag
...     def handle_endtag(self, tag):
...             print "Encountered an end tag :", tag
...     def handle_data(self, data):
...             print "Encountered some data  :", data
...
>>> parser = MyHTMLParser()
>>> parser.feed('<html><head><title>Test</title></head><body><h1>Parse me!</h1></body></html>')
Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data  : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data  : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
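
For reference, since the module is renamed to html.parser in Python 3, the same
example there only needs the new import path and print() as a function. A minimal
sketch (the output is the same as above):

>>> from html.parser import HTMLParser
>>> class MyHTMLParser(HTMLParser):
...     def handle_starttag(self, tag, attrs):
...         print("Encountered a start tag:", tag)
...     def handle_endtag(self, tag):
...         print("Encountered an end tag :", tag)
...     def handle_data(self, data):
...         print("Encountered some data  :", data)
...
>>> parser = MyHTMLParser()
>>> parser.feed('<html><head><title>Test</title></head><body><h1>Parse me!</h1></body></html>')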

  

Another point: htmllib.HTMLParser provided two methods, described below.


HTMLParser.anchor_bgn(href, name, type)
This method is called at the start of an anchor region. The arguments correspond to the attributes of the <A> tag with the same names. The default implementation maintains a list of hyperlinks (defined by the HREF attribute for <A> tags) within the document. The list of hyperlinks is available as the data attribute anchorlist.

HTMLParser.anchor_end()
This method is called at the end of an anchor region. The default implementation adds a textual footnote marker using an index into the list of hyperlinks created by anchor_bgn().

With these two methods, htmllib.HTMLParser can easily retrieve URL links from an
HTML file. For example:


>>> from urlparse import urlparse
>>> from formatter import DumbWriter, AbstractFormatter
>>> from cStringIO import StringIO
>>> from htmllib import HTMLParser
>>>
>>> def parseAndGetLinks():
...     parser = HTMLParser(AbstractFormatter(DumbWriter(StringIO())))
...     parser.feed(open(file).read())
...     parser.close()
...     return parser.anchorlist
...
>>> file = '/tmp/a.ttt'
>>> parseAndGetLinks()
['http://www.baidu.com/gaoji/preferences.html', '/', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', '/', 'http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=', 'http://tieba.baidu.com/f?kw=&fr=wwwt', 'http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt', 'http://music.baidu.com/search?fr=ps&key=', 'http://image.baidu.com/i?tn=baiduimage&ct=201326592&lm=-1&cl=2&nc=1&word=', 'http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&word=', 'http://map.baidu.com/m?word=&fr=ps01000', 'http://wenku.baidu.com/search?word=&lm=0&od=0', 'http://www.baidu.com/more/', 'javascript:;', 'javascript:;', 'javascript:;', 'http://shouji.baidu.com/baidusearch/mobisearch.html?ref=pcjg&from=1000139w', 'http://www.baidu.com/gaoji/preferences.html', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'http://news.baidu.com', 'http://tieba.baidu.com', 'http://zhidao.baidu.com', 'http://music.baidu.com', 'http://image.baidu.com', 'http://v.baidu.com', 'http://map.baidu.com', 'javascript:;', 'javascript:;', 'javascript:;', 'http://baike.baidu.com', 'http://wenku.baidu.com', 'http://www.hao123.com', 'http://www.baidu.com/more/', '/', 'http://www.baidu.com/cache/sethelp/index.html', 'http://e.baidu.com/?refer=888', 'http://top.baidu.com', 'http://home.baidu.com', 'http://ir.baidu.com', '/duty/']
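
Note that the example imports urlparse but never uses it. A natural follow-up is to
resolve the relative entries (like '/' and '/duty/') against the page's URL and drop
the javascript: pseudo-links. A rough sketch, assuming (hypothetically) that
/tmp/a.ttt was saved from http://www.baidu.com/:

>>> from urlparse import urljoin
>>> base = 'http://www.baidu.com/'   # hypothetical base URL of the saved page
>>> [urljoin(base, link) for link in parseAndGetLinks() if not link.startswith('javascript:')]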

HTMLParser.HTMLParser does not have these two methods, but that does not matter: we
can define our own.


 1 >>> from HTMLParser import HTMLParser
 2 >>> class myHtmlParser(HTMLParser):
 3 ...     def __init__(self):
 4 ...         HTMLParser.__init__(self)
 5 ...         self.anchorlist = []
 6 ...     def handle_starttag(self, tag, attrs):
 7 ...         if tag == 'a' or tag == 'A':
 8 ...             for t in attrs:
 9 ...                 if t[0] == 'href' or t[0] == 'HREF':
10 ...                     self.anchorlist.append(t[1])
11 ...
12 >>> file = '/tmp/a.ttt'
13 >>> parser = myHtmlParser()
14 >>> parser.feed(open(file).read())
15 >>> parser.anchorlist
16 ['http://www.baidu.com/gaoji/preferences.html', '/', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', '/', 'http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=', 'http://tieba.baidu.com/f?kw=&fr=wwwt', 'http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt', 'http://music.baidu.com/search?fr=ps&key=', 'http://image.baidu.com/i?tn=baiduimage&ct=201326592&lm=-1&cl=2&nc=1&word=', 'http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&word=', 'http://map.baidu.com/m?word=&fr=ps01000', 'http://wenku.baidu.com/search?word=&lm=0&od=0', 'http://www.baidu.com/more/', 'javascript:;', 'javascript:;', 'javascript:;', 'http://shouji.baidu.com/baidusearch/mobisearch.html?ref=pcjg&from=1000139w', 'http://www.baidu.com/gaoji/preferences.html', 'https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'https://passport.baidu.com/v2/?reg&regType=1&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F', 'http://news.baidu.com', 'http://tieba.baidu.com', 'http://zhidao.baidu.com', 'http://music.baidu.com', 'http://image.baidu.com', 'http://v.baidu.com', 'http://map.baidu.com', 'javascript:;', 'javascript:;', 'javascript:;', 'http://baike.baidu.com', 'http://wenku.baidu.com', 'http://www.hao123.com', 'http://www.baidu.com/more/', '/', 'http://www.baidu.com/cache/sethelp/index.html', 'http://e.baidu.com/?refer=888', 'http://top.baidu.com', 'http://home.baidu.com', 'http://ir.baidu.com', '/duty/']
17 >>>

Let's look at the second piece of code.

Lines 3 to 5 override the __init__ method. The key point of this override is to add
a new attribute, anchorlist, to our instance.

Lines 6 to 10 override the handle_starttag method. First an if statement checks what
the tag is; if it is 'a' or 'A', a for loop walks through the tag's attributes,
retrieves the href attribute, and appends its value to anchorlist.
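
Incidentally, HTMLParser lowercases tag and attribute names before calling
handle_starttag, so the uppercase checks are not strictly needed. A slightly tighter
sketch of the same handler:

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive already lowercased
        if tag == 'a':
            href = dict(attrs).get('href')   # attrs is a list of (name, value) pairs
            if href is not None:
                self.anchorlist.append(href)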

And that's it.
