Parsing HTML and XHTML

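The HTMLParser module
    The usual way to parse an HTML file with the HTMLParser module is to define a
    subclass of HTMLParser and, in that subclass, override the parent's handler
    methods, such as handle_starttag(), handle_data(), and handle_endtag(), to
    process the tags and text you care about.
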
    Example:
        Extract the text inside the <head> tag of htmlsample.html.
            Contents of htmlsample.html:

                <html>
                <head><title>404 Not Found</title></head>
                <body bgcolor="white">
                <center><h1>404 Not Found</h1></center>
                <hr><center>nginx/1.12.2</center>
                </body>
                </html>

        from html.parser import HTMLParser

        class ParsingHeadT(HTMLParser):
            def __init__(self):
                super().__init__()
                self.headtag = ''
                self.parsesemaphore = False    # True while the parser is inside <head>

            def handle_starttag(self, tag, attrs):  # enable semaphore
                if tag == 'head':
                    self.parsesemaphore = True

            def handle_data(self, data):            # keep the latest text seen inside <head>
                if self.parsesemaphore:
                    self.headtag = data

            def handle_endtag(self, tag):           # disable semaphore
                if tag == 'head':
                    self.parsesemaphore = False

            def getheadtag(self):
                return self.headtag

        if __name__ == "__main__":
            with open('htmlsample.html') as FH:
                pht = ParsingHeadT()
                pht.feed(FH.read())    # feed() invokes the overridden handlers:
                                       # handle_starttag, handle_data and handle_endtag
                print("Head Tag : %s" % pht.getheadtag())

        Output:
            Head Tag : 404 Not Found

    The example above parses a small, well-formed HTML document. Real-world input
    raises further cases that have to be handled, for example character entities
    such as &copy; (the copyright sign) and &amp; (the ampersand).
        Previously, these had to be handled by overriding the parent's handle_entityref() method:
            HTMLParser.handle_entityref(name)
                This method is called to process a named character reference of the form
                &name; (e.g. &gt;), where name is a general entity reference (e.g. 'gt').
                This method is never called if convert_charrefs is True.
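
To make the older approach concrete, here is a minimal sketch that resolves named entities by overriding handle_entityref(). The class name EntityCollector and its text() method are hypothetical, introduced only for illustration. The parser has to be created with convert_charrefs=False, otherwise the method is never called.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class EntityCollector(HTMLParser):
    """Hypothetical helper: rebuilds the text, resolving named entities by hand."""

    def __init__(self):
        # convert_charrefs=False makes the parser call handle_entityref().
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        # name is the text between '&' and ';', e.g. 'copy' or 'amp'.
        codepoint = name2codepoint.get(name)
        if codepoint is None:
            # Unknown names are put back unchanged rather than guessed at.
            self.parts.append("&%s;" % name)
        else:
            self.parts.append(chr(codepoint))

    def text(self):
        return "".join(self.parts)


collector = EntityCollector()
collector.feed("<p>404 &copy; Not &amp; Found</p>")
collector.close()
entity_text = collector.text()
print(entity_text)
```

The lookup table name2codepoint comes from html.entities, the same module the Python documentation example in the appendix uses.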

    Numeric character references, written in decimal or hexadecimal, also need attention:
        HTMLParser.handle_charref(name)
            This method is called to process decimal and hexadecimal numeric character
            references of the form &#NNN; and &#xNNN;. For example, the decimal equivalent
            for &gt; is &#62;, whereas the hexadecimal is &#x3E;; in this case the method
            will receive '62' or 'x3E'. This method is never called if convert_charrefs is True.
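
As an illustration of what handle_charref() receives, the sketch below records the raw names and the characters they decode to. The class name CharrefCollector and its attributes are hypothetical; the decoding logic mirrors the Python documentation example in the appendix. As above, this only works with convert_charrefs=False.

```python
from html.parser import HTMLParser


class CharrefCollector(HTMLParser):
    """Hypothetical helper: records numeric character references."""

    def __init__(self):
        # With the default convert_charrefs=True, handle_charref() is never called.
        super().__init__(convert_charrefs=False)
        self.names = []   # raw names as passed by the parser, e.g. '62' or 'x3E'
        self.chars = []   # the characters those names decode to

    def handle_charref(self, name):
        self.names.append(name)
        if name.lower().startswith("x"):
            # Hexadecimal form: drop the leading 'x' and parse base 16.
            self.chars.append(chr(int(name[1:], 16)))
        else:
            # Decimal form.
            self.chars.append(chr(int(name)))


charrefs = CharrefCollector()
charrefs.feed("<p>&#62; &#x3E;</p>")
charrefs.close()
print(charrefs.names)
print(charrefs.chars)
```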
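
    Note:
        Fortunately, Python 3 now handles these cases for us: since Python 3.5,
        convert_charrefs defaults to True, so character references are converted
        before the text reaches handle_data(). Reusing the first example, add some
        special characters to the <head> tag of htmlsample.html and see what happens.
            Contents of htmlsample.html:
            <html>
            <head><title>&#62 &#x3E 404 &copy Not &gt Found & </title></head>
            <body bgcolor="white">
            <center><h1>404 Not Found</h1></center>
            <hr><center>nginx/1.12.2</center>
            </body>
            </html>

        Output of the first example:
                Head Tag : > > 404 © Not > Found &
                The result shows that in Python 3 the example handles the special characters correctly.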
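
The same behaviour can be checked without a file. The sketch below feeds the title text from the sample to a parser created with the default settings and compares the result with html.unescape(). The class name TextCollector and the variable names are hypothetical, and the text is wrapped in a <p> element purely for the demonstration.

```python
import html
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects every text chunk the parser reports."""

    def __init__(self):
        super().__init__()   # convert_charrefs defaults to True on Python 3.5+
        self.chunks = []

    def handle_data(self, data):
        # Character references have already been converted at this point.
        self.chunks.append(data)


raw = "&#62 &#x3E 404 &copy Not &gt Found &"
text_collector = TextCollector()
text_collector.feed("<p>" + raw + "</p>")
text_collector.close()
parsed = "".join(text_collector.chunks)
print(parsed)
print(html.unescape(raw))
```

Both lines print the same text, including the lone trailing &, which is not a character reference and is therefore left as it is.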

    However, HTML also has "asymmetric" tags, such as <p> and <li>, whose end tags
    are often omitted. The example above cannot parse such tags correctly, so the
    code has to be extended to handle them.
        First extend htmlsample.html, using the <li> tag as the example:
        Contents of htmlsample.html:
        <html>
        <head><title>&#62 &#x3E 404 &copy Not &gt Found &</title>
        <body bgcolor="white">
        <center><h1>404 Not Found</h1></center>
        <hr><center>nginx/1.12.2</center>
        <ul>
            <li> First Reason
            <li> Second Reason
        </body>
        </html>

        A browser can still render htmlsample.html, even though its <head> and <ul>
        tags have no matching end tags and its <li> tags are asymmetric. Now add
        some logic to the earlier example to handle these problems.

        Example:
            from html.parser import HTMLParser

            class Parser(HTMLParser):
                def __init__(self):
                    super().__init__()
                    self.taglevels = []                 # start tags seen so far, in order
                    self.tags = ['head', 'ul', 'li']    # tags whose text is collected
                    self.parsesemaphore = False
                    self.data = ''

                def handle_starttag(self, tag, attrs):  # enable semaphore
                    # A repeated start tag (e.g. <li> right after <li>) implies that
                    # the previous element has ended, so signal the missing end tag.
                    if len(self.taglevels) and self.taglevels[-1] == tag:
                        self.handle_endtag(tag)
                    self.taglevels.append(tag)          # tags are only appended, never popped

                    if tag in self.tags:
                        self.parsesemaphore = True

                def handle_data(self, data):            # collect text while the semaphore is set
                    if self.parsesemaphore:
                        self.data += data

                def handle_endtag(self, tag):           # any end tag stops collection
                    self.parsesemaphore = False

                def gettag(self):
                    return self.data

            if __name__ == "__main__":
                with open('htmlsample.html') as FH:
                    pht = Parser()
                    pht.feed(FH.read())    # feed() invokes the overridden handlers:
                                           # handle_starttag, handle_data and handle_endtag
                    print("Head Tag : %s" % pht.gettag())

            Output:
                 Head Tag : > > 404 © Not > Found &
                 First Reason
                 Second Reason
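
A variation on the same idea, sketched under simplifying assumptions: instead of one accumulated string, keep each <li> as a separate item and treat a new <li>, or the end tag of an enclosing element, as the implied end of the previous item. The class name ListItemParser and the list of enclosing tags are hypothetical choices for illustration; nested lists are not handled.

```python
from html.parser import HTMLParser


class ListItemParser(HTMLParser):
    """Hypothetical sketch: collect the text of each <li>, closed or not."""

    def __init__(self):
        super().__init__()
        self.items = []        # finished <li> texts
        self._current = None   # text pieces of the <li> being read, or None

    def _close_item(self):
        if self._current is not None:
            self.items.append("".join(self._current).strip())
            self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._close_item()     # a new <li> implies the previous one ended
            self._current = []

    def handle_endtag(self, tag):
        # An explicit </li>, or the end of an enclosing element, ends the item.
        if tag in ("li", "ul", "ol", "body", "html"):
            self._close_item()

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)


sample = """<html><body>
<ul>
    <li> First Reason
    <li> Second Reason
</body>
</html>"""

list_parser = ListItemParser()
list_parser.feed(sample)
list_parser.close()
print(list_parser.items)
```

Keeping the items in a list makes the implied boundaries visible, which the single concatenated string in the example above does not.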

Reference:
    https://docs.python.org/3.6/library/html.parser.html?highlight=htmlparse#html.parser.HTMLParser.handle_entityref

Appendix:
    The example given in the Python documentation:
        from html.parser import HTMLParser
        from html.entities import name2codepoint

        class MyHTMLParser(HTMLParser):
            def handle_starttag(self, tag, attrs):
                print("Start tag:", tag)
                for attr in attrs:
                    print("     attr:", attr)

            def handle_endtag(self, tag):
                print("End tag  :", tag)

            def handle_data(self, data):
                print("Data     :", data)

            def handle_comment(self, data):
                print("Comment  :", data)

            def handle_entityref(self, name):
                c = chr(name2codepoint[name])
                print("Named ent:", c)

            def handle_charref(self, name):
                if name.startswith('x'):
                    c = chr(int(name[1:], 16))
                else:
                    c = chr(int(name))
                print("Num ent  :", c)

            def handle_decl(self, data):
                print("Decl     :", data)

        parser = MyHTMLParser()

    Output:
        Parsing a doctype:

        >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
        ...             '"http://www.w3.org/TR/html4/strict.dtd">')
        Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

        Parsing an element with a few attributes and a title:

        >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
        Start tag: img
             attr: ('src', 'python-logo.png')
             attr: ('alt', 'The Python logo')

        >>> parser.feed('<h1>Python</h1>')
        Start tag: h1
        Data     : Python
        End tag  : h1

        The content of script and style elements is returned as is, without further parsing:

        >>> parser.feed('<style type="text/css">#python { color: green }</style>')
        Start tag: style
             attr: ('type', 'text/css')
        Data     : #python { color: green }
        End tag  : style

        >>> parser.feed('<script type="text/javascript">'
        ...             'alert("<strong>hello!</strong>");</script>')
        Start tag: script
             attr: ('type', 'text/javascript')
        Data     : alert("<strong>hello!</strong>");
        End tag  : script

        Parsing comments:

        >>> parser.feed('<!-- a comment -->'
        ...             '<!--[if IE 9]>IE-specific content<![endif]-->')
        Comment  :  a comment
        Comment  : [if IE 9]>IE-specific content<![endif]

        Parsing named and numeric character references and converting them to the correct
        char (note: these 3 references are all equivalent to '>'):

        >>> parser.feed('&gt;&#62;&#x3E;')
        Named ent: >
        Num ent  : >
        Num ent  : >

        These three lines are what a parser created with convert_charrefs=False prints.
        With the default convert_charrefs=True, handle_entityref() and handle_charref()
        are never called and the converted text arrives through handle_data() instead.

        Feeding incomplete chunks to feed() works, but handle_data() might be called more
        than once (unless convert_charrefs is set to True):

        >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
        ...     parser.feed(chunk)
        Start tag: span
        Data     : buff
        Data     : ered
        Data     : text
        End tag  : span

        Parsing invalid HTML (e.g. unquoted attributes) also works:

        >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
        Start tag: p
        Start tag: a
             attr: ('class', 'link')
             attr: ('href', '#main')
        Data     : tag soup
        End tag  : p
        End tag  : a

Time: 2024-11-06 19:41:31
