Python Web Programming[0] -> Web Client[1] -> Web Page Parsing

Web Page Parsing



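1 Parsing with HTMLParser

This section introduces a basic way to parse the HTML of a web page, using the html.parser module that ships with Python. The main steps are:

  1. Create a new parser class that inherits from HTMLParser;
  2. Override methods such as handle_starttag to implement the desired behavior;
  3. Instantiate the new parser and feed the HTML text to the instance.

Complete code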

from html.parser import HTMLParser

# An HTMLParser instance is fed HTML data and calls handler methods when start tags,
# end tags, text, comments, and other markup elements are encountered.
# Subclass HTMLParser and override its methods to implement the desired behavior.

class MyHTMLParser(HTMLParser):
    # attrs holds the attributes set in the HTML start tag, as (name, value) tuples
    def handle_starttag(self, tag, attrs):
        print('Encountered a start tag:', tag)
        for attr in attrs:
            print('     attr:', attr)

    def handle_endtag(self, tag):
        print('Encountered an end tag :', tag)

    def handle_data(self, data):
        print('Encountered some data  :', data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>'
            '<img src="python-logo.png" alt="The Python logo">')

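The code first imports the module and derives a new parser class, then overrides the handler methods. On a start tag it prints the tag name and then every attribute defined on that tag, if there are any; end tags and text data are printed in the same way.

Note: the attrs argument of handle_starttag() is a list of tuples built from the attributes of the start tag. In each tuple the first element is the attribute name and the second is the attribute value.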
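
Because attrs is a list of (name, value) tuples, it can be passed straight to dict() when lookup by attribute name is more convenient. The following is a minimal sketch of that idea; the class name AttrDictParser and its seen attribute are made up for illustration and are not part of the original code.

```python
from html.parser import HTMLParser


class AttrDictParser(HTMLParser):
    """Hypothetical helper: records each start tag with its attributes as a dict."""

    def __init__(self):
        super().__init__()
        self.seen = []  # list of (tag, {name: value}) pairs

    def handle_starttag(self, tag, attrs):
        # attrs arrives as [(name, value), ...]; dict() turns it into a mapping
        self.seen.append((tag, dict(attrs)))


parser = AttrDictParser()
parser.feed('<img src="python-logo.png" alt="The Python logo">')
tag, attributes = parser.seen[0]
print(tag, attributes['src'], attributes['alt'])
```

If a tag repeats an attribute name, dict() keeps only the last value, so the plain list is the safer choice when duplicates matter.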

Output

Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data  : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data  : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
Encountered a start tag: img
     attr: ('src', 'python-logo.png')
     attr: ('alt', 'The Python logo')

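The output shows that the parser has parsed the HTML text and printed the attributes contained in the tags.

2 Parsing with BeautifulSoup

Next comes BeautifulSoup, a third-party package for parsing HTML pages, which is also compared with HTMLParser.

BeautifulSoup must be installed first, as shown below. The complete code also imports html5lib, a separate package that can be installed the same way (pip install html5lib).

pip install beautifulsoup4

Complete code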

from html.parser import HTMLParser
from io import StringIO
from urllib import request

from bs4 import BeautifulSoup, SoupStrainer
from html5lib import parse, treebuilders


URLs = ('http://python.org',
        'http://www.baidu.com')

def output(x):
    # Drop duplicates with set(), sort, and print one link per line
    print('\n'.join(sorted(set(x))))

def simple_beau_soup(url, f):
    'simple_beau_soup() - use BeautifulSoup to parse all tags to get anchors'
    # BeautifulSoup returns a BeautifulSoup instance
    # find_all function returns a bs4.element.ResultSet instance,
    # which contains bs4.element.Tag instances,
    # use tag['attr'] to get attribute of tag
    # request.urljoin is urllib.parse.urljoin, reachable here because urllib.request imports it
    output(request.urljoin(url, x['href']) for x in BeautifulSoup(markup=f, features='html5lib').find_all('a'))

def faster_beau_soup(url, f):
    'faster_beau_soup() - use BeautifulSoup to parse only anchor tags'
    # Same as simple_beau_soup(), plus parse_only=SoupStrainer('a') to keep only anchor tags
    output(request.urljoin(url, x['href']) for x in BeautifulSoup(markup=f, features='html5lib', parse_only=SoupStrainer('a')).find_all('a'))

def htmlparser(url, f):
    'htmlparser() - use HTMLParser to parse anchor tags'
    class AnchorParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            if tag != 'a':
                return
            if not hasattr(self, 'data'):
                self.data = []
            for attr in attrs:
                if attr[0] == 'href':
                    self.data.append(attr[1])
    parser = AnchorParser()
    parser.feed(f.read())
    # data only exists once an anchor has been seen, so fall back to an empty list
    output(request.urljoin(url, x) for x in getattr(parser, 'data', []))
    print('DONE')

def html5libparse(url, f):
    'html5libparse() - use html5lib to parse anchor tags'
    # Not implemented; the original attempt is kept below, commented out
    #output(request.urljoin(url, x.attributes['href']) for x in parse(f) if isinstance(x, treebuilders.etree.Element) and x.name == 'a')

def process(url, data):
    print('\n*** simple BeauSoupParser')
    simple_beau_soup(url, data)
    data.seek(0)
    print('\n*** faster BeauSoupParser')
    faster_beau_soup(url, data)
    data.seek(0)
    print('\n*** HTMLParser')
    htmlparser(url, data)
    data.seek(0)
    print('\n*** HTML5lib')
    html5libparse(url, data)
    data.seek(0)

if __name__ == '__main__':
    for url in URLs:
        f = request.urlopen(url)
        data = StringIO(f.read().decode())
        f.close()
        process(url, data)

Step-by-step explanation

First import the required modules. StringIO provides an in-memory text buffer that behaves like a file.

from html.parser import HTMLParser
from io import StringIO
from urllib import request

from bs4 import BeautifulSoup, SoupStrainer
from html5lib import parse, treebuilders


URLs = ('http://python.org',
        'http://www.baidu.com')

Next define an output function that uses a set to remove duplicate entries, sorts what remains, and joins the items with newlines.

def output(x):
    print('\n'.join(sorted(set(x))))
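
To see the deduplicate, sort, and join steps on their own, here is a small sketch that returns the text instead of printing it. The helper name unique_sorted_lines and the sample list are invented for illustration.

```python
def unique_sorted_lines(items):
    # Hypothetical helper: same set/sort/join steps as output(), but returns the text
    return '\n'.join(sorted(set(items)))


# Sample data with one duplicate entry
links = ['http://python.org/', 'http://blog.python.org', 'http://python.org/']
text = unique_sorted_lines(links)
print(text)
```

Like output(), the helper accepts any iterable, including the generator expressions used by the parsing functions.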

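Here a simple BeautifulSoup parsing function is defined. It passes the HTML text and the features argument to the BeautifulSoup class (newer versions of bs4 prompt for an explicit parser; 'html5lib' is used here) to create a BeautifulSoup instance. It then calls find_all() to get every 'a' tag as a collection of anchor objects (bs4.element.Tag), reads the href attribute from each Tag, and finally builds the full link with urljoin and prints it.

def simple_beau_soup(url, f):
    'simple_beau_soup() - use BeautifulSoup to parse all tags to get anchors'
    # BeautifulSoup returns a BeautifulSoup instance
    # find_all function returns a bs4.element.ResultSet instance,
    # which contains bs4.element.Tag instances,
    # use tag['attr'] to get attribute of tag
    # request.urljoin is urllib.parse.urljoin, reachable here because urllib.request imports it
    output(request.urljoin(url, x['href']) for x in BeautifulSoup(markup=f, features='html5lib').find_all('a'))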
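
How urljoin combines the page URL with each href explains several entries in the program output, such as fragment links and the javascript:; item. The sketch below imports urljoin from urllib.parse, its documented location; the sample href values are chosen for illustration.

```python
from urllib.parse import urljoin

base = 'http://python.org'

# A site-relative path is resolved against the base URL
relative = urljoin(base, '/about/')
# A bare fragment is appended to the base URL
fragment = urljoin(base, '#top')
# An absolute URL replaces the base entirely
absolute = urljoin(base, 'https://docs.python.org')
# A non-hierarchical scheme such as javascript: is returned unchanged
script = urljoin(base, 'javascript:;')

for link in (relative, fragment, absolute, script):
    print(link)
```

Filtering out non-HTTP schemes before printing would be one way to keep entries like javascript:; out of the result, if that is wanted.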

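Next define another parsing function. It passes parse_only to restrict parsing to anchor tags, which is meant to speed up parsing.

Note: there is a problem here. The 'html5lib' tree builder does not support the parse_only argument, so the entire document is still parsed. This remains unresolved.

def faster_beau_soup(url, f):
    'faster_beau_soup() - use BeautifulSoup to parse only anchor tags'
    # Same as simple_beau_soup(), plus parse_only=SoupStrainer('a') to keep only anchor tags
    output(request.urljoin(url, x['href']) for x in BeautifulSoup(markup=f, features='html5lib', parse_only=SoupStrainer('a')).find_all('a'))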
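
One possible way around the parse_only limitation is to switch to a tree builder that honours it, such as the 'html.parser' builder bundled with Python. The following is only a sketch under that assumption, not the author's solution; the SAMPLE markup is invented, and html.parser may treat badly broken markup differently from html5lib.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample markup, used only for illustration
SAMPLE = ('<html><body><h1>Title</h1>'
          '<a href="/about/">About</a><p>text</p>'
          '<a href="#top">Top</a></body></html>')

# parse_only is honoured by the html.parser builder, so only <a> tags are kept
only_anchors = SoupStrainer('a')
soup = BeautifulSoup(SAMPLE, features='html.parser', parse_only=only_anchors)

# href=True skips anchors that have no href attribute
links = [tag['href'] for tag in soup.find_all('a', href=True)]
print(links)
```

With this builder no UserWarning about parse_only should appear, and elements other than anchors never enter the tree.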

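Then define a function that parses with HTMLParser, used as described in the previous section. It first creates an anchor parser class. On each start tag the class checks whether the tag is an 'a' anchor, then checks whether the instance already has a data attribute and initializes it to an empty list if not, and then iterates over attrs to collect the href value. Finally the function creates an instance and feeds it the data.

def htmlparser(url, f):
    'htmlparser() - use HTMLParser to parse anchor tags'
    class AnchorParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            if tag != 'a':
                return
            if not hasattr(self, 'data'):
                self.data = []
            for attr in attrs:
                if attr[0] == 'href':
                    self.data.append(attr[1])
    parser = AnchorParser()
    parser.feed(f.read())
    # data only exists once an anchor has been seen, so fall back to an empty list
    output(request.urljoin(url, x) for x in getattr(parser, 'data', []))
    print('DONE')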
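
A variant worth considering creates the list in __init__, which removes the hasattr check and also covers pages that contain no anchors at all. This is a sketch of that design, not the original code; AnchorCollector, extract_links, and the sample markup are hypothetical names and data.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Hypothetical variant of AnchorParser that creates its list up front."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == 'href' and value is not None:
                self.links.append(value)


def extract_links(base_url, html_text):
    # Feed the text, then return the unique absolute links in sorted order
    collector = AnchorCollector()
    collector.feed(html_text)
    return sorted({urljoin(base_url, link) for link in collector.links})


sample = '<a href="/about/">About</a><a name="x">no link</a><a href="/about/">Again</a>'
print(extract_links('http://python.org', sample))
```

Returning the list instead of printing it also makes the function easy to check without network access.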

Finally define a process function. After each parser has consumed the data passed in, seek(0) must be called to move the cursor back to the beginning.

def process(url, data):
    print('\n*** simple BeauSoupParser')
    simple_beau_soup(url, data)
    data.seek(0)
    print('\n*** faster BeauSoupParser')
    faster_beau_soup(url, data)
    data.seek(0)
    print('\n*** HTMLParser')
    htmlparser(url, data)
    data.seek(0)
    print('\n*** HTML5lib')
    html5libparse(url, data)
    data.seek(0)
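
The need for seek(0) is easiest to see on a small buffer: once a StringIO has been read to the end, a second read returns an empty string until the cursor is rewound. A minimal sketch with invented sample content:

```python
from io import StringIO

data = StringIO('<a href="/about/">About</a>')

first = data.read()    # consumes the whole buffer
second = data.read()   # cursor is at the end, so nothing is left
data.seek(0)           # rewind to the beginning
third = data.read()    # the full content is available again

print(repr(first))
print(repr(second))
print(repr(third))
```

Without the rewind, every parser after the first one in process() would receive empty input.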

The final result of parsing is the set of all links in each page.

if __name__ == '__main__':
    for url in URLs:
        f = request.urlopen(url)
        data = StringIO(f.read().decode())
        f.close()
        process(url, data)

Program output

*** simple BeauSoupParser
http://blog.python.org
http://bottlepy.org
http://brochure.getpython.info/
http://buildbot.net/
http://docs.python.org/3/tutorial/
http://docs.python.org/3/tutorial/controlflow.html
http://docs.python.org/3/tutorial/controlflow.html#defining-functions
http://docs.python.org/3/tutorial/introduction.html#lists
http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator
http://feedproxy.google.com/~r/PythonInsider/~3/TmC0nYZBrz4/python-364rc1-and-370a3-now-available.html
http://feedproxy.google.com/~r/PythonInsider/~3/rMFQQbvrekU/python-364-is-now-available.html
http://feedproxy.google.com/~r/PythonInsider/~3/ubEu3XCqoFM/python-370a2-now-available-for-testing.html
http://feedproxy.google.com/~r/PythonInsider/~3/xUpvN2wKt2s/python-364rc1-and-370a3-now-available.html
http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html
http://flask.pocoo.org/
http://ipython.org
http://jobs.python.org
http://pandas.pydata.org/
http://planetpython.org/
http://plus.google.com/+Python
http://pycon.blogspot.com/
http://pyfound.blogspot.com/
http://python.org
http://python.org#content
http://python.org#python-network
http://python.org#site-map
http://python.org#top
http://python.org/
http://python.org/about/
http://python.org/about/apps
http://python.org/about/apps/
http://python.org/about/gettingstarted/
http://python.org/about/help/
http://python.org/about/legal/
http://python.org/about/quotes/
http://python.org/about/success/
http://python.org/about/success/#arts
http://python.org/about/success/#business
http://python.org/about/success/#education
http://python.org/about/success/#engineering
http://python.org/about/success/#government
http://python.org/about/success/#scientific
http://python.org/about/success/#software-development
http://python.org/accounts/login/
http://python.org/accounts/signup/
http://python.org/blogs/
http://python.org/community/
http://python.org/community/awards
http://python.org/community/diversity/
http://python.org/community/forums/
http://python.org/community/irc/
http://python.org/community/lists/
http://python.org/community/logos/
http://python.org/community/merchandise/
http://python.org/community/sigs/
http://python.org/community/workshops/
http://python.org/dev/
http://python.org/dev/core-mentorship/
http://python.org/dev/peps/
http://python.org/dev/peps/peps.rss
http://python.org/doc/
http://python.org/doc/av
http://python.org/doc/essays/
http://python.org/download/alternatives
http://python.org/download/other/
http://python.org/downloads/
http://python.org/downloads/mac-osx/
http://python.org/downloads/release/python-2714/
http://python.org/downloads/release/python-364/
http://python.org/downloads/source/
http://python.org/downloads/windows/
http://python.org/events/
http://python.org/events/calendars/
http://python.org/events/python-events
http://python.org/events/python-events/543/
http://python.org/events/python-events/611/
http://python.org/events/python-events/past/
http://python.org/events/python-user-group/
http://python.org/events/python-user-group/605/
http://python.org/events/python-user-group/619/
http://python.org/events/python-user-group/620/
http://python.org/events/python-user-group/past/
http://python.org/jobs/
http://python.org/privacy/
http://python.org/psf-landing/
http://python.org/psf/
http://python.org/psf/donations/
http://python.org/psf/sponsorship/sponsors/
http://python.org/shell/
http://python.org/success-stories/
http://python.org/success-stories/industrial-light-magic-runs-python/
http://python.org/users/membership/
http://roundup.sourceforge.net/
http://tornadoweb.org
http://trac.edgewall.org/
http://twitter.com/ThePSF
http://wiki.python.org/moin/Languages
http://wiki.python.org/moin/TkInter
http://www.ansible.com
http://www.djangoproject.com/
http://www.facebook.com/pythonlang?fref=ts
http://www.pylonsproject.org/
http://www.riverbankcomputing.co.uk/software/pyqt/intro
http://www.saltstack.com
http://www.scipy.org
http://www.web2py.com/
http://www.wxpython.org/
https://bugs.python.org/
https://devguide.python.org/
https://docs.python.org
https://docs.python.org/3/license.html
https://docs.python.org/faq/
https://github.com/python/pythondotorg/issues
https://kivy.org/
https://mail.python.org/mailman/listinfo/python-dev
https://pypi.python.org/
https://status.python.org/
https://wiki.gnome.org/Projects/PyGObject
https://wiki.python.org/moin/
https://wiki.python.org/moin/BeginnersGuide
https://wiki.python.org/moin/Python2orPython3
https://wiki.python.org/moin/PythonBooks
https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event
https://wiki.qt.io/PySide
https://www.openstack.org
https://www.python.org/psf/codeofconduct/
javascript:;

*** faster BeauSoupParser

Warning (from warnings module):
  File "C:\Python35\lib\site-packages\bs4\builder\_html5lib.py", line 63
    warnings.warn("You provided a value for parse_only, but the html5lib tree builder doesn't support parse_only. The entire document will be parsed.")
UserWarning: You provided a value for parse_only, but the html5lib tree builder doesn't support parse_only. The entire document will be parsed.
http://blog.python.org
http://bottlepy.org
http://brochure.getpython.info/
http://buildbot.net/
http://docs.python.org/3/tutorial/
http://docs.python.org/3/tutorial/controlflow.html
http://docs.python.org/3/tutorial/controlflow.html#defining-functions
http://docs.python.org/3/tutorial/introduction.html#lists
http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator
http://feedproxy.google.com/~r/PythonInsider/~3/TmC0nYZBrz4/python-364rc1-and-370a3-now-available.html
http://feedproxy.google.com/~r/PythonInsider/~3/rMFQQbvrekU/python-364-is-now-available.html
http://feedproxy.google.com/~r/PythonInsider/~3/ubEu3XCqoFM/python-370a2-now-available-for-testing.html
http://feedproxy.google.com/~r/PythonInsider/~3/xUpvN2wKt2s/python-364rc1-and-370a3-now-available.html
http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html
http://flask.pocoo.org/
http://ipython.org
http://jobs.python.org
http://pandas.pydata.org/
http://planetpython.org/
http://plus.google.com/+Python
http://pycon.blogspot.com/
http://pyfound.blogspot.com/
http://python.org
http://python.org#content
http://python.org#python-network
http://python.org#site-map
http://python.org#top
http://python.org/
http://python.org/about/
http://python.org/about/apps
http://python.org/about/apps/
http://python.org/about/gettingstarted/
http://python.org/about/help/
http://python.org/about/legal/
http://python.org/about/quotes/
http://python.org/about/success/
http://python.org/about/success/#arts
http://python.org/about/success/#business
http://python.org/about/success/#education
http://python.org/about/success/#engineering
http://python.org/about/success/#government
http://python.org/about/success/#scientific
http://python.org/about/success/#software-development
http://python.org/accounts/login/
http://python.org/accounts/signup/
http://python.org/blogs/
http://python.org/community/
http://python.org/community/awards
http://python.org/community/diversity/
http://python.org/community/forums/
http://python.org/community/irc/
http://python.org/community/lists/
http://python.org/community/logos/
http://python.org/community/merchandise/
http://python.org/community/sigs/
http://python.org/community/workshops/
http://python.org/dev/
http://python.org/dev/core-mentorship/
http://python.org/dev/peps/
http://python.org/dev/peps/peps.rss
http://python.org/doc/
http://python.org/doc/av
http://python.org/doc/essays/
http://python.org/download/alternatives
http://python.org/download/other/
http://python.org/downloads/
http://python.org/downloads/mac-osx/
http://python.org/downloads/release/python-2714/
http://python.org/downloads/release/python-364/
http://python.org/downloads/source/
http://python.org/downloads/windows/
http://python.org/events/
http://python.org/events/calendars/
http://python.org/events/python-events
http://python.org/events/python-events/543/
http://python.org/events/python-events/611/
http://python.org/events/python-events/past/
http://python.org/events/python-user-group/
http://python.org/events/python-user-group/605/
http://python.org/events/python-user-group/619/
http://python.org/events/python-user-group/620/
http://python.org/events/python-user-group/past/
http://python.org/jobs/
http://python.org/privacy/
http://python.org/psf-landing/
http://python.org/psf/
http://python.org/psf/donations/
http://python.org/psf/sponsorship/sponsors/
http://python.org/shell/
http://python.org/success-stories/
http://python.org/success-stories/industrial-light-magic-runs-python/
http://python.org/users/membership/
http://roundup.sourceforge.net/
http://tornadoweb.org
http://trac.edgewall.org/
http://twitter.com/ThePSF
http://wiki.python.org/moin/Languages
http://wiki.python.org/moin/TkInter
http://www.ansible.com
http://www.djangoproject.com/
http://www.facebook.com/pythonlang?fref=ts
http://www.pylonsproject.org/
http://www.riverbankcomputing.co.uk/software/pyqt/intro
http://www.saltstack.com
http://www.scipy.org
http://www.web2py.com/
http://www.wxpython.org/
https://bugs.python.org/
https://devguide.python.org/
https://docs.python.org
https://docs.python.org/3/license.html
https://docs.python.org/faq/
https://github.com/python/pythondotorg/issues
https://kivy.org/
https://mail.python.org/mailman/listinfo/python-dev
https://pypi.python.org/
https://status.python.org/
https://wiki.gnome.org/Projects/PyGObject
https://wiki.python.org/moin/
https://wiki.python.org/moin/BeginnersGuide
https://wiki.python.org/moin/Python2orPython3
https://wiki.python.org/moin/PythonBooks
https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event
https://wiki.qt.io/PySide
https://www.openstack.org
https://www.python.org/psf/codeofconduct/
javascript:;

*** HTMLParser
http://blog.python.org
http://bottlepy.org
http://brochure.getpython.info/
http://buildbot.net/
http://docs.python.org/3/tutorial/
http://docs.python.org/3/tutorial/controlflow.html
http://docs.python.org/3/tutorial/controlflow.html#defining-functions
http://docs.python.org/3/tutorial/introduction.html#lists
http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator
http://feedproxy.google.com/~r/PythonInsider/~3/TmC0nYZBrz4/python-364rc1-and-370a3-now-available.html
http://feedproxy.google.com/~r/PythonInsider/~3/rMFQQbvrekU/python-364-is-now-available.html
http://feedproxy.google.com/~r/PythonInsider/~3/ubEu3XCqoFM/python-370a2-now-available-for-testing.html
http://feedproxy.google.com/~r/PythonInsider/~3/xUpvN2wKt2s/python-364rc1-and-370a3-now-available.html
http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html
http://flask.pocoo.org/
http://ipython.org
http://jobs.python.org
http://pandas.pydata.org/
http://planetpython.org/
http://plus.google.com/+Python
http://pycon.blogspot.com/
http://pyfound.blogspot.com/
http://python.org
http://python.org#content
http://python.org#python-network
http://python.org#site-map
http://python.org#top
http://python.org/
http://python.org/about/
http://python.org/about/apps
http://python.org/about/apps/
http://python.org/about/gettingstarted/
http://python.org/about/help/
http://python.org/about/legal/
http://python.org/about/quotes/
http://python.org/about/success/
http://python.org/about/success/#arts
http://python.org/about/success/#business
http://python.org/about/success/#education
http://python.org/about/success/#engineering
http://python.org/about/success/#government
http://python.org/about/success/#scientific
http://python.org/about/success/#software-development
http://python.org/accounts/login/
http://python.org/accounts/signup/
http://python.org/blogs/
http://python.org/community/
http://python.org/community/awards
http://python.org/community/diversity/
http://python.org/community/forums/
http://python.org/community/irc/
http://python.org/community/lists/
http://python.org/community/logos/
http://python.org/community/merchandise/
http://python.org/community/sigs/
http://python.org/community/workshops/
http://python.org/dev/
http://python.org/dev/core-mentorship/
http://python.org/dev/peps/
http://python.org/dev/peps/peps.rss
http://python.org/doc/
http://python.org/doc/av
http://python.org/doc/essays/
http://python.org/download/alternatives
http://python.org/download/other/
http://python.org/downloads/
http://python.org/downloads/mac-osx/
http://python.org/downloads/release/python-2714/
http://python.org/downloads/release/python-364/
http://python.org/downloads/source/
http://python.org/downloads/windows/
http://python.org/events/
http://python.org/events/calendars/
http://python.org/events/python-events
http://python.org/events/python-events/543/
http://python.org/events/python-events/611/
http://python.org/events/python-events/past/
http://python.org/events/python-user-group/
http://python.org/events/python-user-group/605/
http://python.org/events/python-user-group/619/
http://python.org/events/python-user-group/620/
http://python.org/events/python-user-group/past/
http://python.org/jobs/
http://python.org/privacy/
http://python.org/psf-landing/
http://python.org/psf/
http://python.org/psf/donations/
http://python.org/psf/sponsorship/sponsors/
http://python.org/shell/
http://python.org/success-stories/
http://python.org/success-stories/industrial-light-magic-runs-python/
http://python.org/users/membership/
http://roundup.sourceforge.net/
http://tornadoweb.org
http://trac.edgewall.org/
http://twitter.com/ThePSF
http://wiki.python.org/moin/Languages
http://wiki.python.org/moin/TkInter
http://www.ansible.com
http://www.djangoproject.com/
http://www.facebook.com/pythonlang?fref=ts
http://www.pylonsproject.org/
http://www.riverbankcomputing.co.uk/software/pyqt/intro
http://www.saltstack.com
http://www.scipy.org
http://www.web2py.com/
http://www.wxpython.org/
https://bugs.python.org/
https://devguide.python.org/
https://docs.python.org
https://docs.python.org/3/license.html
https://docs.python.org/faq/
https://github.com/python/pythondotorg/issues
https://kivy.org/
https://mail.python.org/mailman/listinfo/python-dev
https://pypi.python.org/
https://status.python.org/
https://wiki.gnome.org/Projects/PyGObject
https://wiki.python.org/moin/
https://wiki.python.org/moin/BeginnersGuide
https://wiki.python.org/moin/Python2orPython3
https://wiki.python.org/moin/PythonBooks
https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event
https://wiki.qt.io/PySide
https://www.openstack.org
https://www.python.org/psf/codeofconduct/
javascript:;
DONE

*** HTML5lib

*** simple BeauSoupParser
http://e.baidu.com/?refer=888
http://home.baidu.com
http://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=
http://ir.baidu.com
http://jianyi.baidu.com/
http://map.baidu.com
http://map.baidu.com/m?word=&fr=ps01000
http://music.baidu.com/search?fr=ps&ie=utf-8&key=
http://news.baidu.com
http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=
http://tieba.baidu.com
http://tieba.baidu.com/f?kw=&fr=wwwt
http://v.baidu.com
http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&ie=utf-8&word=
http://wenku.baidu.com/search?word=&lm=0&od=0&ie=utf-8
http://www.baidu.com/
http://www.baidu.com/cache/sethelp/help.html
http://www.baidu.com/duty/
http://www.baidu.com/gaoji/preferences.html
http://www.baidu.com/more/
http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=11000002000001
http://www.hao123.com
http://xueshu.baidu.com
http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt
https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F
javascript:;

*** faster BeauSoupParser
http://e.baidu.com/?refer=888
http://home.baidu.com
http://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=
http://ir.baidu.com
http://jianyi.baidu.com/
http://map.baidu.com
http://map.baidu.com/m?word=&fr=ps01000
http://music.baidu.com/search?fr=ps&ie=utf-8&key=
http://news.baidu.com
http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=
http://tieba.baidu.com
http://tieba.baidu.com/f?kw=&fr=wwwt
http://v.baidu.com
http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&ie=utf-8&word=
http://wenku.baidu.com/search?word=&lm=0&od=0&ie=utf-8
http://www.baidu.com/
http://www.baidu.com/cache/sethelp/help.html
http://www.baidu.com/duty/
http://www.baidu.com/gaoji/preferences.html
http://www.baidu.com/more/
http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=11000002000001
http://www.hao123.com
http://xueshu.baidu.com
http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt
https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F
javascript:;

*** HTMLParser
http://e.baidu.com/?refer=888
http://home.baidu.com
http://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=
http://ir.baidu.com
http://jianyi.baidu.com/
http://map.baidu.com
http://map.baidu.com/m?word=&fr=ps01000
http://music.baidu.com/search?fr=ps&ie=utf-8&key=
http://news.baidu.com
http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=
http://tieba.baidu.com
http://tieba.baidu.com/f?kw=&fr=wwwt
http://v.baidu.com
http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&ie=utf-8&word=
http://wenku.baidu.com/search?word=&lm=0&od=0&ie=utf-8
http://www.baidu.com/
http://www.baidu.com/cache/sethelp/help.html
http://www.baidu.com/duty/
http://www.baidu.com/gaoji/preferences.html
http://www.baidu.com/more/
http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=11000002000001
http://www.hao123.com
http://xueshu.baidu.com
http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt
https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F
javascript:;
DONE

*** HTML5lib

References

Core Python Applications Programming, 3rd Edition (《Python 核心编程 第3版》)

Original article: https://www.cnblogs.com/stacklike/p/8244925.html

Time: 2024-11-05 13:36:34
