How to use builtwith in Python 3 (step by step)

1. First, install builtwith with pip install builtwith

C:\Users\Administrator>pip install builtwith
Collecting builtwith
  Downloading builtwith-1.3.2.tar.gz
Installing collected packages: builtwith
  Running setup.py install for builtwith ... done
Successfully installed builtwith-1.3.2

2. Create a new project in PyCharm and enter the following test code

import builtwith
tech_used = builtwith.parse('http://www.baidu.com')
print(tech_used)

Running it produces the following error:

C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
Traceback (most recent call last):
  File "F:/python/first/FirstPy", line 1, in <module>
    import builtwith
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 43
    except Exception, e:
                    ^
SyntaxError: invalid syntax  

Process finished with exit code 1

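The cause is that builtwith was written for Python 2.x, so a few places need to be changed. Double-click the offending file in PyCharm's error output to open it, then make the following three kinds of changes:
1. The Python 2 form "except Exception, e" is no longer supported; change it to "except Exception as e".
2. Python 2 print statements must become function calls in Python 3, with the arguments in parentheses.
3. builtwith uses the Python 2 urllib2 package, which does not exist in Python 3, so the urllib2-related code must be changed.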
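As a small illustration of changes 1 and 2, here is a sketch in Python 3 syntax with the Python 2 form shown in comments. The function name parse_port is hypothetical and is not part of builtwith; it only exists to give the except clause something to catch.

```python
def parse_port(text):
    """Convert text to an int, returning None if it is not a number."""
    try:
        return int(text)
    except Exception as e:      # Python 2 form: except Exception, e:
        print("Error:", e)      # Python 2 form: print "Error:", e
        return None


parse_port("8080")   # succeeds
parse_port("abc")    # prints an error message and returns None
```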
Items 1 and 2 are easy to fix; the rest of this section deals with item 3.
First, replace import urllib2 with the following code:

import urllib.request
import urllib.error

Then replace the urllib2 calls as follows:

request = urllib.request.Request(url, None, {'User-Agent': user_agent})
response = urllib.request.urlopen(request)
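For context, the two calls above can be wrapped into a self-contained sketch like the one below. The helper names build_request and fetch, the default User-Agent string, and the timeout value are assumptions made for illustration; they are not builtwith's own code.

```python
import urllib.request
import urllib.error


def build_request(url, user_agent="Mozilla/5.0"):
    """Build a Request carrying a custom User-Agent header."""
    return urllib.request.Request(url, None, {"User-Agent": user_agent})


def fetch(url, user_agent="Mozilla/5.0", timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    request = build_request(url, user_agent)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as e:  # HTTPError is a subclass of URLError
        print("Error:", e)
        return None
```

Note that fetch returns bytes, not str, which matters for the next step.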

Run the project again and you get the following error:

C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
Traceback (most recent call last):
  File "F:/python/first/FirstPy", line 3, in <module>
    builtwith.parse('http://www.baidu.com')
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 62, in builtwith
    if contains(html, snippet):
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 105, in contains
    return re.compile(regex.split('\\;')[0], flags=re.IGNORECASE).search(v)
TypeError: cannot use a string pattern on a bytes-like object  

Process finished with exit code 1

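This is because in Python 3 response.read() returns bytes rather than str, so the data must be decoded before it can be matched against string patterns. Change the following code: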

if html is None:
    html = response.read()

to:

if html is None:
    html = response.read()
    html = html.decode('utf-8')
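The TypeError can be reproduced without any network access. The sketch below uses a made-up HTML snippet and pattern (both are assumptions for illustration) to show that a str pattern cannot search bytes, and that the same search works once the bytes are decoded.

```python
import re

# A str pattern, like the ones builtwith compiles from its rules.
pattern = re.compile("jquery", flags=re.IGNORECASE)

# What response.read() returns in Python 3: bytes, not str.
raw = b"<script src='jQuery.min.js'></script>"

try:
    pattern.search(raw)
    bytes_ok = True
except TypeError as e:
    bytes_ok = False
    print("TypeError:", e)

# After decoding, the same pattern matches normally.
text = raw.decode("utf-8")
match = pattern.search(text)
print(match.group(0))
```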

Run it again and you get the final result:

C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
{'javascript-frameworks': ['jQuery']}

Process finished with exit code 0

However, if the site is changed to 'http://www.163.com', running it fails again with the following error:

C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
Error: 'utf-8' codec can't decode byte 0xcd in position 500: invalid continuation byte
Traceback (most recent call last):
  File "F:/python/first/FirstPy", line 2, in <module>
    tech_used = builtwith.parse('http://www.163.com')
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 63, in builtwith
    if contains(html, snippet):
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 106, in contains
    return re.compile(regex.split('\\;')[0], flags=re.IGNORECASE).search(v)
TypeError: cannot use a string pattern on a bytes-like object

Process finished with exit code 1

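This still looks like an encoding problem: the first line of the output shows that the UTF-8 decode failed, which apparently left html as bytes. Changing the encoding to 'gbk' makes the run succeed: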

C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
{'web-servers': ['Nginx']}

Process finished with exit code 0
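To see why one site decodes with UTF-8 and another needs GBK, here is a minimal sketch that needs no network access. The sample string is an assumption chosen for illustration; it stands in for page content served in GBK.

```python
# Bytes as a GBK-encoded page would deliver them.
raw = "网易".encode("gbk")

# Decoding GBK bytes as UTF-8 fails, just as it did inside builtwith.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as e:
    utf8_ok = False
    print("Error:", e)

# Decoding with the encoding the page actually uses works.
text = raw.decode("gbk")
print(text)
```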

So do different sites need different decoders? Here is one way to detect a site's encoding.
Install a package called chardet:

C:\Users\Administrator>pip install chardet
Collecting chardet
  Downloading chardet-2.3.0-py2.py3-none-any.whl (180kB)
    100% |████████████████████████████████| 184kB 616kB/s
Installing collected packages: chardet
Successfully installed chardet-2.3.0  

C:\Users\Administrator>

Pass the bytes data to chardet's detect function and it returns a dict with two values, the detected encoding and a confidence score:

{'encoding': 'utf-8', 'confidence': 0.99}

Modify the corresponding builtwith code as follows:

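# Detect the encoding of the raw bytes, then decode accordingly.
encode_type = chardet.detect(html)
if encode_type['encoding'] == 'utf-8':
    html = html.decode('utf-8')
else:
    # Anything not detected as UTF-8 is treated as GBK.
    html = html.decode('gbk')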

Remember to import chardet!
With chardet detecting the character encoding, builtwith can now handle both UTF-8 and GBK sites.
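The if/else above assumes every page is either UTF-8 or GBK. A slightly more defensive variant is sketched below; it is a different technique from the one in this post, not builtwith's own code. The function name decode_with_fallback and its parameters are hypothetical. The detected argument is meant to receive chardet.detect(html)['encoding'], which may be None or a name Python does not recognize.

```python
def decode_with_fallback(raw, detected=None, fallbacks=("utf-8", "gbk")):
    """Decode bytes, trying the detected encoding first, then the fallbacks."""
    candidates = ([detected] if detected else []) + list(fallbacks)
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong encoding for these bytes, or unknown encoding name.
            continue
    # Last resort: keep going with undecodable bytes replaced.
    return raw.decode("utf-8", errors="replace")


# Assumed usage inside builtwith, with chardet imported:
#     html = decode_with_fallback(html, chardet.detect(html)['encoding'])
print(decode_with_fallback("网易".encode("gbk")))
```

The trade-off is that a wrong guess that happens to decode without error is accepted silently, so the detected encoding is still tried first.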

http://blog.csdn.net/fengzhizi76506/article/details/61617067

Date: 2024-10-11 16:14:05
