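When testing code in CMD (the Windows command prompt), print() encodes its output as gbk by default. If the text contains a Unicode character that gbk cannot represent, the following error is raised: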
UnicodeEncodeError: 'gbk' codec can't encode character u'\u200e' in position 43: illegal multibyte sequence
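To work around this, pass errors='ignore' when converting the text (encode it to gbk, then decode it back), so that characters gbk cannot represent are dropped instead of raising an error.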
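A minimal sketch of that workaround, outside of any scraper. The sample string is hypothetical; it only assumes that the text contains U+200E (LEFT-TO-RIGHT MARK), the character from the error message above, which gbk cannot represent.

```python
# Hypothetical sample text; U+200E has no gbk mapping.
text = "news title\u200e"

try:
    text.encode("gbk")  # errors="strict" is the default
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Drop whatever gbk cannot represent, then decode back to str for print().
safe_text = text.encode("gbk", errors="ignore").decode("gbk")
print(safe_text)
```

Note that the offending characters are removed silently, so this is only appropriate when losing them is acceptable, as with invisible marks like U+200E.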
bytes.decode(encoding="utf-8", errors="strict")
bytearray.decode(encoding="utf-8", errors="strict")
Return a string decoded from the given bytes. Default encoding is 'utf-8'. errors may be given to set a different error handling scheme. The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace' and any other name registered via codecs.register_error(), see section Error Handlers. For a list of possible encodings, see section Standard Encodings.
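The error handlers described above can be compared side by side. This is only an illustrative sketch: the byte string is a made-up input consisting of valid UTF-8 followed by one stray byte that is not valid UTF-8.

```python
# Hypothetical input: "abc" in UTF-8 followed by an invalid byte 0xFF.
data = b"abc\xff"

try:
    data.decode("utf-8")  # errors="strict" is the default
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "raised UnicodeDecodeError"

ignored = data.decode("utf-8", errors="ignore")    # invalid byte is dropped
replaced = data.decode("utf-8", errors="replace")  # invalid byte becomes U+FFFD

print(strict_result)
# ascii() escapes non-ASCII characters, so this prints safely on a gbk console.
print(ascii(ignored))
print(ascii(replaced))
```

'replace' keeps a visible marker where data was lost, which can be easier to debug than 'ignore'.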
import urllib.request

from bs4 import BeautifulSoup


def trade_spider(max_pages):
    # Send browser-like headers with every request.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
        'Accept': 'text/html;q=0.9,*/*;q=0.8',
    }
    opener = urllib.request.build_opener()
    # addheaders expects a list of (name, value) tuples, not a dict.
    opener.addheaders = list(headers.items())
    page = 1
    while page <= max_pages:
        url = r'http://news.zjicm.edu.cn/web_2/pages/type.php?Page_Id=8C2D81A53403FED2AEE8F706017F8C3E&PageNo=' + str(page)
        source_code = opener.open(url).read()
        soup = BeautifulSoup(source_code, "html.parser")
        for link in soup.find(class_='list-nopic').find_all('a'):
            if link.string is None:  # skip anchors without a single text node
                continue
            # Round-trip through gbk, dropping characters gbk cannot represent.
            str1 = link.string.encode('gbk', errors='ignore')
            print(str1.decode('gbk', errors='ignore'))
        page += 1


trade_spider(10)
Time: 2024-10-06 21:53:47