Straight to the good stuff!
Environment: Python 2.7.5 on Windows.
Open http://www.apple.com/cn/itunes/charts/free-apps/
As the screenshot above shows, the page is encoded in UTF-8.
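If you would rather not eyeball the page source, the charset can also be read from the HTML itself. Below is a minimal sketch, assuming the page declares its charset in a `<meta>` tag near the top; `sniff_charset` and the sample markup are made up for illustration and are not part of the original script.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import re

# Hypothetical helper (not from the original post): find a charset
# declaration such as <meta charset="utf-8"> or
# <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.
_META_CHARSET = re.compile(br'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)

def sniff_charset(html_bytes, default='utf-8'):
    """Return the charset named in a <meta> tag, or `default` if none is found."""
    match = _META_CHARSET.search(html_bytes[:2048])  # declarations sit near the top
    if match:
        return match.group(1).decode('ascii').lower()
    return default

# Made-up sample markup, standing in for the bytes returned by read().
sample = b'<html><head><meta charset="utf-8"><title>Charts</title></head></html>'
print(sniff_charset(sample))  # utf-8
```

In the spider, the bytes returned by `myResponse.read()` would take the place of `sample`.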
After some inner struggle, here is the code (criticism welcome, just not in the face):
#coding=utf-8
import urllib2
import urllib
import re
import thread
import time

#----------- App Store charts -----------
class Spider_Model:

    def __init__(self):
        self.page = 1
        self.pages = []
        self.enable = False

    def GetCon(self):
        myUrl = "http://www.apple.com/cn/itunes/charts/free-apps/"
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = { 'User-Agent' : user_agent }
        req = urllib2.Request(myUrl, headers = headers)
        myResponse = urllib2.urlopen(req)
        myPage = myResponse.read()
        # encode converts a unicode string into a byte string in another encoding
        # decode converts a byte string in another encoding into a unicode string
        print myPage

print ' '
myModel = Spider_Model()
myModel.GetCon()
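The scraped page's charset and the Python source file's charset are both UTF-8 (so yours truly figured there would be no problem).
Printed output:
Time to pull out the trump card: www.baidu.com
Found the cause: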
http://blog.csdn.net/lf8289/article/details/2465196
http://www.crifan.com/unicodeencodeerror_gbk_codec_can_not_encode_character_in_position_illegal_multibyte_sequence/
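Both linked posts deal with the `'gbk' codec can't encode character` error. Here is a minimal sketch of the likely cause, assuming the script runs in a Chinese Windows console whose code page is GBK (cp936). The source file and the page agree on UTF-8, but the console is a third party with its own encoding. The sample string is made up for illustration and is not taken from the real page.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Illustrative sample: the two characters for "free" followed by a
# non-breaking space (U+00A0), which the gbk codec cannot represent.
text = u'\u514d\u8d39\u00a0App'
utf8_bytes = text.encode('utf-8')

# 1) UTF-8 bytes read as if they were GBK come out as mojibake,
#    not as the original text.
misread = utf8_bytes.decode('gbk', 'replace')
print(misread == text)  # False

# 2) A strict encode to GBK fails on the character GBK lacks.
try:
    text.encode('gbk')
    strict_ok = True
except UnicodeEncodeError as exc:
    strict_ok = False
    print('strict encode failed:', exc.reason)
```

Case 1 is what printing raw UTF-8 bytes to a GBK console amounts to. Case 2 is the error named in the second link.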
Frantic editing in progress...
#coding=gbk
# source file encoding changed to gbk
import urllib2
import urllib
import re
import thread
import time

#----------- App Store charts -----------
class Spider_Model:

    def __init__(self):
        self.page = 1
        self.pages = []
        self.enable = False

    def GetCon(self):
        myUrl = "http://www.apple.com/cn/itunes/charts/free-apps/"
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = { 'User-Agent' : user_agent }
        req = urllib2.Request(myUrl, headers = headers)
        myResponse = urllib2.urlopen(req)
        myPage = myResponse.read()
        # encode converts a unicode string into a byte string in another encoding
        # decode converts a byte string in another encoding into a unicode string
        # The page is UTF-8: decode it, then re-encode as gbk
        # ('ignore' drops characters gbk cannot represent).
        # Despite its name, unicodePage holds gbk-encoded bytes.
        unicodePage = myPage.decode('utf-8').encode('gbk', 'ignore')
        print unicodePage

print ' '
myModel = Spider_Model()
myModel.GetCon()
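The decode-then-encode step can be pulled out into a small helper. This is a minimal sketch: `transcode` is a hypothetical name, not part of the original script, and the input string is made up. It compares the `'ignore'` handler used in the script with `'replace'`, which leaves a visible `?` wherever a character was dropped.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

def transcode(raw, src='utf-8', dst='gbk', errors='ignore'):
    """Decode `raw` bytes from `src`, then re-encode them as `dst`.

    `errors` only governs the encode step; malformed input bytes still raise.
    """
    return raw.decode(src).encode(dst, errors)

# Illustrative input: the three characters for "charts", a space, and a
# copyright sign (U+00A9), which the gbk codec cannot represent.
raw = u'\u6392\u884c\u699c \u00a9'.encode('utf-8')

ignored = transcode(raw)                     # unencodable character silently dropped
replaced = transcode(raw, errors='replace')  # unencodable character becomes '?'

print(len(ignored), len(replaced))  # 7 8
```

`'ignore'` keeps the output tidy, while `'replace'` makes it obvious that something was lost, which can help when debugging a scraper.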
Result of the run:
First look at Python: App Store chart spider (part 1), 布布扣, bubuko.com
Time: 2024-12-19 14:09:14