Dataset source: http://www.sogou.com/labs/resource/cs.php
Goal: extract all titles into one text file and all contents into another.
Code:
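# python2
import chardet

with open("news_sohusite_xml.dat", 'r') as h:
    x = h.readlines()

# Each record spans 6 lines; the 4th line holds the title, the 5th the content.
# print(x[3])
topics = x[3::6]
print(len(topics))
contents = x[4::6]

# chardet reports gb2312 here, but decoding with it fails, so gb18030 is used below.
detected = chardet.detect(x[3])
print(detected)

with open("sohusite_topics.txt", "a") as f_out:
    for i in topics:
        # Drop the 14-character opening tag and the 16-character closing tag plus newline.
        f_out.write(i[14:-16].decode("gb18030").encode("utf-8") + '\n')
        # f_out.write(i[14:-16].decode(detected["encoding"]).encode("utf-8") + '\n')

with open("sohusite_contents.txt", "a") as f_outt:
    for i in contents:
        # Drop the 9-character opening tag and the 11-character closing tag plus newline.
        f_outt.write(i[9:-11].decode("gb18030").encode("utf-8") + '\n')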
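The fixed slice offsets in the script only work while every record has exactly six lines and the tags have exactly those lengths. As a rough alternative, here is a minimal Python 3 sketch that picks the fields out by tag name instead. The tag names `contenttitle` and `content` are assumptions inferred from the slice lengths (14/16 and 9/11), and the helper names `extract_tag` and `split_titles_and_contents` are hypothetical, not part of the original script.

```python
import re


def extract_tag(line, tag):
    """Return the text inside <tag>...</tag> on a single line, or None if absent."""
    pattern = r"<{0}>(.*)</{0}>\s*$".format(re.escape(tag))
    match = re.match(pattern, line)
    return match.group(1) if match else None


def split_titles_and_contents(lines):
    """Collect titles and contents from already-decoded lines.

    Assumes the title sits in <contenttitle> and the body in <content>,
    each on its own line (an assumption, see the note above).
    """
    titles, contents = [], []
    for line in lines:
        title = extract_tag(line, "contenttitle")
        if title is not None:
            titles.append(title)
            continue
        content = extract_tag(line, "content")
        if content is not None:
            contents.append(content)
    return titles, contents


# A made-up six-line record in the assumed layout.
sample = [
    "<doc>\n",
    "<url>http://example.com/1</url>\n",
    "<docno>abc</docno>\n",
    "<contenttitle>标题一</contenttitle>\n",
    "<content>正文一</content>\n",
    "</doc>\n",
]
titles, contents = split_titles_and_contents(sample)
```

In Python 3 the lines could come from `open("news_sohusite_xml.dat", encoding="gb18030")`, so no separate decode step would be needed before calling `split_titles_and_contents`.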
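Getting the decoding and encoding right took some time. chardet.detect reports the text encoding as gb2312, but decoding with that codec raises an error:
UnicodeDecodeError: 'gb2312' codec can't decode bytes: illegal multibyte sequence
The code above therefore decodes with gb18030, a superset of GB2312 that also covers the characters GB2312 lacks.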
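The failure is easy to reproduce on a small scale. The following Python 3 sketch (the original script is Python 2) tries gb2312 first and falls back to gb18030 when decoding fails. The helper name `decode_with_fallback` and the sample strings are made up for illustration.

```python
def decode_with_fallback(raw, encodings=("gb2312", "gb18030")):
    """Try each encoding in order; return (text, encoding_used)."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            # This codec cannot represent some byte sequence; try the next one.
            continue
    raise ValueError("none of the encodings could decode the data")


# "镕" is not part of GB2312, so these bytes cannot be decoded as gb2312.
raw_name = "朱镕基".encode("gb18030")
name, name_enc = decode_with_fallback(raw_name)

# Text that stays within GB2312 decodes with the first codec.
raw_word = "新闻".encode("gb2312")
word, word_enc = decode_with_fallback(raw_word)

print(name, name_enc)
print(word, word_enc)
```

Since gb18030 can decode everything gb2312 can, decoding with gb18030 directly, as the script does, is the simpler choice; the fallback loop is only there to show where gb2312 breaks.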
Original post: https://www.cnblogs.com/helloworld0604/p/9492682.html
Time: 2024-10-25 06:07:07