1. Word frequency statistics:
import jieba

# Read the whole novel and segment it into words with jieba
txt = open("threekingdoms3.txt", "r", encoding="utf-8").read()
words = jieba.lcut(txt)

# Count every word, skipping single-character tokens (punctuation, particles)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    else:
        counts[word] = counts.get(word, 0) + 1

# Sort by count, highest first, and print the top 15
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
The result is:
曹操 946
孔明 737
将军 622
玄德 585
却说 534
关公 509
荆州 413
二人 410
丞相 405
玄德曰 390
不可 387
孔明曰 374
张飞 358
如此 320
不能 318
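The counting and sorting steps above can also be written with `collections.Counter` from the standard library. The following is only a minimal sketch, not the original author's code: `top_words` and `sample_tokens` are hypothetical names, and the short hand-made token list stands in for the output of `jieba.lcut(txt)` so that the example runs without the novel file.

```python
from collections import Counter


def top_words(tokens, n=15, min_len=2):
    """Return the n most frequent tokens that have at least min_len characters."""
    counts = Counter(tok for tok in tokens if len(tok) >= min_len)
    # most_common() returns at most n pairs, sorted by count in descending order
    return counts.most_common(n)


# Hand-made stand-in for jieba.lcut(txt); single-character tokens are skipped
sample_tokens = ["曹操", "曰", "孔明", "曹操", "，", "将军", "孔明", "曹操"]

for word, count in top_words(sample_tokens, n=3):
    print("{0:<10}{1:>5}".format(word, count))
```

One practical difference: `most_common(n)` simply returns fewer items when the text has fewer than `n` distinct words, whereas indexing `items[i]` inside `range(15)` raises an `IndexError` in that case.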
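As a further improvement, I only want statistics on how often each character appears, so the different names of the same character are merged and only character names are printed. The code is as follows: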
import jieba

txt = open("threekingdoms3.txt", "r", encoding="utf-8").read()
# Characters whose counts should be reported
names = {'曹操', '孔明', '刘备', '关羽', '张飞', '吕布', '赵云', '孙权', '周瑜', '袁绍', '黄忠', '魏延'}
words = jieba.lcut(txt)
counts = {}
# Skip single-character tokens and merge aliases into one canonical name
for word in words:
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
# for word in excludes:
#     del counts[word]
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
# Only names that rank among the 40 most frequent words are printed
for i in range(40):
    word, count = items[i]
    if word in names:
        print("{0:<10}{1:>5}".format(word, count))
The output is:
曹操 1358
孔明 1265
刘备 1251
关羽 783
张飞 358
吕布 300
赵云 278
孙权 257
周瑜 217
袁绍 191
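The alias merging above can also be driven by a lookup table instead of an `if`/`elif` chain, which makes it easier to add more aliases later. This is a sketch under stated assumptions rather than the original method: `ALIASES` and `count_characters` are hypothetical names, the alias pairs are copied from the listing above, and the sample token list is made up for illustration. Unlike the original, it counts every name in `names` instead of only those that reach the top 40 words.

```python
from collections import Counter

# Alias -> canonical name, copied from the if/elif chain in the listing above
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}


def count_characters(tokens, names, aliases=ALIASES):
    """Count appearances of each character in names after merging aliases."""
    counts = Counter()
    for tok in tokens:
        name = aliases.get(tok, tok)  # fall back to the token itself
        if name in names:
            counts[name] += 1
    return counts.most_common()


# Made-up tokens standing in for jieba.lcut(txt)
sample = ["玄德", "刘备", "玄德曰", "丞相", "曹操", "云长", "却说"]
for name, count in count_characters(sample, {"刘备", "曹操", "关羽"}):
    print("{0:<10}{1:>5}".format(name, count))
```

Keep in mind that a title such as 丞相 is not used for 曹操 alone in the novel, so mapping it to him is a simplification carried over from the original listing and the resulting counts are approximate.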
Going one step further, generate a word cloud:
import jieba
import os
import wordcloud


def getText(file):
    # Read the file and segment it with jieba. WordCloud splits its input on
    # whitespace, so the segmented words are joined with spaces before returning.
    with open(file, 'r', encoding='UTF-8') as f:
        txt = f.read()
    return ' '.join(jieba.lcut(txt))


filename = input()  # file name without the .txt extension
txt = getText(filename + '.txt')
wordclouds = wordcloud.WordCloud(width=1000, height=800, margin=2).generate(txt)
wordclouds.to_file('{}.png'.format(filename))

# Open the generated image with the default viewer (this form works on Windows)
os.system('{}.png'.format(filename))
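The names can be tidied up further, for example by merging the aliases of the same character; see the code in the second part.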
By default the wordcloud library renders Chinese text as garbled characters; for a fix, see https://blog.csdn.net/Dick633/article/details/80261233
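The garbled output usually appears because the font that wordcloud uses by default has no Chinese glyphs; a common remedy is to pass a font that does through the `font_path` parameter of `WordCloud`. The sketch below assumes this is the cause in your setup. `CANDIDATE_FONTS`, `first_existing` and `make_cloud` are hypothetical names, and the font locations are guesses for typical systems that you will probably need to adjust.

```python
import os

# Assumed font locations (not from the original post); edit for your machine
CANDIDATE_FONTS = [
    "C:/Windows/Fonts/simhei.ttf",
    "/System/Library/Fonts/PingFang.ttc",
    "/usr/share/fonts/truetype/wqy/wqy-microhei.ttc",
]


def first_existing(paths):
    """Return the first path that points to an existing file, or None."""
    for path in paths:
        if os.path.isfile(path):
            return path
    return None


def make_cloud(segmented_text, out_png, font_path):
    """Render space-separated words to out_png using a font with Chinese glyphs."""
    import wordcloud  # third-party; imported here so the helper above needs only os

    cloud = wordcloud.WordCloud(font_path=font_path,
                                width=1000, height=800, margin=2)
    cloud.generate(segmented_text)
    cloud.to_file(out_png)


font = first_existing(CANDIDATE_FONTS)
print("Font with Chinese glyphs:", font)
```

If `font` is not `None`, calling `make_cloud(txt, 'threekingdoms3.png', font)` with the space-joined jieba output should produce readable Chinese words; if it is `None`, point `CANDIDATE_FONTS` at a Chinese font that is installed on your system.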
Reference: https://blog.csdn.net/weixin_44521703/article/details/93058003
Original article: https://www.cnblogs.com/116970u/p/11611821.html
Time: 2024-10-11 16:09:16