High-Frequency Word Counter

Contents:

1. Program

2. Nonsense words

1. Program

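# Walk through each line word by word, strip punctuation and stray symbols from every word, then count how often each word occurs
# Check whether a word is in the stop-word list

_noneSense_words=None    # cache, so the stop-word file is read only once

def none_sense_words_list_judeg(word1):
    # Return word1 if it is not a stop word, otherwise return None
    global _noneSense_words
    if _noneSense_words is None:
        with open("nonesense_inhanced_words.txt",'r') as file_noneSense_words:
            _noneSense_words=set(file_noneSense_words.read().split())
    if word1 not in _noneSense_words:
        return word1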

import string
def processLine(line,wcDict):
    line=line.strip()
    wordList=line.split()
    for word in wordList:
        if word !='--' and not word.isdigit():
            word=word.lower()
            word=word.strip()
            word=word.strip(string.punctuation)
            # Skip tokens that were nothing but punctuation, then drop stop words
            if word and word==none_sense_words_list_judeg(word):
                addWord(word,wcDict)

# Count the occurrences of each word
def addWord(w,wcDict):
    if w in wcDict:
        wcDict[w]+=1
    else:
        wcDict[w]=1

# Pretty-print the results
def prettyPrint(wcDict):
    '''
    >>> prettyPrint(wcDict)
    23
    '''

    valKeyList=[]
    for key,val in wcDict.items():
        if val>2 and len(key)>3:
            valKeyList.append((val,key))             # Note: val and key are swapped to make sorting easy, producing the new list valKeyList
    valKeyList.sort(reverse=True)               # sort(reverse=True) orders from highest count to lowest
    print('%-10s%10s'%('word','count'))
    print('-'*21)
    for val,key in valKeyList:
        print("%-12s     %3d"%(key,val))          # Swap back for output: key first, then val

def main():
    # Testing main(): the doctest can only be run below main()
    '''
    >>> main()
    3
    '''

    wcDict={}
    with open('article.txt','r') as fObj:
        for line in fObj:
            processLine(line,wcDict)

    prettyPrint(wcDict)

main()
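
For comparison, the same pipeline can be sketched more compactly with collections.Counter and a stop-word set. This is only a sketch under stated assumptions, not the program above: the helper names load_stop_words, count_words and frequent_words are hypothetical, the functions take lists of lines instead of file names so the example runs without article.txt, and the default thresholds mirror the val>2 and len(key)>3 filter in prettyPrint. Unlike prettyPrint, ties are broken alphabetically.

```python
import string
from collections import Counter


def load_stop_words(lines):
    """Collect every whitespace-separated token from the stop-word lines into a set."""
    stop_words = set()
    for line in lines:
        stop_words.update(line.split())
    return stop_words


def count_words(lines, stop_words):
    """Count words with the same filtering steps as processLine, using a Counter."""
    counts = Counter()
    for line in lines:
        for word in line.split():
            if word == '--' or word.isdigit():
                continue
            word = word.lower().strip(string.punctuation)
            # Skip empty leftovers (pure punctuation) and stop words.
            if word and word not in stop_words:
                counts[word] += 1
    return counts


def frequent_words(counts, min_count=3, min_length=4):
    """Return (word, count) pairs, highest count first, ties in alphabetical order."""
    kept = [(w, c) for w, c in counts.items()
            if c >= min_count and len(w) >= min_length]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical in-memory data standing in for article.txt and the stop-word file.
article = [
    "Hepatitis B virus: the hepatitis test.",
    "Hepatitis screening -- 2012 screening, SCREENING!",
    "A liver test; liver and LIVER.",
]
stop_words = load_stop_words(["the and a", "virus test"])
for word, count in frequent_words(count_words(article, stop_words)):
    print("%-12s     %3d" % (word, count))
```

Keeping the stop words in a set makes each membership test a constant-time lookup, and Counter replaces the explicit if/else in addWord.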

2. Nonsense words

asymptomatic
chronically
doses
high-risk
previous
definitions
developmentally
detected
possible
necessary
antigen
infections
birth
vaccinated
clinical
2012
difinitions
acute
negative
antibody
acute
symptoms
infants
health
levels
status
safety
results
populations
licensed
partners
partner
performed
recommends
given
following
determine
decline
treatment
immunization
facilities
liver
certain
high
exposure
chronic
person
persons
infection
vaccine
vaccines
type
reported
recommendations
occur
days
contacts
countries
appears
adult
adults
combination
normal
mother
mothers
incidence
hbv-infected
evaluation
disabled
unvaccinated
vaccination
second
remain
recent
rate
public
pregnant
long
time
test
site
women
case
cases
core
drug
users
services
sharing
known
injection
increased
household
response
protection
soon
signs
sexually
estimated
workers
infected
surface
generally
combined
born
long-term
positive
used
care
receive
infectious
immunity
including

children
recommended
indicates
immune
virus
body
weeks
blood
available
series
patients
risk
month
months
states
united
disease
years
testing
to
can
could
dare
do
did
does
may
might
would
should
must
will
ought
shall
need
is
a
am
are
about
according
after
against
all
almost
also
although
among
an
and
another
any
anything
approximately
as
asked
at
back
because
before
besides
between
both
but
by
call
called
currently
despite
did
do
dr
during
each
earlier
eight
even
eventually
every
everything
five
for
four
from
he
her
here
his
how
however
i
if
in
indeed
instead
it
its
just
last
like
major
many
may
maybe
meanwhile
more
moreover
most
mr
mrs
ms
much
my
neither
net
never
nevertheless
nine
no
none
not
nothing
now
of
on
once
one
only
or
other
our
over
partly
perhaps
prior
regarding
separately
seven
several
she
should
similarly
since
six
so
some
somehow
still
such
ten
that
the
their
then
there
therefore
these
they
this
those
though
three
to
two
under
unless
unlike
until
volume
we
what
whatever
whats
when
where
which
while
why
with
without
yesterday
yet
you
your
aboard
about
above
according to
across
afore
after
against
agin
along
alongside
amid
amidst
among
amongst
anent
around
as
aslant
astride
at
athwart
bar
because of
before
behind
below
beneath
beside
besides
between
betwixt
beyond
but
by
circa
despite
down
during
due to
ere
except
for
from
in
inside
into
less
like
mid
midst
minus
near
next
nigh
nigher
nighest
notwithstanding
of
off
on
on to
onto
out
out of
outside
over
past
pending
per
plus
qua
re
round
sans
save
since
through
throughout
thru
till
to
toward
towards
under
underneath
unlike
until
unto
up
upon
versus
via
vice
with
within
without
he
her
herself
hers
him
himself
his
I
it
its
itself
me
mine
my
myself
ours
she
their
theirs
them
themselves
they
us
we
our
ourselves
you
your
yours
yourselves
yourself
this
that
these
those
"

‘‘
(
)
*LRB*
*RRB*
<dquote>
<ldquo>
<lsquo>
<rdquo>
<rsquo>
@
&
[
]
`
``
e.g.,
{
}
&quot;
&ldquo;
&rdquo;
-RRB-
-LRB-
--
a
about
above
across
after
afterwards
again
against
all
almost
alone
along
already
also
although
always
am
among
amongst
amoungst
amount
an
and
another
any
anyhow
anyone
anything
anyway
anywhere
are
around
as
at
back
be
became
because
become
becomes
becoming
been
before
beforehand
behind
being
below
beside
besides
between
beyond
bill
both
bottom
but
by
call
can
cannot
cant
co
computer
con
could
couldnt
cry
de
describe
detail
do
done
down
due
during
each
eg
eight
either
eleven
else
elsewhere
empty
enough
etc
even
ever
every
everyone
everything
everywhere
except
few
fifteen
fify
fill
find
fire
first
five
for
former
formerly
forty
found
four
from
front
full
further
get
give
go
had
has
hasnt
have
he
hence
her
here
hereafter
hereby
herein
hereupon
hers
herself
him
himself
his
how
however
hundred
i
ie
if
in
inc
indeed
interest
into
is
it
its
itself
keep
last
latter
latterly
least
less
ltd
made
many
may
me
meanwhile
might
mill
mine
more
moreover
most
mostly
move
much
must
my
myself
name
namely
neither
never
nevertheless
next
nine
no
nobody
none
noone
nor
not
nothing
now
nowhere
of
off
often
on
once
one
only
onto
or
other
others
otherwise
our
ours
ourselves
out
over
own
part
per
perhaps
please
put
rather
re
same
see
seem
seemed
seeming
seems
serious
several
she
should
show
side
since
sincere
six
sixty
so
some
somehow
someone
something
sometime
sometimes
somewhere
still
such
system
take
ten
than
that
the
their
them
themselves
then
thence
there
thereafter
thereby
therefore
therein
thereupon
these
they
thick
thin
third
this
those
though
three
through
throughout
thru
thus
to
together
too
top
toward
towards
twelve
twenty
two
un
under
until
up
upon
us
very
via
was
we
well
were
what
whatever
when
whence
whenever
where
whereafter
whereas
whereby
wherein
whereupon
wherever
whether
which
while
whither
who
whoever
whole
whom
whose
why
will
with
within
without
would
yet
you
your
yours
yourself
yourselves
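
Because the program splits every line of this file on whitespace, a multi-word entry such as "according to" is stored as two separate tokens, and repeated entries such as "about" add nothing new. Entries that contain capitals or leading or trailing punctuation can never match, since each word is lowercased and stripped of punctuation before the lookup. The sketch below illustrates this on a small excerpt; the helper name summarize_stop_words and the excerpt itself are assumptions made for illustration.

```python
import string


def summarize_stop_words(lines):
    """Split stop-word lines on whitespace and report what the lookup actually sees."""
    tokens = []
    for line in lines:
        tokens.extend(line.split())
    unique = set(tokens)
    # Entries that differ from their lowercased, punctuation-stripped form
    # can never equal a word produced by processLine.
    unreachable = sorted(t for t in unique
                         if t != t.lower().strip(string.punctuation))
    return len(tokens), len(unique), unreachable


# Hypothetical excerpt of the list above, one entry per line.
excerpt = ["about", "according to", "to", "about", "I", "e.g.,", "--"]
total, distinct, unreachable = summarize_stop_words(excerpt)
print(total, distinct, unreachable)
```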

Time: 2024-10-07 13:15:32
