Contents:
1. Program
2. Nonsense words

1. Program
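# Walk each line, clean each word (strip punctuation and odd symbols), then count how often each word occurs.
# Check whether a word is in the stop-word list: return the word if it is not, otherwise None.
def none_sense_words_list_judeg(word1):
    list_noneSense_words = []
    # The with block closes the file once it has been read.
    with open("nonesense_inhanced_words.txt", 'r') as file_noneSense_words:
        for line in file_noneSense_words:
            line_list = line.split()
            for word in line_list:
                list_noneSense_words.append(word)
    if word1 not in list_noneSense_words:
        return word1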
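import string

# Clean every word on a line and count it unless it is a stop word.
def processLine(line, wcDict):
    line = line.strip()
    wordList = line.split()
    for word in wordList:
        if word != '--' and not word.isdigit():
            word = word.lower()
            word = word.strip()
            word = word.strip(string.punctuation)
            # Skip words that are empty after the punctuation is stripped.
            if word and word == none_sense_words_list_judeg(word):
                addWord(word, wcDict)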
# Count occurrences of each word
def addWord(w, wcDict):
    if w in wcDict:
        wcDict[w] += 1
    else:
        wcDict[w] = 1
# Pretty-print the result
def prettyPrint(wcDict):
    '''
    >>> prettyPrint(wcDict)
    23
    '''
    valKeyList = []
    for key, val in wcDict.items():
        if val > 2 and len(key) > 3:
            valKeyList.append((val, key))  # Note: val and key are swapped to make sorting by count easy; this builds the new list valKeyList
    valKeyList.sort(reverse=True)  # sort(reverse=True) orders by count, highest first
    print('%-10s%10s' % ('word', 'count'))
    print('-' * 21)
    for val, key in valKeyList:
        print("%-12s %3d" % (key, val))  # Swap back when printing: key first, then val
def main():
    # Test for main(); doctest can only be used below main()
    '''
    >>> main()
    3
    '''
    wcDict = {}
    # The with block closes the article file after reading.
    with open('article.txt', 'r') as fObj:
        for line in fObj:
            processLine(line, wcDict)
    prettyPrint(wcDict)

main()
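
The program above re-reads the stop-word file for every single word. As a minimal sketch of the same pipeline (not the author's code), the version below takes the stop words as a set that is built once and counts with `collections.Counter`. The names `count_words` and `top_words` and the parameters `min_count` and `min_len` are hypothetical; their defaults are chosen to mirror the `val>2 and len(key)>3` filter in `prettyPrint`.

```python
import string
from collections import Counter


def count_words(lines, stop_words):
    """Count words in an iterable of lines, skipping stop words.

    Mirrors processLine: skip '--' and pure digits, lowercase,
    then strip surrounding punctuation.
    """
    counts = Counter()
    for line in lines:
        for word in line.split():
            if word == '--' or word.isdigit():
                continue
            word = word.lower().strip(string.punctuation)
            # Ignore words that are empty after stripping.
            if word and word not in stop_words:
                counts[word] += 1
    return counts


def top_words(counts, min_count=3, min_len=4):
    """Return (word, count) pairs sorted by count, highest first."""
    kept = [(w, c) for w, c in counts.items()
            if c >= min_count and len(w) >= min_len]
    # Assumption: ties are broken alphabetically, which differs from
    # the reverse tuple sort used in prettyPrint.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


if __name__ == '__main__':
    sample = ["Hepatitis virus, the hepatitis VIRUS -- and hepatitis."]
    stop = {"the", "and", "virus"}
    for word, count in top_words(count_words(sample, stop), min_count=1):
        print("%-12s %3d" % (word, count))
```

Membership tests on a set do not scan the whole collection the way a list does, so the cost per word stays roughly constant however long the stop-word file becomes.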
2. Nonsense words
asymptomatic
chronically
doses
high-risk
previous
definitions
developmentally
detected
possible
necessary
antigen
infections
birth
vaccinated
clinical
2012
difinitions
acute
negative
antibody
acute
symptoms
infants
health
levels
status
safety
results
populations
licensed
partners
partner
performed
recommends
given
following
determine
decline
treatment
immunization
facilities
liver
certain
high
exposure
chronic
person
persons
infection
vaccine
vaccines
type
reported
recommendations
occur
days
contacts
countries
appears
adult
adults
combination
normal
mother
mothers
incidence
hbv-infected
evaluation
disabled
unvaccinated
vaccination
second
remain
recent
rate
public
pregnant
long
time
test
site
women
case
cases
core
drug
users
services
sharing
known
injection
increased
household
response
protection
soon
signs
sexually
estimated
workers
infected
surface
generally
combined
born
long-term
positive
used
care
receive
infectious
immunity
including
children
recommended
indicates
immune
virus
body
weeks
blood
available
series
patients
risk
month
months
states
united
disease
years
testing
to
can
could
dare
do
did
does
may
might
would
should
must
will
ought
shall
need
is
a
am
are
about
according
after
against
all
almost
also
although
among
an
and
another
any
anything
approximately
as
asked
at
back
because
before
besides
between
both
but
by
call
called
currently
despite
did
do
dr
during
each
earlier
eight
even
eventually
every
everything
five
for
four
from
he
her
here
his
how
however
i
if
in
indeed
instead
it
its
just
last
like
major
many
may
maybe
meanwhile
more
moreover
most
mr
mrs
ms
much
my
neither
net
never
nevertheless
nine
no
none
not
nothing
now
of
on
once
one
only
or
other
our
over
partly
perhaps
prior
regarding
separately
seven
several
she
should
similarly
since
six
so
some
somehow
still
such
ten
that
the
their
then
there
therefore
these
they
this
those
though
three
to
two
under
unless
unlike
until
volume
we
what
whatever
whats
when
where
which
while
why
with
without
yesterday
yet
you
your
aboard
about
above
according to
across
afore
after
against
agin
along
alongside
amid
amidst
among
amongst
anent
around
as
aslant
astride
at
athwart
bar
because of
before
behind
below
beneath
beside
besides
between
betwixt
beyond
but
by
circa
despite
down
during
due to
ere
except
for
from
in
inside
into
less
like
mid
midst
minus
near
next
nigh
nigher
nighest
notwithstanding
of
off
on
on to
onto
out
out of
outside
over
past
pending
per
plus
qua
re
round
sans
save
since
through
throughout
thru
till
to
toward
towards
under
underneath
unlike
until
unto
up
upon
versus
via
vice
with
within
without
he
her
herself
hers
him
himself
his
I
it
its
itself
me
mine
my
myself
ours
she
their
theirs
them
themselves
they
us
we
our
ourselves
you
your
yours
yourselves
yourself
this
that
these
those
"
'
''
(
)
*LRB*
*RRB*
<dquote>
<ldquo>
<lsquo>
<rdquo>
<rsquo>
@
&
[
]
`
``
e.g.,
{
}
"
“
”
-RRB-
-LRB-
--
a
about
above
across
after
afterwards
again
against
all
almost
alone
along
already
also
although
always
am
among
amongst
amoungst
amount
an
and
another
any
anyhow
anyone
anything
anyway
anywhere
are
around
as
at
back
be
became
because
become
becomes
becoming
been
before
beforehand
behind
being
below
beside
besides
between
beyond
bill
both
bottom
but
by
call
can
cannot
cant
co
computer
con
could
couldnt
cry
de
describe
detail
do
done
down
due
during
each
eg
eight
either
eleven
else
elsewhere
empty
enough
etc
even
ever
every
everyone
everything
everywhere
except
few
fifteen
fify
fill
find
fire
first
five
for
former
formerly
forty
found
four
from
front
full
further
get
give
go
had
has
hasnt
have
he
hence
her
here
hereafter
hereby
herein
hereupon
hers
herself
him
himself
his
how
however
hundred
i
ie
if
in
inc
indeed
interest
into
is
it
its
itself
keep
last
latter
latterly
least
less
ltd
made
many
may
me
meanwhile
might
mill
mine
more
moreover
most
mostly
move
much
must
my
myself
name
namely
neither
never
nevertheless
next
nine
no
nobody
none
noone
nor
not
nothing
now
nowhere
of
off
often
on
once
one
only
onto
or
other
others
otherwise
our
ours
ourselves
out
over
own
part
per
perhaps
please
put
rather
re
same
see
seem
seemed
seeming
seems
serious
several
she
should
show
side
since
sincere
six
sixty
so
some
somehow
someone
something
sometime
sometimes
somewhere
still
such
system
take
ten
than
that
the
their
them
themselves
then
thence
there
thereafter
thereby
therefore
therein
thereupon
these
they
thick
thin
third
this
those
though
three
through
throughout
thru
thus
to
together
too
top
toward
towards
twelve
twenty
two
un
under
until
up
upon
us
very
via
was
we
well
were
what
whatever
when
whence
whenever
where
whereafter
whereas
whereby
wherein
whereupon
wherever
whether
which
while
whither
who
whoever
whole
whom
whose
why
will
with
within
without
would
yet
you
your
yours
yourself
yourselves
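
The list above repeats many entries (for example `about` and `to`) and includes a few multi-word entries such as `according to`. Because the program splits every line on whitespace, a multi-word entry is stored as separate words. The sketch below shows one way to load such a file into a set so that duplicates collapse. The helper `load_stop_words` is hypothetical, and lowercasing each entry is an added assumption that the original loader does not make.

```python
def load_stop_words(lines):
    """Build a lowercase set of stop words from an iterable of lines.

    Each line is split on whitespace, so 'according to' contributes
    'according' and 'to'. Duplicates collapse because a set is used.
    """
    stop_words = set()
    for line in lines:
        for word in line.split():
            stop_words.add(word.lower())
    return stop_words


if __name__ == '__main__':
    sample = ["about", "according to", "I", "about", "--"]
    print(sorted(load_stop_words(sample)))
```

An open file object can be passed directly as `lines`, for example inside `with open("nonesense_inhanced_words.txt") as f:`.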