Here we use a 2-gram model to extract rough topic sentences from an English speech. The speech is available at: http://pythonscraping.com/files/inaugurationSpeech.txt
An n-gram is simply a sequence of n consecutive words.
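For example, a minimal sketch (illustrative only, separate from the extraction code below) of how 2-grams are built by sliding a window of two words over a token list:

    # Minimal illustration: building 2-grams from a token list
    words = 'the quick brown fox'.split()
    bigrams = [' '.join(words[i:i+2]) for i in range(len(words) - 1)]
    print bigrams   # ['the quick', 'quick brown', 'brown fox']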
The full code is given below (based on Python 2.7); for details see 《Python网络数据采集》 (Web Scraping with Python).
#-*- coding:utf-8 -*-
from urllib2 import urlopen
import re
import string
import operator

# Filter out 2-grams that contain common (stop) words
def isCommon(ngram):
    ngrams = ngram.split(' ')
    # Common words that carry no topical meaning
    commonWords = ['the', 'be', 'and', 'of', 'a', 'in', 'to', 'have', 'it',
                   'i', 'for', 'you', 'he', 'with', 'on', 'do', 'say', 'this',
                   'they', 'is', 'an', 'at', 'but', 'we', 'his', 'from', 'that',
                   'not', 'by', 'she', 'or', 'what', 'go', 'their', 'can', 'who',
                   'get', 'if', 'would', 'her', 'all', 'my', 'make', 'about',
                   'know', 'will', 'as', 'up', 'one', 'time', 'has', 'been',
                   'there', 'year', 'so', 'think', 'when', 'which', 'them',
                   'some', 'me', 'people', 'take', 'out', 'into', 'just', 'see',
                   'him', 'your', 'come', 'could', 'now', 'than', 'like',
                   'other', 'how', 'then', 'its', 'our', 'two', 'more', 'these',
                   'want', 'way', 'look', 'first', 'also', 'new', 'because',
                   'day', 'use', 'no', 'man', 'find', 'here', 'thing', 'give',
                   'many', 'well']
    # Reject the 2-gram if any of its words is a common word
    for word in ngrams:
        if word.lower() in commonWords:
            return False
    return True

# Clean the raw text and split it into a list of words
def cleanInput(input):
    # Collapse runs of newlines and spaces into single spaces
    input = re.sub('\n+', ' ', input)
    input = re.sub('\[[0-9]*\]', '', input)   # drop citation markers such as [1]
    input = re.sub(' +', ' ', input)
    # Drop any non-ASCII characters, keeping the text as unicode
    input = input.encode('ascii', 'ignore').decode('ascii')
    cleanInput = []
    input = input.split(' ')
    for item in input:
        # strip() with string.punctuation removes leading/trailing symbols:
        # !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
        item = item.strip(string.punctuation)
        # Keep words longer than one character, plus the words 'a' and 'i'
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            cleanInput.append(item)
    return cleanInput

# input is the full text as one string; n is the n-gram size
def ngrams(input, n):
    input = cleanInput(input)
    # Dictionary mapping each n-gram to its frequency
    output = {}
    # Note: the upper bound is len(input)-n+1, not len(input)-n-1,
    # so the final n-grams of the text are counted as well
    for i in range(len(input)-n+1):
        ngramTemp = ' '.join(input[i:i+n])
        if isCommon(ngramTemp):
            if ngramTemp not in output:
                output[ngramTemp] = 0
            output[ngramTemp] += 1
    return output

content = urlopen('http://pythonscraping.com/files/inaugurationSpeech.txt').read().decode('utf-8')
ngramCounts = ngrams(content, 2)
# key=operator.itemgetter(0) would sort by the dictionary key (the 2-gram itself);
# key=operator.itemgetter(1) sorts by the value (the frequency);
# reverse=True sorts in descending order
sortedNGrams = sorted(ngramCounts.items(), key=operator.itemgetter(1), reverse=True)
# Print the meaningful 2-grams together with their frequencies
print sortedNGrams

# Collect the 2-grams that occur more than twice
keywords = []
for i in range(0, len(sortedNGrams)):
    word = sortedNGrams[i]
    if int(word[1]) > 2:
        keywords.append(word[0])

# A set holding every sentence of the speech
sentences = set()
# A set holding the extracted topic sentences
main_sentences = set()
for j in content.split('.'):
    sentences.add(j)
for keyword in keywords:
    for sentence in sentences:
        # Take a sentence that contains the keyword (sets are unordered,
        # so this is not necessarily the first occurrence in the text)
        if sentence.find(keyword) != -1:
            # Remove newlines and extra spaces inside the sentence
            sentence = re.sub(' +', ' ', sentence)
            sentence = re.sub('\n+', '', sentence)
            main_sentences.add(sentence)
            break
for i in main_sentences:
    print i
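As a design note, the manual dictionary counting and operator.itemgetter sort above could also be done with collections.Counter from the standard library. This is not the book's version, just an alternative sketch; it assumes the cleanInput and isCommon functions and the content variable defined above:

    from collections import Counter

    def ngramsCounter(input, n):
        words = cleanInput(input)          # reuses the helper defined above
        grams = (' '.join(words[i:i+n]) for i in range(len(words) - n + 1))
        return Counter(g for g in grams if isCommon(g))

    # most_common() already returns (ngram, count) pairs sorted by count
    print ngramsCounter(content, 2).most_common(10)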
The 2-grams obtained (occurring more than twice) are:
[u'United States', u'General Government', u'executive department', u'legislative body', u'Mr Jefferson', u'Chief Magistrate', u'called upon', u'same causes', u'whole country', u'Government should']
The output contains quite a few sentences, so they are not listed here; this is only a rough first pass at the speech.