I. Accessing Text Corpora
A text corpus is a large body of text. It usually contains many individual texts, but for convenience of processing we concatenate them end to end and treat them as a single text.
1. The Gutenberg Corpus
NLTK includes a small selection of texts from the Project Gutenberg electronic text archive. To use this corpus, load the nltk package in the Python interpreter and then try nltk.corpus.gutenberg.fileids(). For example:
>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt',
 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt',
 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt',
 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']
>>>
The output shows which texts from this corpus are included in NLTK. We can work with any of them.
1) Counting words. For example:
>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
>>> len(emma)
192427
>>>
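As a small follow-on sketch (not part of the original session), we can also count how many distinct words Emma contains and how often each one is used on average; the variable names here are just illustrative:

import nltk

emma = nltk.corpus.gutenberg.words('austen-emma.txt')

# Number of distinct word types, ignoring case.
vocab_size = len(set(w.lower() for w in emma))

# Average number of times each distinct word is used.
avg_uses = len(emma) / vocab_size

print(vocab_size, round(avg_uses, 1))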
2) Building a concordance. For example:
>>> emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
>>> emma.concordance("surprise")
Displaying 1 of 1 matches:
that Emma could not but feel some surprise , and a little displeasure , on he
>>>
3) Accessing the raw text, words, and sentences of each file. For example:
>>> from nltk.corpus import gutenberg
>>> for fileid in gutenberg.fileids():
...     raw = gutenberg.raw(fileid)
...     num_chars = len(raw)
...     words = gutenberg.words(fileid)
...     num_words = len(words)
...     sents = gutenberg.sents(fileid)
...     num_sents = len(sents)
...     vocab = set([w.lower() for w in gutenberg.words(fileid)])
...     num_vocab = len(vocab)
...     print("%d %d %d %s" % (num_chars, num_words, num_sents, fileid))
...
887071 192427 7752 austen-emma.txt
466292 98171 3747 austen-persuasion.txt
673022 141576 4999 austen-sense.txt
4332554 1010654 30103 bible-kjv.txt
38153 8354 438 blake-poems.txt
249439 55563 2863 bryant-stories.txt
84663 18963 1054 burgess-busterbrown.txt
144395 34110 1703 carroll-alice.txt
457450 96996 4779 chesterton-ball.txt
406629 86063 3806 chesterton-brown.txt
320525 69213 3742 chesterton-thursday.txt
935158 210663 10230 edgeworth-parents.txt
1242990 260819 10059 melville-moby_dick.txt
468220 96825 1851 milton-paradise.txt
112310 25833 2163 shakespeare-caesar.txt
162881 37360 3106 shakespeare-hamlet.txt
100351 23140 1907 shakespeare-macbeth.txt
711215 154883 4250 whitman-leaves.txt
>>> raw[:1000]
"[Leaves of Grass by Walt Whitman 1855]\n\n\nCome, said my soul,\nSuch verses for
 my Body let us write, (for we are one,)\nThat should I after return,\nOr, long,
 long hence, in other spheres,\nThere to some group of mates the chants resuming,
\n(Tallying Earth's soil, trees, winds, tumultuous waves,)\nEver with pleas'd
 smile I may keep on,\nEver and ever yet the verses owning--as, first, I here and
 now\nSigning for Soul and Body, set to them my name,\n\nWalt Whitman\n\n\n\n[BOOK
 I. INSCRIPTIONS]\n\n} One's-Self I Sing\n\nOne's-self I sing, a simple separate
 person,\nYet utter the word Democratic, the word En-Masse.\n\nOf physiology
 from top to toe I sing,\nNot physiognomy alone nor brain alone is worthy for the
 Muse, I say\n the Form complete is worthier far,\nThe Female equally with the
 Male I sing.\n\nOf Life immense in passion, pulse, and power,\nCheerful, for
 freest action form'd under the laws divine,\nThe Modern Man I sing.\n\n\n\n} As
 I Ponder'd in Silence\n\nAs I ponder'd in silence,\nReturning upon my poems, c"
>>> words
['[', 'Leaves', 'of', 'Grass', 'by', 'Walt', 'Whitman', ...]
>>> sents
[['[', 'Leaves', 'of', 'Grass', 'by', 'Walt', 'Whitman', '1855', ']'],
 ['Come', ',', 'said', 'my', 'soul', ',', 'Such', 'verses', 'for', 'my', 'Body',
  'let', 'us', 'write', ',', '(', 'for', 'we', 'are', 'one', ',)', 'That',
  'should', 'I', 'after', 'return', ',', 'Or', ',', 'long', ',', 'long', 'hence',
  ',', 'in', 'other', 'spheres', ',', 'There', 'to', 'some', 'group', 'of',
  'mates', 'the', 'chants', 'resuming', ',', '(', 'Tallying', 'Earth', "'", 's',
  'soil', ',', 'trees', ',', 'winds', ',', 'tumultuous', 'waves', ',)', 'Ever',
  'with', 'pleas', "'", 'd', 'smile', 'I', 'may', 'keep', 'on', ',', 'Ever',
  'and', 'ever', 'yet', 'the', 'verses', 'owning', '--', 'as', ',', 'first', ',',
  'I', 'here', 'and', 'now', 'Signing', 'for', 'Soul', 'and', 'Body', ',', 'set',
  'to', 'them', 'my', 'name', ','], ...]
>>>
Here raw() gives the raw contents of the file (so num_chars counts characters, including whitespace), words() gives its word tokens, and sents() gives its sentences; each sentence is itself stored as a list of words. Besides words(), raw(), and sents(), most NLTK corpus readers provide a variety of other access methods.
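As a follow-up to the loop above, a minimal sketch that turns the raw counts into per-file ratios: average word length, average sentence length, and the average number of times each vocabulary item appears (the choice of scores is our own):

from nltk.corpus import gutenberg

for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
    # average word length, average sentence length, average uses per vocabulary item
    print(round(num_chars / num_words), round(num_words / num_sents),
          round(num_words / num_vocab), fileid)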
2. Web and Chat Text
Project Gutenberg contains thousands of books, which are relatively formal and represent established literature. Beyond these, NLTK also includes a small collection of web text, covering content such as a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Caribbean, personal advertisements, and wine reviews. These texts can be accessed as follows:
>>> from nltk.corpus import webtext
>>> for fileid in webtext.fileids():
...     print("%s %s ..." % (fileid, webtext.raw(fileid)[:65]))
...
firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se ...
grail.txt SCENE 1: [wind] [clop clop clop]
KING ARTHUR: Whoa there! [clop ...
overheard.txt White guy: So, do you have any plans for this evening?
Asian girl ...
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr ...
singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun ...
wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb ...
>>>
3. The Instant Messaging Chat Corpus
This corpus was originally collected by the Naval Postgraduate School for research on the automatic detection of Internet predators. It contains over 10,000 posts, split into 15 files, where each file holds several hundred posts collected from a chat room for a particular age group on a particular date. The filename encodes the date, the chat room, and the number of posts. A usage sketch follows.
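The original post omits the example, so the following is a minimal sketch using the nltk.corpus.nps_chat reader; the fileid '10-19-20s_706posts.xml' names a file of 706 posts collected from the 20s chat room on 10/19/2006:

from nltk.corpus import nps_chat

# Each file is named <date>-<age group>_<number of posts>posts.xml.
chatroom = nps_chat.posts('10-19-20s_706posts.xml')

# Each post is stored as a list of word tokens.
print(chatroom[123])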
4. The Brown Corpus
The Brown Corpus was the first million-word electronic corpus of English. It contains text from 500 sources, categorized by genre, such as news and editorial. It is mainly used to study systematic differences between genres (a kind of linguistic inquiry known as stylistics). We can access the corpus as a list of words or as a list of sentences.
1) Reading by a particular category or file
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies',
 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance',
 'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.words(fileids=['cg22'])
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
>>> brown.sents(categories=['news', 'editorial', 'reviews'])
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an',
  'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election',
  'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities',
  'took', 'place', '.'],
 ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that',
  'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all',
  'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise',
  'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the',
  'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]
>>>
2) Comparing the use of modal verbs across genres
>>> from nltk.corpus import brown
>>> news_text = brown.words(categories='news')
>>> fdist = nltk.FreqDist([w.lower() for w in news_text])
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
...     print("%s:%d" % (m, fdist[m]))
...
can:94
could:87
may:93
might:38
must:53
will:389
>>>
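To extend this comparison to several genres at once, a sketch using nltk.ConditionalFreqDist, which counts (genre, word) pairs and can tabulate the modal counts per genre (the particular genres chosen below are just an illustration):

import nltk
from nltk.corpus import brown

# Pair every word with the genre it occurs in, then count per genre.
cfd = nltk.ConditionalFreqDist(
    (genre, word.lower())
    for genre in brown.categories()
    for word in brown.words(categories=genre)
)

genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']

# One row per genre, one column per modal verb.
cfd.tabulate(conditions=genres, samples=modals)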
5. The Reuters Corpus